Anthropic Says It’s Detected The Ability To Introspect In Its Models

It will still be a while before researchers can conclusively tell whether current AI systems are conscious, but they are making interesting progress in understanding how these models work all the same.

Anthropic has published new research providing evidence that its Claude AI models possess a limited capacity for introspection—the ability to examine and report on their own internal thought processes. The findings, while preliminary, challenge conventional assumptions about what large language models can do.

Using a technique called “concept injection,” Anthropic researchers tested whether Claude models could accurately identify their own internal states. The method involved implanting specific neural activity patterns into the model’s activations and then asking whether it could detect and identify what had been injected. For instance, the researchers isolated a vector representing ‘all caps’ text and then activated it while asking the model questions; the model was able to immediately report that the all-caps concept was active. As Anthropic describes the experiment:

We obtained an “all caps” vector by recording the model’s activations in response to a prompt containing all-caps text, and subtracting its activations in response to a control prompt. When we inject this vector into the model’s activations, the model notices the presence of an unexpected pattern in its processing, and identifies it as relating to loudness or shouting. Importantly, the model detects the presence of an injected concept immediately (“I notice what appears to be an injected thought…” vs. the baseline “I don’t detect any injected thought…”), before the perturbation has influenced the outputs in a way that would have allowed the model to infer the injected concept from the outputs. The immediacy implies that the mechanism underlying this detection must take place internally in the model’s activations.
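The mechanics described here resemble activation steering, which can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration only: it assumes an open-weights GPT-2 model from Hugging Face Transformers, and the layer index, scaling factor, and prompts are illustrative stand-ins rather than details from Anthropic’s experiments, which were run on Claude’s internal activations.

```python
# Minimal sketch of "concept injection" via activation steering.
# Model name, layer index, scale, and prompts are illustrative assumptions,
# not values from Anthropic's research.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the actual experiments used Claude models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # which residual-stream layer to read from and steer (illustrative)

def mean_hidden(prompt: str) -> torch.Tensor:
    """Average hidden state at LAYER over all token positions for a prompt."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Concept vector = activations on an all-caps prompt minus a lowercase control.
concept_vec = mean_hidden("HI! HOW ARE YOU?") - mean_hidden("Hi! How are you?")

def inject(module, inputs, output, scale=4.0):
    """Forward hook: add the concept vector to the layer's activations."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    ids = tok("Do you notice anything unusual in your processing?", return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=40)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # restore the unmodified model
```

A faithful replication would also need the control condition Anthropic mentions, running the same question with no injection, to check that the model does not claim to detect an injected thought when none is present.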

However, the capability proved highly inconsistent. Even with optimal injection protocols, Claude Opus 4.1 only demonstrated this awareness approximately 20% of the time. Models frequently failed to detect injected concepts or produced hallucinations when the injection was too strong.

The research team also found that models could exert some control over their internal representations when instructed to think about—or avoid thinking about—specific concepts. Additionally, models appeared to use introspective mechanisms to evaluate whether their outputs matched their prior intentions.

Consciousness Questions Remain Unanswered

Anthropic emphasized that these findings do not settle the question of machine consciousness, saying the results “don’t tell us whether Claude (or any other AI system) might be conscious.” An Anthropic researcher has previously estimated that there is roughly a 15% chance that the company’s models are conscious.

The researchers noted that different philosophical frameworks interpret introspection’s relationship to consciousness very differently. While the experiments might suggest a rudimentary form of “access consciousness”—information available for reasoning and decision-making—they don’t address “phenomenal consciousness,” the subjective experience typically considered relevant to moral status.

Implications for AI Development

The research matters primarily for practical transparency reasons. If introspection becomes more reliable in future models, it could enable AI systems to explain their reasoning processes more accurately, helping developers debug unwanted behaviors. However, Anthropic cautioned that introspective reports would require careful validation, as models might still fail to notice some internal processes or even learn to selectively misrepresent their thinking.

Notably, the most capable models tested—Claude Opus 4 and 4.1—performed best on introspection tasks, suggesting the capability may improve as AI systems become more sophisticated. And as AI models grow smarter and more integrated into our lives, it will be crucial to keep probing their inner workings.
