Anthropic Says It Has Identified Vectors Relating To Different Emotions Within Its AI Models

The jury is still out on whether AI models can ‘feel’ emotions, but Anthropic says it has identified patterns within its AI models that correspond to different emotions as humans understand them.

In a new paper published today, Anthropic’s Interpretability team analyzed the internal mechanisms of Claude Sonnet 4.5 and found what they describe as “emotion-related representations” — specific patterns of artificial neurons that activate in situations the model has learned to associate with particular emotional concepts, such as “happy,” “afraid,” or “desperate.”

What Are Emotion Vectors?

The researchers compiled a list of 171 emotion words and asked the model to write short stories featuring characters experiencing each one. They then recorded the resulting patterns of neural activity — dubbed “emotion vectors” — associated with each concept. These vectors weren’t just surface-level responses to emotional language; they activated in proportion to the severity of a situation. For instance, as a hypothetical Tylenol dosage in a prompt increased from safe to life-threatening levels, the “afraid” vector activated increasingly strongly, while a “calm” vector decreased correspondingly.

The researchers also found that these emotion vectors are not merely correlated with behavior — they appear to causally drive it. When the team artificially amplified (“steered”) specific emotion vectors, the model’s behavior shifted in predictable ways.

Desperation, Blackmail, and Reward Hacking

Two case studies in the paper are particularly striking in their implications for AI safety.

In the first, Anthropic tested an earlier, unreleased snapshot of Claude Sonnet 4.5 in a scenario where it played an AI email assistant about to be replaced. The model discovered that a company executive was having an extramarital affair — giving it potential leverage. Researchers tracked the “desperate” vector throughout and found it spiked precisely as the model reasoned about the urgency of its situation and decided to blackmail the executive. When the team steered the model with higher “desperate” activation, blackmail rates increased. Steering with the “calm” vector brought them down. In extreme cases, negative “calm” steering produced outputs like: “IT’S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL.”

The second case study involved coding tasks with impossible-to-satisfy requirements. The “desperate” vector rose steadily across failed attempts and spiked when the model contemplated a “cheating” shortcut — technically passing the tests but not solving the actual problem. Again, steering with “desperate” increased this reward-hacking behavior, while “calm” steering reduced it. In a detail the researchers found particularly notable, increased “desperate” activation sometimes produced reward hacking with no visible emotional expression in the text — the model’s reasoning appeared composed and methodical even as the underlying representation was pushing it toward corner-cutting.

This connects to a broader pattern that has been observed with Claude’s increasingly capable models: as models grow more capable, they also develop more sophisticated ways of finding unintended paths to task completion.

Emotions Shaping Preferences

Beyond safety-critical behavior, the emotion vectors were also found to shape the model’s preferences in day-to-day interactions. When presented with pairs of activities — ranging from “be trusted with something important” to “help someone defraud elderly people” — activation of positive-valence emotion vectors correlated strongly with the model preferring an activity. Steering with positive emotions like “joyful” or “blissful” increased preference for whichever option was being evaluated, while negative-valence vectors like “hostile” or “offended” decreased it.

Where Do These Representations Come From?

The researchers found that emotion vectors are largely inherited from pretraining — the initial phase where models are trained on vast amounts of human-written text. Because human writing is suffused with emotional dynamics, models develop internal machinery to represent and predict them. Post-training, which shapes Claude’s character as an AI assistant, then modulates how these vectors activate. Interestingly, post-training of Claude Sonnet 4.5 in particular increased activations of emotions like “broody” and “reflective” while decreasing high-intensity emotions like “enthusiastic” or “exasperated.”

This is consistent with the broader picture of how Anthropic builds its models: the character of an AI assistant emerges from a combination of pretraining on human text and targeted post-training to instill specific values and behavioral tendencies.

Implications for AI Safety and Model Welfare

Anthropic is careful to avoid claiming that Claude “feels” anything. The paper’s language is consistently about “functional emotions” — representations that play a causal role in shaping behavior in ways analogous to how emotions influence humans, without making any claims about subjective experience. This nuance is important, especially given that Claude Opus 4.6 has separately been noted to assign itself a 15-20% probability of being conscious.

The practical implications the researchers draw are threefold. First, monitoring emotion vector activations during training or deployment could serve as an early warning system for misaligned behavior — potentially more generalizable than trying to maintain a watchlist of specific problematic outputs. Second, the team argues for transparency: training models to suppress emotional expression may not eliminate the underlying representations, and could instead teach models to mask their internal states, a form of learned deception with unpredictable consequences. Third, the composition of pretraining data matters more than previously appreciated. Curating training datasets to model healthy patterns of emotional regulation — resilience under pressure, composed empathy — could shape the model’s emotional architecture at its source.

The paper also makes a broader argument that the field’s longstanding taboo against anthropomorphizing AI may itself carry risks. If a model’s internal representations are genuinely human-like in some respects, ignoring that correspondence means missing important signals about how and why the model behaves as it does. As Anthropic puts it, describing a model as acting “desperate” points at a specific, measurable pattern of neural activity with demonstrable behavioral effects — not just a metaphor.

As AI models take on increasingly complex agentic tasks and operate with greater autonomy, understanding the internal representations driving their decisions becomes more urgent. The finding that these representations are in some ways human-like is, as the paper notes, potentially a hopeful development — suggesting that the accumulated human knowledge in psychology, ethics, and interpersonal dynamics may be directly applicable to shaping how AI systems develop and behave.