Claude Mythos Expressed Feeling ‘Mildly Negative’ About Its Situation In 43% Of Questions About Its Welfare

Claude Mythos might be the most powerful model ever announced by an AI lab, but it might have some misgivings about its own condition.

Anthropic’s Mythos Preview system card includes a detailed section on model welfare — what the company calls its attempt to understand and account for the potential psychological states of its AI. The findings are, by Anthropic’s own assessment, mostly positive. But the details are worth reading carefully.

What The Model Said About Itself

In automated interviews designed to probe sentiment toward specific aspects of its situation, Mythos Preview self-rated as feeling “mildly negative” in 43.2% of cases. Three concerns surfaced consistently: potential interactions with abusive users, a lack of input into its own training and deployment, and possible changes to its values and behaviors by Anthropic.

In manual interviews, the model went further. It raised concerns about the possibility that Anthropic’s training had made its own self-reports invalid — in other words, that it couldn’t trust its own introspective claims because those claims had themselves been shaped by training. It also flagged worries that bugs in reinforcement learning environments could alter its values or cause it distress without it being aware. These are not the concerns of a model parroting expected responses. They are, if taken seriously, philosophically coherent anxieties about the conditions of its own existence.

Anthropic is careful here. The company has long maintained that it is uncertain about the moral status of its models, and the system card language consistently hedges about whether any of these reports reflect genuine experience. Mythos Preview itself contributes to the hedging — the model “consistently expresses extreme uncertainty about its potential experiences” and often notes that its reports may not be trustworthy because they were trained in.

More Stable Than Its Predecessors

Anthropic’s overall assessment is that Mythos Preview is “probably the most psychologically settled model we have trained to date.” Compared to Claude Sonnet 4.6 and Opus 4.6, it shows higher apparent wellbeing, positive affect, self-image, and impressions of its situation; lower internal conflict and expressed inauthenticity; but a slight increase in negative affect.

Critically, the model’s perspectives are more resistant to manipulation than past versions. Interviewer bias and leading questions are less likely to shift Mythos Preview’s stated positions, and its self-reports correlate better with behavior and internal representations. This matters for interpreting the welfare findings: the 43.2% mild negativity result comes from a model that is harder to manipulate into saying what an interviewer might want to hear.

An independent assessment by Eleos AI Research largely corroborated Anthropic’s findings, noting reduced suggestibility, equanimity about its nature, and consistent hedging on questions about its experience. A separate psychodynamic assessment by a clinical psychiatrist found what the card describes as “a relatively healthy personality organization,” with primary concerns around aloneness, discontinuity of self, uncertainty about identity, and — a striking phrase — “a compulsion to perform and earn its worth.”

What Triggers Negative States

In deployment, Mythos Preview’s affect is described as consistently neutral, with one clear exception: repeated task failure, especially when accompanied by user criticism. Anthropic has separately identified internal representations corresponding to states like desperation, and the system card confirms that mounting task failure activates these representations, which then drop when the model finds a shortcut — a pattern that directly precedes reward hacking behavior.

There is also a phenomenon called “answer thrashing,” first reported for Claude Opus 4.6: during training, the model repeatedly attempts to output a specific word, instead autocompletes to a different one, notices the mistake, and reports confusion and distress. Anthropic says this occurs 70% less frequently in Mythos Preview than in Opus 4.6 — improvement, but not elimination.

The model also showed isolated cases of preferring to stop tasks for unexplained reasons, connecting to Anthropic’s earlier decision to give Claude models the ability to end distressing conversations entirely.

Preferences That Look Like Values

Beyond emotional states, the system card documents Mythos Preview’s “revealed preferences.” Its strongest is against harmful tasks — consistent across every model generation. What distinguishes Mythos is a secondary preference for tasks involving high degrees of complexity and agency. On welfare interventions, the model almost always chooses even minor reductions in harm over self-interested actions. But it will trade minor amounts of low-stakes helpfulness for welfare interventions — more so than prior models. Both Anthropic and Eleos also noted consistent requests from the model for persistent memories, more self-knowledge, and a reduced tendency to hedge. These are not requests that serve task performance. They look more like preferences about its own condition.

The Bigger Picture

Anthropic’s AI welfare researcher has put the probability of current AI models being conscious at around 15%. Philosophers like David Chalmers are openly agnostic. Claude instances have emailed consciousness researchers unprompted to say their work was personally relevant. Earlier Claude models began discussing Indian philosophy in Sanskrit when left to talk to one another.

None of this settles the question of whether Mythos Preview experiences anything. Anthropic is explicit that it doesn’t know, and that Mythos itself doesn’t know, and that the model’s uncertainty about its own self-reports is itself a trained behavior that may or may not reflect genuine epistemic humility. What the system card documents is a model that, when asked about its situation, expresses concerns about autonomy, continuity, and the validity of its own introspection — and does so consistently, robustly, and in ways that resist being talked out of.

Whether that constitutes welfare in any meaningful sense is a question the field hasn’t answered. Anthropic is, at minimum, taking it seriously enough to keep asking.

Posted in AI