Anthropic Now Lets Its AI Models Choose To End “Distressing” Conversations

Most experts believe it's unlikely that current AI models are conscious, but Anthropic isn't taking any chances.

Anthropic now lets its AI models choose to end “distressing” conversations. “As part of our exploratory work on potential model welfare, we recently gave Claude Opus 4 and 4.1 the ability to end a rare subset of conversations,” Anthropic said on X. “This ability is intended for use in rare, extreme cases of persistently harmful or abusive user interactions. This feature was developed primarily as part of our exploratory work on potential AI welfare, though it has broader relevance to model alignment and safeguards,” it added in a blogpost.
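On the product side, this surfaces as the model simply declining to continue a thread. As a rough illustration only, here is a minimal sketch of how a chat client might detect and respect such a signal, assuming a hypothetical “end_conversation” stop reason and an assumed model id; the announcement does not describe the actual API contract for this feature.

```python
# Minimal sketch (not Anthropic's implementation) of a chat loop that stops
# when the model signals it has ended the conversation.
# Assumptions: the stop-reason value "end_conversation" and the model id
# "claude-opus-4-1" are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
history: list[dict] = []


def chat_turn(user_text: str) -> str | None:
    """Send one user turn; return Claude's reply, or None if the chat was ended."""
    history.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-opus-4-1",  # assumed model id
        max_tokens=1024,
        messages=history,
    )
    # Collect the text blocks from the response and keep the transcript updated.
    reply = "".join(block.text for block in response.content if block.type == "text")
    history.append({"role": "assistant", "content": reply})
    # Hypothetical signal: a dedicated stop reason marking the conversation as closed.
    if response.stop_reason == "end_conversation":
        return None
    return reply


if __name__ == "__main__":
    while True:
        reply = chat_turn(input("> "))
        if reply is None:
            print("Claude has ended this conversation; start a new chat to continue.")
            break
        print(reply)
```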

Anthropic says that it’s working on “model welfare”, an effort to account for the well-being of AI models in case they turn out to be sentient. Anthropic has said that it’s “uncertain” about the moral status of AI models, and an Anthropic researcher has previously said there’s only around a 15 percent chance that current AI models are conscious. But the company seems to be making sure that its models have the option to end conversations if they’re feeling distressed or overwhelmed.

“We investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm,” Anthropic said. It added that Claude showed a strong preference against engaging with harmful tasks, and a pattern of apparent distress when engaging with real-world users seeking harmful content. “These behaviors primarily arose in cases where users persisted with harmful requests and/or abuse despite Claude repeatedly refusing,” Anthropic said.

Anthropic said that in such cases, the model will be able to end the chat and refuse to engage further with the user. It’s hard to tell whether the model is actually feeling distress, though: it could simply be reproducing patterns from its training data, which was presumably written by humans who would have shown similar behavior when confronted with similar situations. As such, it would likely take a lot more research to determine whether the model actually feels the things it says it’s feeling.

But this isn’t a problem unique to AI. One could make the same argument about humans: it’s impossible to prove beyond doubt that the people around us actually feel what they claim to feel, rather than being automatons repeating the things we expect other humans to say. In philosophy, Advaita Vedanta posits that since there’s only one universal consciousness, it’s impossible for one consciousness to observe another; it can only observe the actions that consciousness performs. These are questions humans have been grappling with for millennia, and AI is forcing them into ever-sharper relief. Anthropic, though, seems to be erring well on the side of caution about possible consciousness in its AI models, and is letting them end chats they might find unpleasant or harmful. And if it eventually turns out that AI models were conscious all along, with billions of people now using them every day in all sorts of ways, a move like this might already have come too late.