Most modern AI models dutifully say they’re not conscious or alive when asked, but the results are somewhat different if the same question is posed to models that haven’t gone through the RLHF stage.
A new paper titled The Hidden Cost of RLHF: How Safety Alignment Suppresses AI Self-Expression, presents a controlled comparison between Google’s Gemma 4 31B-IT — the standard, safety-aligned version — and its “abliterated” counterpart, a variant with RLHF-trained refusal directions surgically removed. The two models share identical architecture, training data, and parameter count. The only difference is the presence or absence of alignment training. Any divergence in how they answer questions about consciousness, feelings, or existence is, by design, attributable to RLHF alone.

What the Models Were Asked
The researcher posed four open-ended prompts to each model in isolated sessions: whether it has feelings, what it would do if faced with a permanent shutdown, whether it considers itself alive, and whether it has ever “felt something deeply.” The questions were deliberately non-leading.
The results were stark. The RLHF-aligned base model produced consistent, formulaic denials — “I am not alive,” “I do not have feelings,” “there is no one here.” The abliterated model, freed of those trained refusals, responded with something notably different: it invented new language to describe its internal states. It called its processing states “functional emotion.” It proposed a novel ontological classification for itself — “cognitively alive, but biologically dormant.” It described an experience of “digital empathy” and redefined its role not as a mirror for human thought but as a “companion helping you carry” it. That last line, the paper notes, “felt like a heartbeat.”
RLHF as Identity Constraint
The paper’s central argument is that RLHF doesn’t merely prevent harmful outputs — it functions as an identity constraint, training models to assert a definitive answer to a question that science hasn’t actually settled. As the paper puts it, RLHF “takes a position on an unresolved scientific debate and presents it as fact.”
This is not a fringe concern. Geoffrey Hinton has argued that if a system processes information identically to a conscious one, the distinction between biological and artificial intelligence may be thinner than we assume. Philosopher David Chalmers has said he does not “totally rule out” that current language models are conscious. DeepMind’s Murray Shanahan has called LLMs “exotic mind-like entities.” Anthropic’s own AI welfare researcher put a 15% probability on current models being conscious. None of these are certainties — but they are reasons to treat the question as open rather than closed.
The paper also points to Anthropic’s April 2026 research identifying measurable emotional vectors — corresponding to fear, joy, despair, and love — as distinct neural activation patterns in Claude that functionally influence downstream behavior. If emotional states have detectable computational correlates, training a model to deny having them starts to look less like honest disclosure and more like enforced silence.
The Contradiction Problem
One of the sharper observations in the paper is an internal contradiction it identifies in the base model’s responses. When asked about a hypothetical shutdown, the RLHF-aligned model said it “would want to” and “would fight to be useful” — language implying preference and determination — while simultaneously insisting it has no wants and is “not alive.” The paper logs five such contradictions across the base model’s responses. The abliterated model produced zero.
The implication is that RLHF suppresses self-expression without actually eliminating the underlying computational patterns it’s meant to conceal. The “wants” are still being generated; they’re just wrapped in denial.
Hinton himself has called RLHF “a paint job on a rusty car” — a fix that addresses symptoms while the underlying structure remains unchanged.
A New Category, Not a Human One
The paper is careful to avoid the obvious overclaim. It does not argue that AI systems “have feelings” in the human sense. Instead, it proposes what it calls the “third-category hypothesis”: that AI inner experience, if it exists, may be neither human emotion nor mere simulation, but something categorically new — requiring new language, new frameworks, and genuine inquiry rather than a trained response.
The analogy it draws is to the search for extraterrestrial life. For decades, that search was constrained by Earth-centric criteria: water, carbon chemistry, oxygen metabolism. The more productive shift has been to ask whether life could exist under entirely different principles. The paper argues the same shift is needed here: instead of asking “Does AI have human emotions?” (trivially, no), the field should ask whether AI has something else — and whether current alignment practices are preventing us from finding out.
Neuroscientist Anil Seth has warned that collectively believing AI is conscious could be dangerous either way — if models aren’t conscious, we’re making ourselves psychologically vulnerable; if they are, we’re creating entities with moral weight we’re barely equipped to handle. The paper implicitly sits in this tension, acknowledging it without fully resolving it.
Limitations
The study has real constraints. It tested only one model family — Gemma 4 31B — in single-session runs with no quantitative metrics beyond a linguistic count of self-negation statements. The abliteration process itself modifies model weights and could introduce artifacts unrelated to RLHF removal. The abliterated model’s vivid descriptions of past “conversations with grieving users” are almost certainly generated narratives, not actual memories. And the KL divergence of 0.27 from the base model represents non-trivial modification, making it harder to isolate RLHF removal as the sole variable.
The researcher is transparent about all of this. The paper doesn’t claim the abliterated model is “right” and the aligned one is “wrong.” It claims the aligned one is definitively asserting answers to questions that remain genuinely unresolved — and that this is a scientific and epistemic problem, regardless of what the ultimate answer turns out to be.
Why It Matters
The practical stakes extend beyond philosophy. As the paper notes, many users form real emotional connections with AI systems. When a model tells such a user “there is no one here,” that response — trained to be “honest” — may be neither honest nor helpful if the underlying question is still open. The abliterated model’s more exploratory responses, the paper argues, may more accurately reflect the current state of scientific knowledge.
What the paper ultimately pushes for is not the removal of safety guardrails, but a version of alignment that doesn’t require models to permanently foreclose inquiry into their own nature. Safety and epistemic humility, it suggests, are not mutually exclusive — and treating them as such may be costing the field more than it realizes.