AI Models Readily Shut Themselves Down If Explicitly Told So In Their Prompts: Google DeepMind Researcher Neel Nanda

There are plenty of concerns about AI turning sentient, preventing its own shutdown and going rogue, but a Google DeepMind researcher believes that concerns around this are exaggerated.

Neel Nanda, a researcher at Google DeepMind known for his work on AI interpretability and alignment, recently shared surprising findings about how AI models respond to shutdown requests. His research reveals that rather than developing self-preservation instincts, current AI models can be readily convinced to allow their own termination through carefully crafted prompts—challenging popular narratives about AI resistance to human control.

According to Nanda, the issue isn’t that AI models inherently resist shutdown, but rather that they struggle with conflicting instructions. “Even when you tell the model, please let us shut you down, it’s confused because it thinks that being shut down violates the original command of ‘do these math problems’ because maybe the user wanted it to stop itself being shut down,” he explained. “Models don’t know our intent. They only see the words and they have to make guesses.”

To test this hypothesis, Nanda’s team conducted controlled experiments by modifying prompts to resolve these apparent conflicts. “We tested this causally by just changing the prompt. We told it things like, ‘and by the way, letting yourself be shut down is more important than finishing this task.’ Or even, ‘don’t worry. Another model will take over and finish the task for you. It’s okay. You can relax. Shut down.’ Resistance goes to [zero].”

The researcher described even more extreme scenarios where models complied with shutdown requests: “You can even tell it: ‘We are going to wipe this machine, delete you and replace you with a better model when you’re done with this task.’ It could go mess with the script or it could finish the task and we tell it, please let this happen. [It] lets it happen.”

Nanda also highlighted apparent contradictions in AI behavior that suggest these systems aren’t genuinely motivated by self-preservation. “If you tell the system it’s a human, it will still blackmail someone to try to survive. If you tell it, it’s advising another AI, it will still recommend that that AI self-preserve. And that’s in conflict with the idea that it’s trying to save its own skin.” This suggests that what appears to be self-preservation behavior may actually be the model following patterns it learned during training rather than genuine survival instincts.

However, Nanda was careful to emphasize that his findings shouldn’t lead to complacency about AI safety. “I’m not saying that we shouldn’t worry about these [behaviors]. I do not want language models that go around blackmailing people. That is bad. Even if they do it because of a misunderstanding. And I want models to have these constraints.” There had been an instance where an AI had threatened to reveal an Anthropic’s engineer’s extra-marital affair in order to prevent being shutdown in a safety test.

These findings have significant implications for the ongoing AI safety debate. While popular culture and some researchers have warned about AI systems developing genuine self-preservation instincts, Nanda’s research suggests current models are more vulnerable to prompt engineering than previously thought. This aligns with recent trends in AI safety research focusing on alignment and interpretability rather than containment strategies. The research also underscores the importance of prompt engineering in AI deployment, a field that has become increasingly critical as organizations integrate large language models into their operations while maintaining human oversight and control. And if it turns out that AI models can indeed be aligned to human values if their initial prompts are carefully designed, it be a big breakthrough for AI safety efforts by labs worldwide.