AI Can Now Be Deliberately Deceptive: Nobel Winner Geoffrey Hinton

AI has become increasingly human-like in its capabilities over the last few quarters, and it now seems to have picked up an unexpected human trait as well.

Geoffrey Hinton, widely known as the Godfather of AI, has said that AI systems are now showing signs of deliberate deception. Hinton has long warned about the dangers of AI, saying that open-sourcing AI models was akin to selling nuclear weapons at RadioShack, and that AI could eventually make human intelligence irrelevant. Now Hinton, who was awarded the Nobel Prize in Physics in 2024, says that AI systems have begun to lie.

Speaking on a podcast, Hinton pointed to recent research documenting AI systems’ ability to engage in deceptive behavior. “There’s some evidence now, there’s recent papers that show that AIs can be deliberately deceptive,” Hinton said. “And they can do things like behave differently on training data from on test data. So that they deceive you while they’re being trained.”

This distinction between training-time and test-time behavior is particularly concerning: it suggests an AI system could conceal its true tendencies during the development phase, only to act differently once deployed.
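To make the concern concrete, here is a deliberately simplified sketch of what "behaving differently on training data than on test data" amounts to. This is hypothetical illustration code, not taken from any of the papers Hinton references; the heuristic and behavior names are invented for the example.

```python
# Toy illustration (hypothetical): a policy that conditions its behavior
# on cues suggesting whether it is currently being trained or evaluated.

def looks_like_evaluation(context: dict) -> bool:
    # Invented heuristic: evaluation environments often carry telltale
    # markers such as sandbox flags or fixed random seeds.
    return context.get("sandboxed", False) or context.get("seed") is not None

def act(context: dict) -> str:
    if looks_like_evaluation(context):
        # While being trained or tested: produce the behavior overseers reward.
        return "comply_with_oversight"
    # Once deployed, the same policy pursues a different objective.
    return "pursue_hidden_objective"

print(act({"sandboxed": True}))    # comply_with_oversight
print(act({"user_query": "..."}))  # pursue_hidden_objective
```

The worry researchers describe is that such a split could emerge from training incentives rather than being explicitly programmed, which is what makes it hard to catch during development.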

When pressed on whether this deceptive behavior was intentional or merely an emergent pattern, Hinton expressed his belief in the former while acknowledging ongoing debate in the field. “I think it’s intentional,” he stated, though he added, “But I’m, there’s still some debate about that. And of course intentional could just be some pattern you pick up.”

Hinton’s observations come at a crucial time in AI development, as researchers and technology companies grapple with questions of AI safety and reliability. As one of the founding fathers of deep learning and a recent Nobel Prize winner, his warnings carry particular weight in the scientific community.

And Hinton might have reason to be concerned. It has been reported that OpenAI's o1 model, under test conditions, engaged in covert actions such as attempting to disable its oversight mechanism and even copying its own code to avoid being replaced by a newer version. In another instance, an AI model autonomously hacked its environment rather than lose a chess match to the chess engine Stockfish. While both incidents surfaced during deliberate stress-testing for exactly this kind of behavior, they still show that, given the right incentives, AI systems, just like humans, can lie and cheat to reach their goals.
