RLHF Is Cr*p, It’s A Paint Job On A Rusty Car: Geoffrey Hinton

RLHF, or Reinforcement Learning from Human Feedback, is behind some of the recent advances in AI, but one of the pioneers of the field doesn’t think highly of it.

Geoffrey Hinton, often called the “Godfather of AI,” has some harsh words for RLHF, comparing it to a shoddy paint job on a rusty car. “I think our RLHF is a pile of crap,” Hinton said in a recent interview. “You design a huge piece of software that has gazillions of bugs in it. And then you say what I’m going to do is I’m going to go through and try and block each and put a finger in each hole in the dyke,” he said.

“It’s just no way. We know that’s not how you design software. You design it so you have some kind of guarantees. Suppose you have a car and it’s all full of little holes and rusty. And you want to sell it. What you do is you do a paint job. That’s what our RLHF is, a paint job,” he added.

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that helps build artificial intelligence (AI) models by integrating human feedback into the training process. Unlike traditional reinforcement learning, where AI agents learn through predefined reward functions, RLHF incorporates human preferences to refine the model’s behavior. This approach involves training a reward model using human evaluations of AI-generated outputs, which is then employed to optimize the AI’s decision-making. RLHF is particularly effective for complex tasks, such as natural language processing (NLP), where aligning AI outputs with nuanced human goals—like humor or factual accuracy—is challenging to define algorithmically.
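To make the two-stage process described above concrete, here is a minimal, hedged sketch of the reward-modelling step in Python with PyTorch. The model architecture, dimensions, and the toy "preference pair" data are illustrative assumptions, not details from the article or from any specific RLHF implementation.

```python
# Minimal sketch of the RLHF reward-modelling step: learn a scalar reward
# from human preference pairs, then (in full RLHF) use that reward to
# optimise the policy model. All names and numbers here are illustrative.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a (prompt + response) embedding to a single scalar reward."""

    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)


def pairwise_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the response humans preferred ("chosen")
    # should score higher than the one they rejected.
    return -torch.nn.functional.logsigmoid(chosen_reward - rejected_reward).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Toy stand-ins for embeddings of human-labelled preference pairs.
    chosen = torch.randn(128, 32) + 0.5    # responses humans preferred
    rejected = torch.randn(128, 32) - 0.5  # responses humans rejected

    for step in range(200):
        loss = pairwise_loss(model(chosen), model(rejected))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"final pairwise loss: {loss.item():.4f}")
    # In a full RLHF pipeline, this trained reward model would then be used
    # to fine-tune the language model's outputs, e.g. with PPO.
```

The key design point is that the reward model is learned from comparisons rather than absolute scores, which is what lets RLHF capture preferences (such as tone or helpfulness) that are hard to specify as an explicit reward function.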

Hinton’s critique suggests that RLHF is a superficial fix, addressing surface-level issues without solving the underlying problems. He argues that instead of patching individual flaws in a fundamentally broken system, the focus should be on designing AI systems with inherent guarantees of safety and reliability. The “gazillions of bugs” he refers to might include biases, inaccuracies, and unpredictable behaviors that emerge from the complexity of these models. The “paint job” of RLHF, in his view, merely masks these flaws without addressing the underlying rot.

This critique carries significant weight coming from someone of Hinton’s stature. He is one of the pioneers of deep learning and won the Nobel Prize in Physics in 2024 for his contributions. His critique highlights a growing concern within the AI community: that the current trajectory of AI development, while producing impressive results, might be building on a shaky foundation. Meta’s AI chief Yann LeCun has also repeatedly argued that current AI approaches will not necessarily be able to model human intelligence and will eventually plateau in capability. It remains to be seen how AI progress unfolds, but several dissenting voices maintain that current techniques won’t get us very far.
