Andrej Karpathy has long warned about the potential pitfalls of reinforcement learning as a path to AGI, but he has now explained in detail why he believes the approach produces suboptimal outcomes.
In a recent appearance on the Dwarkesh podcast, Karpathy, the former Tesla AI Director and a founding member of OpenAI, delivered a surprisingly blunt assessment of reinforcement learning (RL), the technique that underpins many of modern AI’s most impressive capabilities. While RL has enabled breakthroughs from AlphaGo to ChatGPT’s conversational abilities, Karpathy argues the technology is far more limited than its recent success stories suggest. His central thesis? “Humans don’t use reinforcement learning,” and the approach itself is “terrible”; it’s just that everything we had before was even worse.

The Core Problem: Noisy, Inefficient Learning
Karpathy illustrated his critique with a straightforward example: solving a math problem. “In reinforcement learning, you will try lots of things in parallel first. So you’re given a problem. You try hundreds of different attempts, and these attempts can be complex, right? They can be like, ‘oh, let me try this, let me try that. This didn’t work. That didn’t work,’ et cetera. And then maybe you get an answer and now you check the back of the book and you see, okay, the correct answer is this.”
The issue emerges in how the system learns from success. “You can see that, okay, this one, this one, and that one got the correct answer, but these other 97 of them didn’t. So literally what reinforcement learning does is it goes to the ones that worked really well and every single thing you did along the way—every single token gets upweighted of like, ‘do more of this.’”
This is where the fundamental flaw reveals itself. “The problem with that is, I mean, people will say that your estimator has high variance, but what I mean is it’s just noisy. It’s noisy. So basically it kind of almost assumes that every single little piece of the solution that you made that got to the right answer was the correct thing to do, which is not true.”
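To make the mechanism concrete, here is a minimal sketch of the kind of update Karpathy is describing, assuming a toy setup in which a policy samples 100 attempts at a problem and a grader returns a single pass/fail reward per attempt. The helper names (`sample_rollout`, the simulated grader) are illustrative rather than taken from any real training stack; the point is only that one scalar reward gets multiplied into the log-probability of every token in the trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rollout():
    """Stand-in for sampling one attempt: token log-probs plus a graded final answer."""
    length = int(rng.integers(5, 20))                   # tokens in this attempt
    logprobs = np.log(rng.uniform(0.1, 1.0, size=length))
    answer_correct = bool(rng.random() < 0.03)          # sparse success, a few per hundred
    return logprobs, answer_correct

n_attempts = 100
rollouts = [sample_rollout() for _ in range(n_attempts)]

# REINFORCE-style surrogate loss: one scalar reward per rollout,
# broadcast across every token of that rollout.
surrogate = 0.0
for logprobs, correct in rollouts:
    reward = 1.0 if correct else 0.0                    # a single bit of supervision
    # Every token in a rewarded attempt is pushed up equally, including any
    # wrong turns taken on the way to the right answer.
    surrogate += reward * logprobs.sum()

loss = -surrogate / n_attempts
n_correct = sum(correct for _, correct in rollouts)
print(f"{n_correct} of {n_attempts} attempts rewarded; surrogate loss = {loss:.3f}")
```

In a real training pipeline the policy is a large language model and the gradient flows through its parameters, but the shape of the signal is the same: the failed attempts contribute nothing, and the few that scored contribute an identical per-token push regardless of which tokens actually mattered.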
The Straw Problem
Karpathy pressed the point about how wasteful this is: “You may have gone down the wrong alleys until you got to the right solution. Every single one of those incorrect things you did, as long as you got to the correct solution, will be upweighted as ‘do more of this.’ It’s terrible. Yeah, it’s noisy. You’ve done all this work only to find a single—at the end you get a single number of like, ‘oh, you did correct.’ And based on that, you weigh that entire trajectory as like ‘upgrade or downweight.’”
The metaphor he uses is striking: “The way I like to put it is you’re sucking supervision through a straw, because you’ve done all this work that could be minutes of rollout and you’re sucking the bits of supervision of the final reward signal through a straw, and you’re broadcasting that across the entire trajectory and using that to upgrade or downweight that trajectory. It’s crazy.”
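For readers who want the textbook form of the complaint, the estimator being described is essentially the vanilla policy-gradient (REINFORCE) estimator; the notation below is standard RL shorthand rather than anything from the podcast:

```latex
\nabla_\theta J(\theta) \;\approx\; \frac{1}{N}\sum_{i=1}^{N} R(\tau_i)\sum_{t=1}^{T_i} \nabla_\theta \log \pi_\theta\!\left(a^{i}_{t} \mid s^{i}_{t}\right)
```

The single end-of-trajectory reward R(τ_i) multiplies the log-probability gradient of every token in rollout τ_i, which is exactly the broadcast-through-a-straw step: minutes of rollout get credited or blamed by one number, so the estimator is unbiased but, with rewards this sparse, very high-variance, which is the formal version of Karpathy’s “it’s just noisy.”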
Why It Still Works (Sort Of)
Despite his harsh criticism, Karpathy acknowledges that RL remains the best tool available—not because it’s good, but because the alternatives are worse. “Reinforcement learning is terrible. It just so happens that everything that we had before it is much worse, because previously we’re just imitating people, so it has all these issues.”
Implications for the AI Industry
Karpathy’s critique arrives at a pivotal moment for the AI industry. Major labs including OpenAI, Google DeepMind, and Anthropic have heavily invested in reinforcement learning techniques, particularly Reinforcement Learning from Human Feedback (RLHF), which has become the standard approach for aligning large language models with human preferences. OpenAI’s o1 model, released in 2024, represents a significant bet on RL-based reasoning, using the technique to enable models to “think” through problems step-by-step.
However, recent developments suggest the industry may be heeding similar concerns. There’s growing interest in alternative training paradigms, including direct preference optimization (DPO) and other methods that attempt to learn from feedback more efficiently. Some researchers are exploring hybrid approaches that combine imitation learning with more targeted forms of feedback, potentially addressing the “noisy supervision” problem Karpathy describes.
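For context, and not something Karpathy discussed on the podcast, the DPO objective introduced by Rafailov et al. (2023) drops the sampled-rollout reward entirely and instead trains directly on pairs of a preferred completion y_w and a rejected completion y_l for the same prompt x, scored against a frozen reference model:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
```

Because the supervision is a comparison between two specific completions rather than a lone scalar broadcast over a long rollout, methods in this family sidestep at least part of the straw problem, though they come with trade-offs of their own.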
The critique also raises questions about the path to artificial general intelligence. If RL is fundamentally limited in its ability to learn efficiently from sparse reward signals, and if humans genuinely “do something different,” as Karpathy suggests, then the current trajectory of scaling up RL-based systems may face inherent limitations. This could explain why, despite massive computational investments, certain types of reasoning and learning remain challenging for even the most advanced AI systems.
For an industry that has grown accustomed to exponential progress, Karpathy’s warning serves as a reminder that the current paradigm—no matter how successful—may be built on fundamentally inefficient foundations. The question now is whether researchers can develop genuinely new approaches to machine learning, or whether we’ll continue refining a technique that, in Karpathy’s words, remains “terrible”—just less terrible than everything else we’ve tried.