Reinforcement learning is thought by many to be the next big thing in AI, but not everyone’s convinced of its potential.
Former Tesla Director of AI Andrej Karpathy has said that he is bearish on Reinforcement Learning long-term, though he remains bullish on environments and agentic interactions. Karpathy highlighted the importance of building environments today where AI systems can interact and learn, but he also stressed that newer approaches may be required to replicate human-like intelligence.

“In era of pretraining, what mattered was internet text. You’d primarily want a large, diverse, high quality collection of internet documents to learn from,” Karpathy posted on X.
“In era of supervised finetuning, it was conversations. Contract workers are hired to create answers for questions, a bit like what you’d see on Stack Overflow / Quora, or etc., but geared towards LLM use cases,” he added.
“Neither of the two above are going away (imo), but in this era of reinforcement learning, it is now environments. Unlike the above, they give the LLM an opportunity to actually interact – take actions, see outcomes, etc. This means you can hope to do a lot better than statistical expert imitation. And they can be used both for model training and evaluation. But just like before, the core problem now is needing a large, diverse, high quality set of environments, as exercises for the LLM to practice against…Environments have the property that once the skeleton of the framework is in place, in principle the community / industry can parallelize across many different domains, which is exciting,” Karpathy said.
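To make the "skeleton of the framework" idea concrete, here is a minimal sketch of what such an environment might look like in Python. The names (GuessNumberEnv, StepResult, llm_policy) are illustrative assumptions, not drawn from any framework Karpathy references; the point is the reset/step loop that lets a model take actions, see outcomes, and collect a reward signal usable for both training and evaluation.

```python
# Minimal sketch of an LLM environment interface, assuming a simple
# reset/step contract. All names here are hypothetical illustrations.
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str   # what the model sees after acting
    reward: float      # scalar feedback for training / evaluation
    done: bool         # whether the episode has ended


class GuessNumberEnv:
    """Toy environment: the agent must guess a hidden integer.

    Stands in for richer domains (code execution, web tasks, games)
    that would share the same reset/step skeleton.
    """

    def __init__(self, target: int = 7, max_turns: int = 5):
        self.target = target
        self.max_turns = max_turns

    def reset(self) -> str:
        self.turns = 0
        return "Guess an integer between 1 and 10."

    def step(self, action: str) -> StepResult:
        self.turns += 1
        try:
            guess = int(action.strip())
        except ValueError:
            return StepResult("Please reply with a number.", -0.1, False)
        if guess == self.target:
            return StepResult("Correct!", 1.0, True)
        hint = "higher" if guess < self.target else "lower"
        done = self.turns >= self.max_turns
        return StepResult(f"Wrong, try {hint}.", 0.0, done)


def rollout(env, llm_policy):
    """Run one episode. `llm_policy` is a stand-in for any model that
    maps the dialogue so far to a next action string."""
    history = [env.reset()]
    total_reward = 0.0
    while True:
        action = llm_policy(history)
        result = env.step(action)
        history += [action, result.observation]
        total_reward += result.reward
        if result.done:
            return history, total_reward
```

Once an interface like this is fixed, the parallelization Karpathy mentions follows naturally: different teams can contribute new environments in new domains, and any of them plugs into the same rollout loop.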
Karpathy said that while it was important to build environments, he wasn’t bullish on RL long-term, because he believed humans don’t primarily learn through Reinforcement Learning, and different approaches could be needed to replicate human intelligence. “Personally and long-term, I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically. I think that reward functions are super sus (suspicious), and I think humans don’t use RL to learn (maybe they do for some motor tasks etc, but not intellectual problem solving tasks). Humans use different learning paradigms that are significantly more powerful and sample efficient and that haven’t been properly invented and scaled yet, though early sketches and ideas exist (as just one example, the idea of “system prompt learning”, moving the update to tokens/contexts not weights and optionally distilling to weights as a separate process a bit like sleep does),” he said.
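A rough sketch of what “system prompt learning” could mean in practice appears below. Here the agent distills a textual lesson from each attempt and appends it to its system prompt, so the update lands in tokens rather than weights. The class and the `call_llm` function are hypothetical stand-ins, not any real API, and this is only one plausible reading of the idea.

```python
# Hedged sketch of "system prompt learning": after each task, the agent
# writes a lesson into its own system prompt instead of updating weights.
# `call_llm` is a hypothetical chat-completion function; no real API implied.

class SystemPromptLearner:
    def __init__(self, call_llm):
        self.call_llm = call_llm
        self.lessons: list[str] = []   # the evolving "learned" context

    def system_prompt(self) -> str:
        base = "You are a careful problem solver."
        if not self.lessons:
            return base
        return base + "\nLessons from past attempts:\n" + "\n".join(
            f"- {lesson}" for lesson in self.lessons
        )

    def solve(self, task: str) -> str:
        # The accumulated lessons ride along in the context window.
        return self.call_llm(system=self.system_prompt(), user=task)

    def reflect(self, task: str, answer: str, feedback: str) -> None:
        # The "update" is textual: summarize what to do differently
        # next time and append it to the prompt.
        lesson = self.call_llm(
            system="Extract one short, general lesson from this attempt.",
            user=f"Task: {task}\nAnswer: {answer}\nFeedback: {feedback}",
        )
        self.lessons.append(lesson.strip())
```

The separate “distill to weights, a bit like sleep” process Karpathy mentions would correspond to periodically fine-tuning the model on the accumulated lessons and then clearing them from the context, a phase this sketch deliberately omits.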
This isn’t the first time that Andrej Karpathy has indicated that reinforcement learning might not be the holy grail on the path to AGI. He’d said last month that he doesn’t expect Reinforcement Learning to be the full story in AI: he agreed that RL will produce some intermediate gains, but researchers will need to come up with new techniques to build truly human-like intelligence. In the past, he’s hinted that AI is missing a feature that updates the system prompt based on new learnings, which is closer to how humans pick up new skills. It remains to be seen if RL indeed hits a wall after some incremental progress, but for now, even Andrej Karpathy agrees that it might be useful to build environments to scale RL to its full potential.