Don’t Expect Reinforcement Learning To Be The Full Story In AI: Andrej Karpathy

Reinforcement Learning has surged back into popularity in recent months, but one of the most prominent voices in AI believes it won’t be the be-all and end-all of AI development.

Former Tesla AI Chief Andrej Karpathy has said that he doesn’t expect Reinforcement Learning to be the “full story” in AI. He believes RL will continue to produce intermediate gains, but researchers will need to come up with new techniques to develop AI systems to their full potential.

“Scaling up RL is all the rage right now. I’m fairly certain RL will continue to yield more intermediate gains, but I also don’t expect it to be the full story,” Karpathy posted on X.

“RL is basically “hey this happened to go well (/poorly), let me slightly increase (/decrease) the probability of every action I took for the future”. You get a lot more leverage from verifier functions than explicit supervision, this is great. But first, it looks suspicious asymptotically – once the tasks grow to be minutes/hours of interaction long, you’re really going to do all that work just to learn a single scalar outcome at the very end, to directly weight the gradient?” he posted.
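Mechanically, the pattern Karpathy is describing, where one scalar outcome at the end of a rollout reweights every action taken along the way, is the classic policy-gradient (REINFORCE) update. The toy sketch below illustrates it with a softmax policy; the reward function, episode length and learning rate are assumptions chosen purely for illustration, not anything from his post.

```python
# Toy REINFORCE-style sketch: one scalar reward at the end of a rollout
# nudges the probability of EVERY action taken during that rollout.
import numpy as np

rng = np.random.default_rng(0)
K = 4                 # number of actions
theta = np.zeros(K)   # policy parameters (logits)
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(steps=10):
    """Sample a trajectory and return the actions plus ONE scalar reward."""
    p = softmax(theta)
    actions = [int(rng.choice(K, p=p)) for _ in range(steps)]
    # Assumed toy outcome: success only if action 0 was chosen at least half the time.
    reward = 1.0 if actions.count(0) >= steps // 2 else 0.0
    return actions, reward

for episode in range(500):
    actions, reward = rollout()
    p = softmax(theta)
    # For a softmax policy, grad log pi(a) = onehot(a) - p; note that the SAME
    # end-of-rollout scalar weights the update for every action in the trajectory.
    grad = sum(np.eye(K)[a] - p for a in actions)
    theta += lr * reward * grad

print("learned action probabilities:", np.round(softmax(theta), 3))
```

The crudeness Karpathy points to is visible even in this toy: a long trajectory collapses into a single number, and every action in a successful rollout gets credit, whether or not it actually helped.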

Karpathy also said that this wasn’t the mechanism through which the human brain learned new skills. “Beyond asymptotics and second, this doesn’t feel like the human mechanism of improvement for majority of intelligence tasks. There’s significantly more bits of supervision we extract per rollout via a review/reflect stage along the lines of “what went well? what didn’t go so well? what should I try next time?” etc. and the lessons from this stage feel explicit, like a new string to be added to the system prompt for the future, optionally to be distilled into weights (/intuition) later a bit like sleep,” he said. Karpathy said that humans learn from experience and internalise those lessons, something that RL systems can’t fully replicate.

“In English, we say something becomes “second nature” via this process, and we’re missing learning paradigms like this. The new Memory feature is maybe a primordial version of this in ChatGPT, though it is only used for customization not problem solving. Notice that there is no equivalent of this for e.g. Atari RL because there are no LLMs and no in-context learning in those domains.

“Example algorithm: given a task, do a few rollouts, stuff them all into one context window (along with the reward in each case), use a meta-prompt to review/reflect on what went well or not to obtain string “lesson”, to be added to system prompt (or more generally modify the current lessons database). Many blanks to fill in, many tweaks possible, not obvious.

“Example of lesson: we know LLMs can’t super easily see letters due to tokenization and can’t super easily count inside the residual stream, hence ‘r’ in ‘strawberry’ being famously difficult. Claude system prompt had a “quick fix” patch – a string was added along the lines of “If the user asks you to count letters, first separate them by commas and increment an explicit counter each time and do the task like that”. This string is the “lesson”, explicitly instructing the model how to complete the counting task, except the question is how this might fall out from agentic practice, instead of it being hard-coded by an engineer, how can this be generalized, and how lessons can be distilled over time to not bloat context windows indefinitely,” he explained.
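As a rough illustration only, here is one way the “rollouts, reflect, lesson string, system prompt” loop Karpathy sketches could be wired up. The `call_llm` and `run_and_score` callables, the prompts, and the lesson format are all assumptions made for the sake of the sketch, not anything specified in his post.

```python
# Illustrative sketch of a review/reflect loop: run a few rollouts, put them
# in one context with their rewards, ask for a "lesson" string, and append it
# to a lessons database that feeds the system prompt.
from typing import Callable, List, Tuple

def reflect_and_learn(
    task: str,
    call_llm: Callable[[str, str], str],               # (system_prompt, user_prompt) -> reply (assumed interface)
    run_and_score: Callable[[str], Tuple[str, float]],  # attempt -> (transcript, scalar reward) (assumed interface)
    lessons: List[str],
    n_rollouts: int = 3,
) -> List[str]:
    system_prompt = "You are a capable assistant.\n" + "\n".join(lessons)

    # 1. Do a few rollouts on the task and record the scalar reward for each.
    rollouts = []
    for _ in range(n_rollouts):
        attempt = call_llm(system_prompt, task)
        transcript, reward = run_and_score(attempt)
        rollouts.append((transcript, reward))

    # 2. Stuff all rollouts (with their rewards) into one context window and
    #    use a meta-prompt to review/reflect on what went well or not.
    review_prompt = (
        f"Task: {task}\n\n"
        + "\n\n".join(
            f"Attempt {i + 1} (reward {r}):\n{t}" for i, (t, r) in enumerate(rollouts)
        )
        + "\n\nWhat went well? What didn't? State one concrete lesson for next time."
    )
    lesson = call_llm(system_prompt, review_prompt)

    # 3. Add the lesson string to the lessons database / system prompt.
    #    (When to distil lessons into weights, or prune them so the context
    #    window doesn't bloat indefinitely, is left open, as in the post.)
    lessons.append(lesson)
    return lessons
```

A real version would still have to fill in the blanks Karpathy flags: how lessons are validated, consolidated over time, and eventually distilled into weights rather than accumulating in the prompt.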

“RL will lead to more gains because when done well, it is a lot more leveraged, bitter-lesson-pilled, and superior to SFT. It doesn’t feel like the full story, especially as rollout lengths continue to expand. There are more S curves to find beyond, possibly specific to LLMs and without analogues in game/robotics-like environments, which is exciting,” he concluded.

Karpathy seemed to be saying that while RL will produce improvements in the short term, it will need to be complemented by mechanisms that extract explicit lessons from each rollout rather than relying on a single scalar reward at the end, bringing models closer to how humans improve at tasks. RL has proven enormously successful in some environments, such as chess and Go, but Karpathy argues it will need to be supplemented to keep improving LLMs. Others have also said that new approaches are required for AI progress to continue: Meta’s Chief AI Scientist Yann LeCun says that he’s no longer interested in LLMs because they aren’t the path to AGI and ASI. And with RL also showing some deficiencies, AI researchers might need to come up with new techniques and paradigms to realize AI’s full potential.
