Ilya Sutskever On Why The Economic Impact Of AI Models Is Currently Lagging Their Performance On Benchmarks

Former OpenAI Chief Scientist Ilya Sutskever has emerged from his media hiatus, and he has an interesting theory on why AI's current economic impact isn't nearly as large as these models' performance on evaluation benchmarks would suggest.

Speaking on the Dwarkesh Podcast, Sutskever—who departed OpenAI in May 2024 and recently founded Safe Superintelligence Inc.—tackled one of the most puzzling contradictions in artificial intelligence today: the disconnect between impressive benchmark scores and underwhelming real-world results. His insights offer a rare glimpse into the technical limitations that may be preventing AI from delivering on its enormous promise, despite what appear to be breakthrough capabilities on paper.

The Benchmark Paradox

“This is one of the very confusing things about the models right now,” Sutskever explained. “How to reconcile the fact that they are doing so well on evals? You look at the evals and you go, ‘Those are pretty hard evals.’ They are doing so well. But the economic impact seems to be dramatically behind. It’s very difficult to make sense of how can the model, on the one hand, do these amazing things, and then on the other hand, repeat itself twice in some situation?”

He offered a concrete example that many developers will recognize: “Let’s say you use vibe coding to do something. You go to some place and then you get a bug. Then you tell the model, ‘Can you please fix the bug?’ And the model says, ‘Oh my God, you’re so right. I have a bug. Let me go fix that.’ And it introduces a second bug. Then you tell it, ‘You have this new second bug,’ and it tells you, ‘Oh my God, how could I have done it? You’re so right again,’ and brings back the first bug, and you can alternate between those. How is that possible? I’m not sure, but it does suggest that something strange is going on.”
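To make that pattern concrete, here is a toy Python sketch, not taken from the podcast and purely hypothetical, of two mutually exclusive off-by-one bugs of the kind an assistant could flip-flop between: each "fix" for one bug reintroduces the other.

```python
# Toy illustration of the flip-flop Sutskever describes: the task is to return
# the first n items of a list, and each "fix" trades one off-by-one bug for
# the other. These functions are hypothetical examples, not real model output.

def take_first_n_buggy_a(items, n):
    # Bug A: drops the last requested item.
    return items[: n - 1]

def take_first_n_buggy_b(items, n):
    # The "fix" for Bug A: now returns one item too many (Bug B).
    return items[: n + 1]

if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(take_first_n_buggy_a(data, 3))  # [1, 2]        -- one item short
    print(take_first_n_buggy_b(data, 3))  # [1, 2, 3, 4]  -- one item too many
    # A second round of "fixing" Bug B can plausibly land back on the first
    # version, and the conversation alternates between the two indefinitely.
```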

Two Possible Explanations

Sutskever proposed two potential explanations for this phenomenon. “The more whimsical explanation is that maybe RL training makes the models a little too single-minded and narrowly focused, a little bit too unaware, even though it also makes them aware in some other ways. Because of this, they can’t do basic things.”

But his second explanation cuts to the heart of how modern AI systems are trained. “Back when people were doing pre-training, the question of what data to train on was answered, because that answer was everything. When you do pre-training, you need all the data. So you don’t have to think if it’s going to be this data or that data.”

The problem, he suggested, emerges during reinforcement learning: “But when people do RL training, they do need to think. They say, ‘Okay, we want to have this kind of RL training for this thing and that kind of RL training for that thing.’ From what I hear, all the companies have teams that just produce new RL environments and just add it to the training mix. The question is, well, what are those? There are so many degrees of freedom. There is such a huge variety of RL environments you could produce.”
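To see why those degrees of freedom matter, here is a minimal, purely illustrative sketch of what an RL environment for post-training might look like; the interface and names are assumptions made for this article, not any lab's actual API.

```python
# Hypothetical sketch of an "RL environment" for LLM post-training. Each
# environment bundles a choice of tasks and a choice of reward, and the
# training mix assigns each one a weight -- none of these choices exist in
# pre-training, where the answer to "what data?" was simply "everything".

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class RLEnvironment:
    name: str                            # e.g. "unit_test_fixing", "math_proofs"
    sample_task: Callable[[], str]       # which prompts the model actually sees
    reward: Callable[[str, str], float]  # (task, model_output) -> score

def make_training_mix(
    envs: List[RLEnvironment], weights: List[float]
) -> List[Tuple[RLEnvironment, float]]:
    # The mix itself is another free parameter: which environments, how much.
    assert len(envs) == len(weights) and abs(sum(weights) - 1.0) < 1e-9
    return list(zip(envs, weights))
```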

The Eval Optimization Problem

Here’s where Sutskever’s theory becomes particularly intriguing. He suspects that AI companies may be inadvertently training their models specifically for benchmark performance rather than real-world utility:

“One thing you could do, and I think this is something that is done inadvertently, is that people take inspiration from the evals. You say, ‘Hey, I would love our model to do really well when we release it. I want the evals to look great. What would be RL training that could help on this task?’ I think that is something that happens, and it could explain a lot of what’s going on.”
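As a purely hypothetical illustration of how that inadvertent eval-chasing could show up in practice, imagine a training mix whose weights drift toward benchmark-shaped tasks; every name and number below is invented for this example.

```python
# Invented example of an eval-skewed training mix: most of the weight sits on
# environments that resemble public benchmarks, while the messier tasks that
# dominate real-world work are underrepresented.

eval_skewed_mix = {
    "contest_math_problems": 0.35,          # mirrors popular math evals
    "self_contained_coding_puzzles": 0.35,  # mirrors coding leaderboards
    "long_messy_refactor": 0.10,            # closer to actual engineering work
    "ambiguous_customer_request": 0.10,
    "multi_session_debugging": 0.10,
}

assert abs(sum(eval_skewed_mix.values()) - 1.0) < 1e-9
```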

His conclusion is sobering: “If you combine this with generalization of the models actually being inadequate, that has the potential to explain a lot of what we are seeing, this disconnect between eval performance and actual real-world performance, which is something that we don’t today even understand, what we mean by that.”

Implications for the Industry

Sutskever’s observations arrive at a critical moment for the AI industry. Despite eye-popping valuations and breathless predictions of economic transformation, many businesses are still struggling to extract meaningful value from AI systems. Recent research has highlighted similar concerns about the gap between AI capabilities in controlled settings versus messy real-world applications.

The theory that models might be “overfitting” to benchmarks rather than developing robust, generalizable intelligence has profound implications. If true, it suggests that the path to artificial general intelligence may be longer and more complex than current benchmark results would indicate. It also raises questions about how the industry should evaluate progress—and whether the metrics we’re currently using are creating perverse incentives that prioritize impressive demos over practical utility.

For businesses considering major AI investments, Sutskever’s insights serve as a cautionary tale: benchmark performance may not translate to business value as directly as hoped. The industry may need to develop entirely new evaluation frameworks that better capture real-world performance—and AI labs may need to fundamentally rethink their reinforcement learning strategies to build models that can handle the messy, unpredictable nature of actual human needs.

Coming from one of the architects of modern AI, Sutskever's willingness to acknowledge these limitations publicly is both refreshing and concerning. It suggests that even at the highest levels of AI research, the gap between what we think these systems can do and what they actually accomplish remains poorly understood, a humbling reminder that the AI revolution may still have significant technical hurdles to overcome before it can deliver on its economic promise.
