The top AI labs compete furiously to post the best results on standard benchmarks, but they are leaving an important factor out of their calculations: the cost required to achieve those results.
This crucial point was recently raised by Noam Brown, a research scientist at OpenAI, known for his groundbreaking work on AI systems for complex games like poker and Diplomacy. In a recent discussion, Brown argued that the traditional way of evaluating AI models—simply looking at their performance on a benchmark—is becoming obsolete, especially with the rise of increasingly powerful reasoning models.

“The notion of model intelligence — performance on a benchmark as a single number doesn’t really even make sense anymore,” Brown explains. He continues, arguing that a more nuanced approach is required: “You have to think of it as intelligence per dollar or per token or something like that.” This cost-centric perspective is crucial, he argues, because compute time directly impacts performance. “If a model can think for a very long time,” Brown observes, “it’s going to do better on all these benchmarks.” This leads to a fundamental shift in how we should visualize AI progress, moving away from single data points and toward a more comprehensive understanding: “So you really have to think of it as a curve of intelligence versus cost curve.” And this curve, he emphasizes, “can be very steep… can be very high if you want to spend a lot.” This, according to Brown, is the trajectory of the future of AI development.
But Brown says that even with their higher costs, reasoning models are much cheaper than humans performing equivalent tasks. “And that’s kind of the future that we’re headed towards, I think. When people look at these reasoning models and they think, ‘Oh, this thing is so expensive!’ Well, compared to what? You know, if you’re comparing it to GPT-4, then sure, it’s very expensive. But if you compare it to a human trying to do the same test, then it’s dirt cheap,” he says. This comparison, Brown asserts, is the one that truly matters, especially as AI capabilities continue to grow. “And that comparison to a human matters as the intelligence grows,” he continues. “Once you have these models surpassing top humans in certain domains, you know, you think about how much the top human in the world would be paid to do a task. They command a big premium for that expertise.”
“And when you have the models now having that expertise and they’re the fraction of the cost of a human, there’s a lot of value in that,” Brown says.
In the recent past, there have been some impressive improvements on many AI benchmarks. On the ARC-AGI benchmark, for instance, OpenAI’s o3 model performed far better than previous approaches. But the model also consumed an enormous number of tokens reasoning through its answers, at a reported cost of as much as $3,000 per task, making it far more expensive than earlier AI approaches. Brown suggests that instead of focusing simply on benchmark results, evaluations should also account for the costs required to generate those results. With reasoning models now becoming commonplace, this holistic approach could give a better estimate of future AI progress.
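Brown’s “intelligence versus cost” framing can be sketched as a simple normalization: divide a benchmark score by the dollar cost of producing it. The sketch below is illustrative only; the model names and scores are hypothetical, and the $3,000 figure merely echoes the per-task cost reported for o3 on ARC-AGI rather than any official pricing.

```python
# A minimal sketch of cost-adjusted benchmark comparison.
# All model names, scores, and costs below are hypothetical.

def score_per_dollar(score: float, cost_per_task_usd: float) -> float:
    """Benchmark score normalized by the cost of producing it."""
    return score / cost_per_task_usd

# Hypothetical entries: (name, benchmark score in %, cost per task in USD)
models = [
    ("baseline-model",         30.0,    0.05),
    ("reasoning-low-compute",  60.0,   20.0),
    ("reasoning-high-compute", 85.0, 3000.0),
]

for name, score, cost in models:
    print(f"{name}: {score:.1f}% at ${cost:,.2f}/task "
          f"-> {score_per_dollar(score, cost):.4f} points per dollar")
```

Plotting many such (cost, score) pairs for a single model at different amounts of thinking time traces out the curve Brown describes; the same comparison against a human expert’s hourly rate is what makes even a $3,000-per-task model look cheap in some domains.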