Chinese AI Models Are Optimized For Benchmarks Instead Of Real-World Use: Anthropic CEO Dario Amodei

Anthropic has publicly said that DeepSeek, Moonshot AI and Minimax “stole” Claude’s capabilities by distilling its answers, and now it’s saying that their impressive performance on benchmarks is gamed too.

Dario Amodei made the remarks on the Nikhil Kamath podcast, in what amounts to one of the more pointed public critiques of Chinese AI models from a sitting US lab CEO. The comments touch on benchmark integrity, the economics of AI adoption, and why Amodei believes raw capability — not price — is what ultimately wins in the market. It’s a revealing window into how Anthropic thinks about competition, even if Amodei’s position as a direct competitor gives reason to weigh his words carefully.


“A lot of these models, particularly the ones that come from China, are optimized for benchmarks and are distilled from the big US labs,” Amodei said. He pointed to a specific example to back the claim: “There was a test recently where some of these models scored very highly on the usual SWE benchmarks, the usual software engineering benchmarks. But then when someone made a held-back benchmark — one that had not been publicly measured — the models did a lot worse on that.”

The implication is straightforward: if a model has been trained or fine-tuned on data that includes answers to widely used benchmarks, it will naturally score well on those tests without necessarily being better at novel problems. “Those models are optimized for benchmarks much more than for real-world use,” Amodei said.

He then pivoted to a broader argument about the economics of AI adoption. “I think the economics of the models are very different than any previous technology. What we find is that there is a very strong preference for quality.” To illustrate this, he reached for an employment analogy: “It’s a bit like human employees. If I said to you, you can hire the best programmer in the world, or the 10,000th best programmer in the world — they’re both very skilled. But I think anyone who’s hired a large number of people has this intuition that there’s a power law, long-tail distribution of ability.”

The same dynamic, he argues, plays out in AI. “Within a range, price doesn’t matter that much. If a model is the best model, the most cognitively capable model, price doesn’t matter much. The forum in which it’s presented doesn’t matter much. So I’m focused almost entirely just on having the smartest model and the best model for the task.”

It’s worth noting the context in which Amodei is making these arguments. Chinese models — DeepSeek’s R1 in particular — sent shockwaves through the AI industry earlier this year, with claims of frontier-level performance at a fraction of the training and inference cost of US counterparts. That narrative directly threatens the premium pricing model that Anthropic and OpenAI depend on. Real-world adoption numbers add further pressure: usage of Chinese open-source models on platforms like OpenRouter has been climbing sharply, suggesting that developers aren’t just benchmarking these models but actually deploying them. Amodei is also a known China hawk: he has previously called for export controls on chips to China, and has recently said it’s crazy that the US is continuing to export them. Amodei is, in other words, incentivized to make exactly this argument. That doesn’t make him wrong, but it does mean the claim deserves independent scrutiny.

The benchmark contamination concern is, however, legitimate and widely shared. Researchers have repeatedly flagged that as benchmarks become public, they inevitably find their way into training data — a problem that affects US labs too, not just Chinese ones. The more meaningful question is whether the specific held-back benchmark Amodei references, and others like it, consistently show the same performance gap. If they do, it would lend real weight to the argument that some of the most dramatic benchmark results from Chinese labs are more a reflection of optimization than genuine capability. That’s a distinction the industry will need cleaner, more systematic evidence to settle — and it matters enormously for any company or developer deciding which model to build on.
