The AI race is close, with many companies vying for users’ attention, but developers seem to largely agree that Anthropic has released some of the best coding models over the last year or so. And Anthropic’s co-founder Tom Brown has now revealed why this is the case.
Anthropic co-founder Tom Brown says their models are more popular for coding than rivals' because Anthropic doesn't focus on external benchmarks, which he implied can be gamed. Instead, the company judges its models against its own internal benchmarks. Brown also said that Anthropic's teams use their models internally to write code, which helps them build better models.

“If you poll the YC founders, they prefer using Anthropic’s models for coding by like a huge margin,” Brown was told on a podcast. “(The preference) is much larger than what you would predict if you just looked at the benchmark results. So there seems to be some X factor that makes people really like these models for coding. Do you know what it is, and is it intentional in some way, or did it just come out of the black box somehow?” the interviewer asked.
“I think that the benchmarks are easy to game,” Brown replied. “I think that all the other big labs have teams whose whole job is to make the benchmark scores good. We don’t have such a team. And so I think that is probably the biggest factor there,” he added.
“We don’t teach to the test,” Brown continued. “Because I do feel like if you start doing that, then it has weird incentives. Maybe we could like put that team under marketing or something, and then ignore all the benchmarks. But I think that that’s one reason why there’s some train-test mismatch there,” he said.
Brown said that Anthropic doesn’t ignore benchmarks entirely; it simply keeps its benchmarks private and focuses on those. “We have internal benchmarks. But we don’t, we don’t publish them. We have internal benchmarks that the team focuses on improving, and then we also have a bunch of tasks — I think that accelerating our own engineers is like a top, top priority for us too. And so we do a ton of like dogfooding there to make sure that it’s helping with our folks too,” Brown added.
There are several AI coding benchmarks, including SWE-bench, LiveCodeBench and Aider Polyglot, which test models on a variety of coding tasks. Each time a model is released, companies highlight these benchmark results, and the AI community tracks them closely; developers try out new models if they perform well on these benchmarks. But Brown says that companies are now optimizing their models to score well on the benchmarks, which means that while their benchmark scores are high, the models don't perform nearly as well in the real world. Anthropic, he says, avoids this by focusing on its private internal benchmarks rather than external ones, and by trying to make its models as useful as possible for its own engineers. It's an interesting strategy, and so far it seems to have served Anthropic well: in a competitive field, its coding models are clearly some of the most popular out there.