The AI world is moving so fast that hard-fought leads in some benchmarks aren’t even lasting an hour.
Earlier today, Anthropic released Claude Opus 4.6, which scored 65.4% on the Terminal-Bench 2.0 benchmark. Terminal-Bench 2.0 is an 89-task benchmark suite designed to evaluate how well AI agents handle complex, real-world, multi-step tasks in containerized terminal environments. Anthropic chose to show this benchmark at the very top of its results: agentic coding is taking off in a big way, and Claude Opus 4.6 had done particularly well, with its 65.4% edging out GPT 5.2's 64.7% and Gemini 3 Pro's 56.2%. It was the best score ever achieved by an AI model on this particular benchmark.
The record didn’t even last half an hour.

Exactly 27 minutes later, OpenAI launched GPT-5.3-Codex, which scored 77.3% on the benchmark, leaving Claude Opus 4.6's Terminal-Bench 2.0 performance in the dust.
Now, Opus is a general-purpose model while GPT-5.3-Codex is optimized for coding, so the result is perhaps expected. Still, for a state-of-the-art result to be eclipsed in less than half an hour shows the frenetic pace of AI development today. The frontier labs are constantly iterating on their models, and leads are hard to achieve and even harder to maintain. A few months ago, Gemini 3.0 had smashed all benchmarks, but it was soon upstaged by Claude Opus 4.5, which was in turn largely eclipsed by OpenAI's GPT 5.2. Claude Opus 4.6, released today, was state-of-the-art on many benchmarks, but OpenAI upstaged it with a new release immediately after.

There seems to be no love lost between Anthropic and OpenAI at the moment. Anthropic had hit OpenAI where it hurts by highlighting and mocking its plans to introduce ads into its platform, and OpenAI CEO Sam Altman had shot back. Now, by stealing some of the thunder from Anthropic's new model launch with a release of its own immediately after, OpenAI has shown it isn't backing down from the fight.