The top US frontier labs are in an effective dead heat on the intelligence indexes.
Artificial Analysis has released its latest Intelligence Index rankings, and for the first time in the benchmark’s history, three labs (Anthropic, Google, and OpenAI) share the top spot. Claude Opus 4.7 scores 57 on the Artificial Analysis Intelligence Index v4.0, statistically level with Gemini 3.1 Pro (57.2) and GPT-5.4 (56.8): all three sit within the benchmark’s stated confidence interval of ±1 point. The gap separating the world’s most capable AI models is now effectively noise.

Anthropic Leads on Agentic Work, Others Take Different Categories
Each lab dominates a distinct slice of the benchmark. Anthropic leads on real-world agentic performance, topping GDPval-AA, a primary measure of general agentic capability across 44 occupations and 9 major industries. Opus 4.7 scored 1,753 Elo on GDPval-AA, around 79 points clear of Claude Sonnet 4.6 and GPT-5.4, and 134 points ahead of Opus 4.6. Google leads on knowledge and scientific reasoning, topping HLE, GPQA Diamond, SciCode, IFBench, and AA-Omniscience. OpenAI leads on long-horizon coding, frontier physics, and long-context reasoning, topping TerminalBench Hard, CritPt, and AA-LCR.
This division of the leaderboard tells a story about where each lab has placed its bets. Anthropic’s edge in agentic work is consistent with a broader product push — Claude Code now accounts for roughly 4% of all public GitHub commits, and 16 parallel Claude instances autonomously built a C compiler from scratch in two weeks.
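Those Elo margins can be made concrete. Assuming GDPval-AA’s ratings follow the standard Elo logistic (base 10, 400-point scale — an assumption about the scale, not something Artificial Analysis has confirmed here), a rating gap converts directly into an expected head-to-head win rate:

```python
def elo_win_prob(delta: float) -> float:
    """Expected head-to-head win rate for a rating advantage of `delta`
    points, under the standard Elo logistic (base 10, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** (-delta / 400.0))

# Margins reported for Opus 4.7 on GDPval-AA (scale is assumed):
print(round(elo_win_prob(79), 3))   # vs. Sonnet 4.6 / GPT-5.4 → 0.612
print(round(elo_win_prob(134), 3))  # vs. Opus 4.6 → 0.684
```

Under that assumption, a 79-point lead works out to winning roughly 61% of head-to-head comparisons, and 134 points to roughly 68%: a real edge, but far from a rout.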
Hallucination Rate Falls Sharply
One of the more notable results in the release is Opus 4.7’s performance on AA-Omniscience, Artificial Analysis’s benchmark for knowledge reliability. The model scores 26 on the index, up 12 points from Opus 4.6’s 14, placing it second overall behind Gemini 3.1 Pro (33). The improvement is driven almost entirely by a sharp drop in hallucination rate — from 61% on Opus 4.6 to 36% on Opus 4.7 — while accuracy remained roughly flat. Opus 4.7 achieves this by abstaining more frequently, with its attempt rate falling from 82% to 70%. The model is declining to answer when it doesn’t know, rather than guessing.
This is a meaningful change in behavior. A model that knows when to say “I don’t know” is more useful in production than one that confidently fabricates. For enterprise use cases where factual accuracy matters — legal, finance, healthcare — the drop in hallucination rate is as significant as any benchmark point.
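The arithmetic behind that tradeoff is worth spelling out. Under a scoring rule that penalizes a wrong answer as heavily as it rewards a right one (the symmetric design AA-Omniscience is built around; the specific numbers below are invented for illustration), guessing on unknown questions is a losing bet:

```python
def omniscience_style_score(correct: float, incorrect: float) -> float:
    """Score = fraction correct minus fraction incorrect; abstentions
    count zero. (Assumed symmetric-penalty rule, in the spirit of
    AA-Omniscience; not the benchmark's exact formula.)"""
    return correct - incorrect

# Hypothetical model: knows 40% of answers cold; on the remainder,
# a blind guess lands only 10% of the time.
known, unknown_guess_acc = 0.40, 0.10

# Policy A: answer everything, guessing when unsure.
correct_a = known + (1 - known) * unknown_guess_acc   # 0.46
incorrect_a = (1 - known) * (1 - unknown_guess_acc)   # 0.54

# Policy B: abstain whenever unsure.
correct_b, incorrect_b = known, 0.0

print(round(omniscience_style_score(correct_a, incorrect_a), 2))  # -0.08
print(round(omniscience_style_score(correct_b, incorrect_b), 2))  # 0.4
```

With those toy numbers, answering everything scores negative while abstaining when unsure scores 0.40 — the incentive a lower attempt rate appears to be responding to.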
More Efficient, Same Price
Opus 4.7 used approximately 35% fewer output tokens than Opus 4.6 to complete the Intelligence Index (102M vs. 157M), despite scoring 4 points higher. The cost to run the full index fell from ~$4,970 to ~$4,406 — an 11% reduction, even accounting for the model’s new tokenizer. Pricing to end users remains unchanged at $5/$25 per million input/output tokens, identical to Opus 4.6 and Opus 4.5.
Getting more intelligence for less compute is the outcome every frontier lab is chasing. Opus 4.7 appears to have moved the efficiency curve in Anthropic’s favor, at least for now.

New API Features
Anthropic has made several API changes alongside the release. Opus 4.7 introduces an ‘xhigh’ reasoning effort setting, sitting between ‘high’ and ‘max’ — giving developers finer control over the cost-capability tradeoff. The full range is now low, medium, high, xhigh, and max.
The model also introduces task budgets in public beta — an advisory token budget that covers the entire agentic loop (thinking, tool calls, tool results, and output). The model receives a running countdown and uses it to prioritize and wrap up work as the budget depletes. Extended thinking has been fully removed; adaptive reasoning is now the only reasoning setting.
Context window remains at 1M tokens and max output at 128K tokens, unchanged from Opus 4.6. The model is available via Anthropic’s API, Amazon Bedrock, Microsoft Azure, and Google Vertex, as well as Claude App, Claude Code, and Claude Cowork.
Context: A Rapidly Moving Company
The Opus 4.7 release comes as Anthropic has been expanding aggressively beyond model releases. The company launched Claude Design this week, a tool for creating designs and prototypes that sent Figma’s stock down 7%. Claude’s traffic has grown roughly 5x over the past year, eight of the Fortune 10 are now customers, and the company raised $30 billion at a $380 billion valuation in February. The benchmark numbers are one measure of progress. The product expansion suggests Anthropic is trying to capture value from those numbers before the lead — such as it is — narrows further.
The three-way tie at the top of the Intelligence Index may be the new normal. With each lab leading on different task types and margins within the margin of error, the next frontier may not be who scores highest, but who can deliver those capabilities most reliably, cheaply, and at scale.