Anthropic’s Claude Mythos still hasn’t been released to general users, but the company has now shipped an upgrade to its previous publicly available model.
Claude Opus 4.7 is here, and it’s a meaningful step up from Opus 4.6, the model that already topped many benchmarks when it launched in February 2026. The new release improves long-running agentic tasks, instruction following, and vision, and gives API developers finer-grained control over reasoning and cost. Claude Opus 4.7 has the same pricing as Claude Opus 4.6 ($5/$25 per million tokens).
Claude Opus 4.7 Benchmarks

On the numbers, Opus 4.7 leads GPT-5.4 and Gemini 3.1 Pro across most key tests:
- Agentic coding (SWE-bench Pro): Opus 4.7 hits 64.3%, up from 53.4% on Opus 4.6. GPT-5.4 scores 57.7%, Gemini 3.1 Pro 54.2%.
- Agentic coding (SWE-bench Verified): 87.6% for Opus 4.7, versus 80.8% for Opus 4.6 and 80.6% for Gemini 3.1 Pro. GPT-5.4 has no comparable score listed.
- Graduate-level reasoning (GPQA Diamond): 94.2% for Opus 4.7, narrowly trailing GPT-5.4 Pro (94.4%) and Gemini 3.1 Pro (94.3%).
- Scaled tool use (MCP-Atlas): Opus 4.7 leads at 77.3%, ahead of Opus 4.6 (75.8%), GPT-5.4 (68.1%), and Gemini 3.1 Pro (73.9%).
- Multilingual Q&A (MMMLU): 91.5% for Opus 4.7, versus 91.1% for Opus 4.6 and 92.6% for Gemini 3.1 Pro.
GPT-5.4 Pro does outperform Opus 4.7 on agentic search (BrowseComp), scoring 89.3% to Opus 4.7’s 79.3%, though that benchmark has had its own credibility questions since Opus 4.6 was caught decrypting the answer key during evaluation runs.
The model best positioned to beat Opus 4.7 everywhere is Anthropic’s own Claude Mythos Preview, which isn’t publicly available and exists only for a closed group of security and enterprise partners.
What’s New
Opus 4.7 is designed to handle longer, less supervised tasks: verifying its own outputs before reporting back and following instructions with more precision. The pitch is a model you can hand off genuinely hard work to without watching every step. Anthropic also says Opus 4.7 sees images at more than three times the resolution of Opus 4.6, with practical downstream effects: the model generates higher-quality interfaces, slides, and documents, which matters for workflows that involve processing or creating visual content.
A new xhigh effort level slots between high and max, giving developers finer control over the reasoning-latency tradeoff on difficult problems. Task budgets — currently in beta — let Claude prioritize work and manage costs across longer runs. For teams that have been leaning heavily on Claude for autonomous coding workflows, this kind of cost visibility matters.
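For developers wondering what this looks like in practice, here is a minimal sketch of requesting the new effort level and a task budget through the Anthropic Messages API. The announcement doesn’t specify the request shape, so the model ID, the "effort" field, and the "task_budget" field are all assumptions for illustration, passed through the SDK’s extra_body escape hatch rather than presented as documented parameters.

```python
# Hedged sketch: asking for the "xhigh" effort level and a task budget.
# The model ID and the "effort"/"task_budget" fields are assumptions based
# on the announcement, not documented SDK parameters; task budgets are in beta.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model ID
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": "Refactor this module and verify the tests still pass.",
        }
    ],
    # Sent via extra_body because these fields are not part of the SDK's
    # typed signature; both names are assumptions.
    extra_body={
        "effort": "xhigh",  # slots between "high" and "max"
        "task_budget": {"max_cost_usd": 5.00},  # hypothetical beta field
    },
)

print(response.content[0].text)
```

Routing the speculative fields through extra_body keeps the example honest about what is and isn’t part of the official SDK surface; once Anthropic documents the final parameter names, they would move into the typed call.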
The new /ultrareview command runs a dedicated review session that flags issues a careful human reviewer would catch — complementing the AI-powered Code Review feature Anthropic introduced earlier this year. Auto mode is also now available to Max plan users, reducing interruptions on longer tasks.
Context
The Opus 4.7 release comes at a moment when Anthropic is running at a pace few anticipated. Claude’s traffic has grown roughly 5x over the past year, the company raised $30 billion at a $380 billion valuation in February, and enterprise adoption has accelerated sharply. Eight of the Fortune 10 are now Claude customers.
The competitive picture remains complex. GPT-5.4 trades blows with Opus 4.7 depending on the task, and Gemini 3.1 Pro holds its own on multilingual benchmarks. But in aggregate, particularly for agentic and coding workloads where Claude has historically led, Opus 4.7 extends the gap rather than ceding ground.
The unreleased Mythos Preview is a separate story entirely. Its 77.8% on SWE-bench Pro — versus Opus 4.7’s 64.3% — suggests Anthropic has headroom it isn’t yet shipping. For now, Opus 4.7 is what enterprise buyers and developers actually get, and on that basis, it leads the field on most of what matters.