Claude Opus 4.8 appears to be the most capable model in the world at the moment.
Anthropic’s latest flagship leads the Artificial Analysis Intelligence Index v4.0 with a score of 61.4, which is 1.2 points above GPT-5.5 (60.2) and 4.1 points above its predecessor, Claude Opus 4.7 (57.3). The index aggregates 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, and CritPt. That breadth makes it one of the most comprehensive AI capability snapshots available.

A Lead Built Across Multiple Domains
The 1.2-point gap over GPT-5.5 may look narrow, but it reflects consistent outperformance across a broad benchmark mix rather than dominance in a single category. According to Anthropic’s own evaluation data, Opus 4.8 leads on agentic coding with a SWE-Bench Pro score of 69.2%, compared with 58.6% for GPT-5.5. On Humanity’s Last Exam, a test of multidisciplinary reasoning, it scores 57.9% with tools, ahead of all rivals. On agentic computer use (OSWorld-Verified), it also ranks first at 83.4%.
Gemini 3.1 Pro Preview sits fourth at 57.2, followed by GPT-5.4 at 56.8. The nearly three-point gap between the top two models and the rest of the field is meaningful. Below them, the mid-tier clusters tightly in the 53–57 range: Qwen3.7 Max, Gemini 3.5 Flash, Kimi K2.6, and MiMo-V2.5-Pro. That clustering underscores how competitive the broader landscape has become.
The GDPval-AA Signal
Perhaps the most telling result comes from a single component of the composite index. GDPval-AA measures agentic performance on real-world work tasks with web and shell access. On it, Opus 4.8 scores 1890 Elo, 121 points ahead of second-place GPT-5.5. The benchmark is designed to simulate the kind of economically valuable tasks that enterprise deployments face, spanning 44 occupations across 9 major industries.
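To put a 121-point Elo gap in more intuitive terms, the sketch below converts it into an expected head-to-head preference rate. It assumes that GDPval-AA ratings follow the conventional Elo logistic model with a 400-point, base-10 scale. That is an assumption: the article does not say how Artificial Analysis computes its ratings. The function name `elo_expected_score` is illustrative only.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    Assumes the conventional 400-point, base-10 scale; GDPval-AA's actual
    rating method may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


opus_48 = 1890.0
gpt_55 = opus_48 - 121.0  # 121 points behind, per the reported scores
p = elo_expected_score(opus_48, gpt_55)
print(f"Expected head-to-head win rate: {p:.1%}")
```

Under that assumption, the gap works out to roughly a two-in-three expected preference for Opus 4.8 in a pairwise comparison.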

That gap matters. Composite index scores reflect breadth; GDPval-AA is a closer proxy for what the model does when deployed. Anthropic’s consistent edge in agentic work is no accident. Claude Code now accounts for roughly 4% of all public GitHub commits, and the company has publicly demonstrated 16 parallel Claude instances autonomously building a C compiler from scratch.
There is one caveat worth flagging. Opus 4.8 still uses approximately 30% more turns per task than GPT-5.5 to reach its higher scores. For cost-sensitive enterprise deployments running high-volume agentic workflows, that efficiency gap is a real consideration, even though Opus 4.8 leads on output quality.
Price Holds, Speed Improves
What makes this release particularly significant for enterprise buyers is pricing: Opus 4.8 launches at the same price as Opus 4.7. Anthropic is also introducing Fast Mode, which runs the same model at approximately 2.5x the speed for one-third of the standard cost. Developers can activate it in Claude Code via the /fast command.
That combination of higher benchmark performance, unchanged pricing, and a faster low-cost variant is a strong enterprise pitch. The Opus line’s trajectory has been steep. Each generation has arrived faster and scored higher, and each has increasingly targeted agentic use cases over traditional prompt-response tasks.
Where Things Stand
The Artificial Analysis Intelligence Index has shifted quickly. Earlier this year, Anthropic, Google, and OpenAI were effectively tied at the top — scores separated by margins within the benchmark’s stated confidence interval. Opus 4.8 now opens a more definitive gap.
Whether that gap holds depends on what OpenAI and Google ship next, and the crowded mid-tier is improving fast. For now, though, Claude Opus 4.8 sits at the top of one of the most comprehensive public AI rankings available. The numbers across both aggregate and task-specific benchmarks support that position.