Claude Opus 4.8 Beats GPT-5.5 On GDPval-AA Benchmark For Real-World Tasks

Claude Opus 4.8 looks like a strong model on initial evaluations. Anthropic’s latest flagship has debuted at the top of the GDPval-AA leaderboard, a benchmark that measures agentic performance on real-world work tasks using web and shell access. Its Elo score of 1890 puts it 121 points clear of GPT-5.5 in second place.

A Clear Margin At The Top

The GDPval-AA benchmark, developed by Artificial Analysis using its open-source Stirrup harness, is designed to simulate the kind of economically valuable tasks that enterprise deployments actually face. Opus 4.8’s score of 1890 at its ‘max’ effort setting is a 137-point improvement over its predecessor, Opus 4.7, and translates to an implied win rate of approximately 67% in head-to-head comparisons against GPT-5.5 xhigh. That is a meaningful lead, not a rounding error.
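
The 67% figure follows from the 121-point Elo gap if the leaderboard uses the conventional logistic Elo model with a 400-point scale. That scale is an assumption here, since the article does not state the exact formula Artificial Analysis applies, and the function name below is purely illustrative. A minimal Python sketch:

```python
def elo_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win rate of A over B under the standard logistic Elo model.

    The 400-point scale is the conventional Elo default and is an assumption;
    the leaderboard's exact method may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings reported in the article: Opus 4.8 at 1890, GPT-5.5 at 1769 (a 121-point gap).
p = elo_win_probability(1890, 1769)
print(f"{p:.1%}")  # 66.7%
```

Under this model a 121-point gap corresponds to roughly two wins in three, which is consistent with the approximately 67% quoted above.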

GPT-5.5 had itself arrived with fanfare. When OpenAI launched it in April, it led on GDPval-AA with an Elo of 1769, roughly 30 points ahead of Opus 4.7. Opus 4.8 has reversed that decisively.

More Capable, But Still Turn-Heavy

The efficiency story is mixed. Opus 4.8 reaches its higher score in 15% fewer turns per task and with 35% fewer output tokens than Opus 4.7 — a significant internal improvement. But it still uses approximately 30% more turns than GPT-5.5 to complete the same tasks. On a scatter plot of score versus average turns per task, Opus 4.8 sits outside the “most attractive quadrant” — the upper-left zone where high Elo meets low turn count — while GPT-5.5 sits closer to that ideal. For cost-sensitive enterprise deployments, that gap is worth watching.

Claude Sonnet 4.6 (max) ranks fourth overall at 1676 Elo — solid, but well behind the top two. The rest of the leaderboard is tightly packed: DeepSeek V4 Pro, Qwen3.7 Max, and MiMo-V2.5-Pro all cluster in the 1547–1571 range, reflecting how competitive the mid-tier has become.

Context: A Fast-Moving Race

The GDPval benchmark has become one of the more closely watched evaluations precisely because it attempts to measure what AI actually does for businesses, not just what it can do in a lab. The frontier labs have been swapping the top spot on Artificial Analysis’ leaderboards with unusual regularity — GPT-5.5 dethroned Opus 4.7 barely six weeks ago. Opus 4.8 has now returned the favour.

Anthropic provided Artificial Analysis with pre-release access for benchmarking, which has become standard practice among the top labs. The broader Artificial Analysis Intelligence Index results for Opus 4.8 are still in progress.

What It Means

A 67% win rate against the second-ranked model is a strong result by any measure. The question for enterprise buyers is whether that performance premium justifies the higher turn count, and the likely higher inference cost that comes with it, relative to GPT-5.5. For tasks where raw capability is the constraint, Opus 4.8 is currently the answer. For cost-optimised deployments at scale, GPT-5.5’s efficiency advantage may still matter.

The AI benchmark race shows no sign of slowing. Leads that looked durable six weeks ago have already been overturned. Opus 4.8 is at the top today — but the next release from any of the major labs could change that quickly.
