GPT-5.5 Tops Artificial Analysis With Score of 60, Goes Clear of Gemini 3.1 Pro and Claude Opus 4.7

GPT-5.5 is here, and by our measurements it is the most capable model in the world.

GPT-5.5 puts OpenAI back at a clear number one in AI. OpenAI’s new model leads the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie at 57 between GPT-5.4, Gemini 3.1 Pro Preview, and Claude Opus 4.7. That tie was itself notable: as we wrote when GPT-5.4 launched without overtaking Gemini 3.1 Pro outright, it was the first time a new OpenAI model hadn’t seized the top spot on release. GPT-5.5 ends that anomaly decisively.

OpenAI provided pre-release access across all five effort levels — xhigh, high, medium, low, and non-reasoning — giving a clear picture of the intelligence-cost tradeoff at each tier.
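
For context on what “effort levels” means operationally: effort is chosen per request, not baked into the model. Below is a minimal sketch using the OpenAI Responses API’s reasoning-effort parameter; treating “gpt-5.5” as the model id and “xhigh” as an accepted value is our assumption, not a confirmed API detail:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.5",                # hypothetical model id
    reasoning={"effort": "xhigh"},  # assumed tiers: xhigh, high, medium, low
    input="Plan a migration of a monolith to services, step by step.",
)
print(response.output_text)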

Intelligence Index: The Numbers

GPT-5.5 (xhigh) scores 60 on the v4.0 Intelligence Index, which incorporates 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, and CritPt. Claude Opus 4.7 and Gemini 3.1 Pro Preview both sit at 57. Claude Sonnet 4.6 (max) scores 52.

GPT-5.5 leads five headline evaluations outright — Terminal-Bench Hard, GDPval-AA, and the newly hosted APEX-Agents-AA among them. It trails only other OpenAI models on CritPt and AA-LCR, and comes second to Gemini 3.1 Pro Preview on three additional evaluations. The largest gains over its predecessor are on AA-Omniscience (+14 pts), Artificial Analysis’ knowledge and hallucination benchmark, and τ²-Bench Telecom (+7 pts), a customer service agent benchmark.

GDPval-AA: GPT-5.5 (xhigh) leads with an Elo of 1785, roughly 30 points ahead of Claude Opus 4.7 (max) and ~470 points clear of Gemini 3.1 Pro Preview. GDPval evaluates models on real-world economically valuable tasks — the benchmark where Claude Opus 4.6 had previously dominated.
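
How big are those Elo gaps in practice? Assuming GDPval-AA’s ratings follow the standard logistic Elo formula (this piece doesn’t state the matchmaking model), rating differences map to expected head-to-head win rates. A quick sketch using the gaps quoted above:

```python
# Expected win probability under the standard logistic Elo model:
# P(A beats B) = 1 / (1 + 10 ** ((elo_b - elo_a) / 400))
def win_prob(elo_a: float, elo_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

print(f"{win_prob(1785, 1755):.1%}")  # ~54.3%: a narrow edge over Opus 4.7 (max)
print(f"{win_prob(1785, 1315):.1%}")  # ~93.7%: a decisive gap over Gemini 3.1 Pro
```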

Cost: More Expensive Per Token, But Fewer Tokens Used

Per-token pricing has doubled relative to GPT-5.4, to $5/$30 per million input/output tokens. But a ~40% reduction in output tokens largely absorbs the hike, netting a ~20% increase in total cost to run the Intelligence Index. GPT-5.5 (xhigh) is still ~30% cheaper than Claude Opus 4.7 (max) to run the full index.
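
The arithmetic works because output tokens dominate the bill for reasoning models: doubled prices times ~0.6x the output volume nets out to roughly 1.2x. A minimal sketch; the token volumes are hypothetical placeholders (the actual Index totals aren’t given here), and GPT-5.4’s $2.50/$15 pricing is inferred from “doubled”:

```python
def run_cost(input_tok: int, output_tok: int, in_price: float, out_price: float) -> float:
    """Total cost in dollars, with prices quoted per million tokens."""
    return (input_tok * in_price + output_tok * out_price) / 1_000_000

input_tok = 20_000_000            # hypothetical input volume (same for both models)
gpt54_out = 100_000_000           # hypothetical GPT-5.4 output volume
gpt55_out = int(gpt54_out * 0.6)  # ~40% fewer output tokens

gpt54 = run_cost(input_tok, gpt54_out, 2.50, 15.00)  # inferred prior pricing
gpt55 = run_cost(input_tok, gpt55_out, 5.00, 30.00)  # quoted GPT-5.5 pricing
print(f"GPT-5.4: ${gpt54:,.0f}  GPT-5.5: ${gpt55:,.0f}  ({gpt55 / gpt54 - 1:+.0%})")
# GPT-5.4: $1,550  GPT-5.5: $1,900  (+23%), in line with the ~20% above
```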

The effort ladder makes the cost story more interesting:

  • GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) at roughly one-quarter of the cost (~$1,200 vs ~$4,800). Gemini 3.1 Pro Preview matches that score for ~$900.
  • GPT-5.5 (low) approximates Claude Opus 4.7 (non-reasoning, high) at about half the cost to run (~$500 vs ~$1,000).

For buyers who were already weighing Gemini 3.1 Pro’s cost efficiency against raw performance, the effort variants give GPT-5.5 a credible answer at multiple price points.

The Hallucination Caveat

AA-Omniscience is where the story gets complicated. GPT-5.5 (xhigh) posts the highest-ever accuracy on the benchmark at 57%, meaning it can recall facts in the Omniscience corpus more effectively than any other model. But its hallucination rate is 86% — it is more likely than competitors to confidently answer when it doesn’t know. Claude Opus 4.7 (max) sits at 36% hallucination; Gemini 3.1 Pro Preview at 50%. The 14-point Omniscience gain over GPT-5.4 (xhigh) was mostly driven by knowledge recall, with only modest hallucination improvement. For knowledge-intensive enterprise deployments, this is a meaningful caveat alongside the headline score.
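
The two numbers reconcile if hallucination rate is defined as wrong answers as a share of all non-correct responses (wrong plus abstentions). That definition is our assumption here; the sketch below just recombines the figures quoted above:

```python
accuracy = 0.57            # quoted: highest-ever accuracy on AA-Omniscience
hallucination_rate = 0.86  # quoted for GPT-5.5 (xhigh)

# Assumed definition: hallucination_rate = wrong / (wrong + abstained)
non_correct = 1 - accuracy                # 43% of questions not answered correctly
wrong = non_correct * hallucination_rate  # ~37% answered confidently but wrongly
abstained = non_correct - wrong           # ~6% declined to answer
print(f"wrong: {wrong:.0%}, abstained: {abstained:.0%}")
# Record recall, yet the model almost never says "I don't know".
```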

Context: A Fast-Moving Leaderboard

The AI benchmark leaderboard has been reshuffled repeatedly over the past several months. Gemini 3.1 Pro claimed the top spot in February, followed by a three-way tie when GPT-5.4 and Claude Opus 4.7 arrived. Anthropic’s Claude Mythos Preview — still not publicly available — posts strong numbers on coding and reasoning benchmarks, and remains the wildcard. GPT-5.5 is the public frontier leader today. Whether it holds that position for more than a few weeks is a different question.
