OpenAI Releases GPT-5.5, Tops Most AI Benchmarks

OpenAI has delivered a strong performance on benchmarks with its much-anticipated GPT-5.5 model.

The new model tops the Artificial Analysis Intelligence Index with a score of 60, three points ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview, which were both sitting at 57. The release ends a brief but notable period in which OpenAI had failed to claim the top spot outright at launch; GPT-5.4 had only managed a three-way tie.

GPT-5.5’s API pricing will be $5 per 1 million input tokens and $30 per 1 million output tokens, with a 1-million-token context window. This is double the price of GPT-5.4.

Benchmark Performance

On the task-specific benchmarks released by OpenAI, GPT-5.5 leads across the board:

GPT-5.5 Pro leads on BrowseComp (90.1%) and FrontierMath Tier 1–3 (52.4%), suggesting OpenAI’s enhanced variant is the go-to for web research and advanced mathematics.

The agentic and coding results are particularly significant. GDPval — which evaluates models on economically valuable real-world tasks — shows GPT-5.5 pulling nearly 5 points ahead of Claude Opus 4.7. Terminal-Bench 2.0, a measure of command-line task completion, sees an improvement of over 7 points from GPT-5.4. These are the categories enterprise customers tend to weight most heavily.

The Hallucination Problem

The one meaningful caveat is hallucination. On AA-Omniscience, GPT-5.5 posts the highest-ever accuracy at 57%, but carries an 86% hallucination rate: it answers confidently even when it’s wrong. Claude Opus 4.7’s hallucination rate sits at 36%; Gemini 3.1 Pro Preview’s at 50%. For knowledge-intensive deployments in legal, finance, or healthcare, that gap matters.

Pricing and the Effort Ladder

OpenAI raised per-token pricing with GPT-5.5 to $5/$30 per million input/output tokens — double GPT-5.4’s rate. However, a roughly 40% reduction in output token usage keeps the net cost increase to around 20%. More importantly, OpenAI released the model across five effort levels (non-reasoning through xhigh), which creates a flexible cost profile. GPT-5.5 at medium effort matches Claude Opus 4.7 at roughly a quarter of the cost.
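The arithmetic behind the "double the price, but only ~20% more cost" claim hinges on output tokens dominating the bill, as they typically do for reasoning-heavy workloads. A minimal sketch, using hypothetical token counts (the 10k/100k figures are illustrative assumptions, not disclosed numbers):

```python
# Illustrative cost comparison between GPT-5.4 and GPT-5.5.
# Prices per 1M tokens: GPT-5.5 is $5/$30; GPT-5.4 is half that, $2.50/$15.
# Token counts below are assumptions for an output-heavy reasoning task.

def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Dollar cost of one request, given per-1M-token prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# GPT-5.4: 10k input, 100k output tokens.
old_cost = request_cost(10_000, 100_000, 2.50, 15.00)

# GPT-5.5: same input, ~40% fewer output tokens, at doubled prices.
new_cost = request_cost(10_000, 60_000, 5.00, 30.00)

increase = new_cost / old_cost - 1
print(f"GPT-5.4: ${old_cost:.3f}  GPT-5.5: ${new_cost:.3f}  increase: {increase:.0%}")
```

With input tokens a small share of the bill, the doubled output price times 0.6× the output volume works out to roughly a 20% net increase; input-heavy workloads (long-context retrieval, for instance) would land closer to the full 2× price hike.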

A Fast-Moving Leaderboard

The past several months have seen the frontier reshape repeatedly. Gemini 3.1 Pro claimed the top spot in February, followed by a three-way tie as GPT-5.4 and Claude Opus 4.7 arrived. Anthropic’s Claude Mythos Preview — still not publicly available — continues to post strong numbers on coding and reasoning benchmarks and remains a wildcard. For now, GPT-5.5 is the public frontier leader. Given the pace of releases, that may not last long.

Posted in AI