GPT-5.5 Tops ARC-AGI-2 With 85% Score

GPT-5.5 has set a new top score on the ARC-AGI-2 benchmark.

OpenAI’s GPT-5.5, using chain-of-thought reasoning, has achieved a verified score of 85.0% on ARC-AGI-2 at a cost of $1.87 per task, displacing Gemini 3 Deep Think, which previously topped the benchmark at 84.6%, and pulling well clear of Gemini 3.1 Pro (Preview) at around 77%.

The Numbers

GPT-5.5’s ARC-AGI-2 results across compute tiers:

Tier          Score   Cost/Task
Max (xHigh)   85.0%   $1.87
High          83.3%   $1.45
Medium        70.4%   $0.86
Low           33.0%   $0.35
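One way to read the tier figures is as a cost-efficiency curve; a minimal sketch using only the published numbers above (the points-per-dollar metric is an illustration, not an official ARC Prize statistic):

```python
# Cost efficiency across GPT-5.5 compute tiers, using the
# published ARC-AGI-2 figures above: (score %, cost per task $).
tiers = {
    "Max (xHigh)": (85.0, 1.87),
    "High":        (83.3, 1.45),
    "Medium":      (70.4, 0.86),
    "Low":         (33.0, 0.35),
}

for name, (score, cost) in tiers.items():
    # Accuracy points bought per dollar of per-task spend.
    print(f"{name:12s} {score:5.1f}% at ${cost:.2f}/task -> {score / cost:5.1f} pts/$")
```

As the sketch makes visible, the lower tiers buy far more accuracy per dollar; the top tier pays a steep premium for the last couple of points.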

On the older ARC-AGI-1, which is now effectively saturated at the top, GPT-5.5 scores 95.0% at $0.73 per task on the xHigh setting — competitive, though Gemini 3.1 Pro posts 98.0% at just $0.52 per task.

Context

ARC-AGI-2 was designed specifically to resist the brute-force pattern matching that pushed ARC-AGI-1 scores into the high 90s. The benchmark presents novel visual grid puzzles where models must infer transformation rules from a handful of examples and apply them to unseen cases, a test of generalization rather than recall. Early frontier models barely cleared 15% on ARC-AGI-2; that gap has closed fast.

OpenAI’s trajectory on the benchmark tells its own story. GPT-5.2 Pro set a record of 54.2% in late 2025. GPT-5.5 now takes that to 85%, a 30-point jump in a single model generation, and does so at a fraction of the cost — $1.87 per task versus GPT-5.2 Pro’s $15.72.

The ARC Prize team has noted that ARC-AGI-3 evaluations — which shift to interactive, turn-based environments with no instructions and where the best current model scores under 0.4% — are more compute-intensive to run, and full evaluations of GPT-5.5 on that benchmark are pending.

What It Means

With ARC-AGI-2 now approaching saturation — GPT-5.5 and Gemini 3 Deep Think both clearing 84-85% — the benchmark’s designers are likely to accelerate the transition to ARC-AGI-3. The real frontier question is no longer whether models can solve static visual puzzles at high accuracy, but whether they can explore, model, and plan inside novel environments with zero guidance. On that test, every frontier model currently fails near-completely.

For OpenAI, the result re-establishes the company at the top of the most closely watched reasoning benchmark after a period in which Google’s models had held the lead. The cost efficiency gains are also notable: delivering 85% ARC-AGI-2 performance at under $2 per task makes high-end reasoning measurably more accessible.
