Google Gemini 3.1 Pro Doubles Performance Over Gemini 3 Pro On ARC-AGI-2, Tops Benchmark

ARC-AGI-2 was created when ARC-AGI-1 looked all but saturated. Now it appears the successor won't remain unsolved for much longer either.

Google DeepMind’s Gemini 3.1 Pro (Preview) has stormed to the top of both ARC-AGI leaderboards simultaneously, posting a 77.1% score on ARC-AGI-2 and a near-perfect 98.0% on ARC-AGI-1 — results that represent a dramatic leap over its predecessor and reframe the competitive landscape for frontier AI reasoning.

A Benchmark Built to Resist

The ARC-AGI-2 benchmark was purpose-built to outlast the first iteration of the challenge. When ARC-AGI-1 began to look like a solved problem — with models routinely cresting into the high nineties — the ARC Prize team raised the bar significantly, engineering tasks designed to resist brute-force pattern matching and demand more flexible, generalizable reasoning. For much of its early life, ARC-AGI-2 lived up to that ambition, with the best models struggling to clear even 50%. Gemini 3.1 Pro has now shattered that ceiling.
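For context on what these tasks look like, ARC tasks are distributed as small JSON files: a handful of demonstration input/output grid pairs, plus a test input whose output the model must infer. The toy task below is invented for illustration (it is not drawn from the benchmark), but it follows the published format, in which each grid is a 2D array of integers 0-9 standing for colors.

```python
# A toy ARC-style task in the published JSON format (grids invented for illustration).
# The hidden rule in this toy example is "mirror the grid left-to-right".
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 3, 0], [0, 4, 0]], "output": [[0, 3, 3], [0, 4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [0, 6, 7]]}  # expected output: [[0, 0, 5], [7, 6, 0]]
    ],
}

def mirror_lr(grid):
    """Apply the toy rule: reverse each row."""
    return [row[::-1] for row in grid]

# A solver is scored on whether it reproduces the hidden test output exactly.
for pair in toy_task["train"]:
    assert mirror_lr(pair["input"]) == pair["output"]

print(mirror_lr(toy_task["test"][0]["input"]))  # [[0, 0, 5], [7, 6, 0]]
```

Real ARC-AGI-2 tasks follow this same structure but encode rules deliberately chosen so that memorized patterns and brute-force search do not transfer from one task to the next.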

Gemini 3.1 Pro: The ARC-AGI Numbers

On ARC-AGI-2, Gemini 3.1 Pro scores 77.1% at a cost of $0.962 per task, well clear of the next best verified entries on the leaderboard. On ARC-AGI-1, it posts 98.0% at just $0.522 per task — making it both the highest-scoring and among the more cost-efficient performers at the top of that chart.
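For a rough sense of what those per-task prices imply, the sketch below extrapolates the cost of a full evaluation run. The per-task costs are the leaderboard figures quoted above; the task counts are assumptions used purely for illustration, since the article does not state the size of each evaluation set.

```python
# Back-of-the-envelope cost of a full evaluation run at the reported per-task prices.
# Per-task costs are taken from the figures quoted above; the task counts are
# assumed for illustration only.
runs = {
    "ARC-AGI-2": {"cost_per_task": 0.962, "assumed_tasks": 120},
    "ARC-AGI-1": {"cost_per_task": 0.522, "assumed_tasks": 100},
}

for name, run in runs.items():
    total = run["cost_per_task"] * run["assumed_tasks"]
    print(f"{name}: {run['assumed_tasks']} tasks x ${run['cost_per_task']:.3f} ~= ${total:.2f}")
```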

Gemini 3.1 Pro on the ARC-AGI-2 leaderboard

The performance on ARC-AGI-2 is particularly striking in context. The leaderboard shows most frontier models — including GPT-5.2 variants, Grok 4, and various Claude configurations — clustered well below the 60% mark at comparable or higher cost points. Gemini 3 Deep Think, a more computationally intensive offering, reaches around 85% but at significantly greater expense. Gemini 3.1 Pro’s position on the Pareto frontier of performance versus cost is, by the data, unambiguous.

Gemini 3.1 Pro on the ARC-AGI-1 leaderboard

Doubling Down on Efficiency

Google DeepMind has framed Gemini 3.1 Pro’s results explicitly around the Pareto frontier — the line that defines the best achievable performance at each cost level. Across both benchmarks, the model appears to push that frontier outward, delivering scores that no other verified model achieves at sub-dollar-per-task pricing. That framing matters commercially: enterprise and API customers evaluating reasoning models care as much about inference cost as raw capability, and a model that is both smarter and cheaper to run changes procurement calculus meaningfully.
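To make "pushing the Pareto frontier outward" concrete, the sketch below filters a set of (cost, score) points down to those not dominated by any entry that is at least as cheap and at least as accurate. Only the Gemini 3.1 Pro numbers are taken from the figures quoted above; the other entries are placeholder values, not leaderboard data.

```python
# Minimal sketch of a Pareto-frontier filter over (cost, score) points.
# Only the Gemini 3.1 Pro numbers come from the article; the rest are
# placeholder values used purely to show the computation.
entries = [
    {"model": "Gemini 3.1 Pro (Preview)", "cost": 0.962, "score": 77.1},
    {"model": "Hypothetical model A",     "cost": 1.50,  "score": 55.0},
    {"model": "Hypothetical model B",     "cost": 0.40,  "score": 30.0},
    {"model": "Hypothetical model C",     "cost": 5.00,  "score": 85.0},
]

def dominates(a, b):
    """True if `a` is at least as cheap and at least as accurate as `b`, and strictly better in one."""
    return (a["cost"] <= b["cost"] and a["score"] >= b["score"]
            and (a["cost"] < b["cost"] or a["score"] > b["score"]))

frontier = [e for e in entries if not any(dominates(other, e) for other in entries)]
for e in sorted(frontier, key=lambda e: e["cost"]):
    print(f'{e["model"]}: {e["score"]}% at ${e["cost"]:.3f}/task')
```

With these placeholder numbers, the cheapest model, Gemini 3.1 Pro, and the expensive high scorer all survive the filter, which mirrors how a costlier model like Gemini 3 Deep Think can sit on the same frontier at a different price point.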

What It Means for ARC-AGI-2

A 77% score does not mean ARC-AGI-2 is solved; the benchmark designers have consistently argued that human-level performance on these abstract visual reasoning tasks requires scores approaching 100%. But 77% is a qualitative inflection point. It demonstrates that the gap between frontier models and the benchmark’s upper bound is now a matter of refinement rather than fundamental capability.

The leap from ARC-AGI-1 saturation to near-dominance of ARC-AGI-2 has come faster than many in the research community anticipated. If the pattern holds, the question is no longer whether ARC-AGI-2 will fall, but when, and which lab gets there first. Right now, Google holds the lead.
