Google’s Gemini 3 model has topped most benchmarks, but it has done particularly well on one that tracks how close models are to AGI.
Gemini 3, the latest iteration of Google’s flagship AI model, has achieved a remarkable breakthrough on the ARC-AGI benchmark, effectively doubling the previous state-of-the-art performance. Gemini 3 Pro scored 31.11% on ARC-AGI-2, the benchmark’s more challenging semi-private evaluation, while the Gemini 3 Deep Think preview reached an impressive 45.14%.
Understanding the ARC-AGI Benchmark
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a benchmark specifically designed to measure progress toward artificial general intelligence. Unlike traditional AI benchmarks that test memorization or pattern matching on tasks similar to training data, ARC-AGI evaluates a model’s ability to perform novel reasoning and abstract thinking—capabilities considered essential for true AGI.
Created by François Chollet, the creator of the Keras deep learning framework and a former Google researcher, ARC-AGI challenges AI systems with visual reasoning puzzles that require understanding core concepts like objects, counting, and basic physics. These tasks are deliberately designed to be easily solvable by humans but extremely difficult for current AI systems, making them a rigorous test of genuine intelligence rather than sophisticated pattern matching.
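To make the task format concrete, here is a minimal sketch of an ARC-style puzzle. The grids and the transformation rule below are invented for illustration, not taken from the actual benchmark; real ARC-AGI tasks pair a few demonstration grids with a held-out test grid, and the solver must infer the rule from the demonstrations alone.

```python
# Hypothetical ARC-style task: each grid is a list of rows of color codes.
# The (hidden) rule in this made-up example is "mirror the grid left-to-right".
train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]

def mirror(grid):
    """Candidate rule: reflect each row horizontally."""
    return [row[::-1] for row in grid]

# A solver would propose a rule, verify it against every demonstration...
assert all(mirror(inp) == out for inp, out in train_pairs)

# ...and only then apply it to the test input.
test_input = [[0, 5], [5, 0], [0, 5]]
print(mirror(test_input))  # [[5, 0], [0, 5], [5, 0]]
```

The difficulty for AI systems is not executing a known rule like `mirror`, but discovering a novel rule from two or three examples, which is exactly the kind of abstraction the benchmark is designed to probe.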
The benchmark comes in two versions: ARC-AGI-1, the original evaluation set, and ARC-AGI-2, a newer and more challenging semi-private evaluation designed to prevent overfitting and provide a harder test of reasoning capabilities.
Gemini 3’s Breakthrough Performance
Google’s Gemini 3 has demonstrated unprecedented performance on both versions of the benchmark. On ARC-AGI-1, Gemini 3 Pro achieved a 75% success rate at a cost of $0.49 per task, while the Deep Think variant reached an extraordinary 87.5% accuracy, though at a higher cost of $44.26 per task.

The results on the more challenging ARC-AGI-2 benchmark are perhaps even more significant. Gemini 3 Pro’s 31.11% score represents approximately double the performance of previous leading models, accomplished at a relatively modest cost of $0.81 per task. The Deep Think preview version, which appears to employ extended reasoning techniques, pushed this even further to 45.14%, establishing a new high-water mark for AI reasoning capabilities.

François Chollet himself acknowledged the significance of these results, stating: “Gemini 3 scores 31.1% on ARC-AGI-2. Impressive progress.”
The Cost-Performance Tradeoff
The leaderboard data reveals an interesting dynamic in the cost-performance tradeoff. While the Deep Think variant achieves substantially higher scores, it does so at roughly 90 times the cost per task of the standard Pro model, going by the ARC-AGI-1 figures. This suggests Google has implemented sophisticated reasoning techniques—possibly involving extended chain-of-thought processing or iterative problem-solving—that require significantly more compute resources.
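The ratio follows directly from the per-task costs quoted above for ARC-AGI-1:

```python
# Cost-per-task figures quoted above (ARC-AGI-1 leaderboard numbers).
pro_cost = 0.49          # Gemini 3 Pro, dollars per task
deep_think_cost = 44.26  # Gemini 3 Deep Think, dollars per task

ratio = deep_think_cost / pro_cost
print(f"Deep Think costs ~{ratio:.0f}x more per task")  # ~90x
```

In other words, the extra 12.5 percentage points of accuracy on ARC-AGI-1 (87.5% vs. 75%) are bought with nearly two orders of magnitude more spend per task.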
For comparison, other leading models on the benchmark include OpenAI’s o3 preview (at various compute levels), Grok 4, and Claude models, but none have matched Gemini 3’s performance on ARC-AGI-2. The achievement represents not just an incremental improvement but a fundamental leap forward in AI reasoning capabilities.
What This Means for AGI Progress
While even Gemini 3 Deep Think’s 45% success rate on ARC-AGI-2 still falls well short of human-level performance on these reasoning tasks, the roughly 2× improvement over the previous state of the art represents meaningful progress toward artificial general intelligence. The benchmark’s specific focus on novel reasoning—rather than tasks that might benefit from memorization or training-data similarity—makes these gains particularly significant for the broader AI research community.
As the race toward AGI continues to intensify among major AI laboratories, benchmarks like ARC-AGI provide crucial objective measures of progress on the capabilities that matter most: the ability to reason abstractly, adapt to novel situations, and solve problems that weren’t part of training data. Google’s Gemini 3 has set a new standard on this critical dimension of AI capability.