Google Releases Gemini 3 Deep Think, Tops ARC-AGI-2 Benchmark With 84.6%

ARC-AGI-2, an iteration on the original ARC-AGI benchmark that was designed to test for AGI, appears to be close to saturation.

Google DeepMind has unveiled a major upgrade to its Gemini 3 family with the enhanced Gemini 3 Deep Think mode, positioning it as a breakthrough in advanced AI reasoning capabilities. This specialized mode, designed for tackling the most demanding scientific, research, and engineering challenges, posts the highest reported scores on several key benchmarks.

“We’ve upgraded our specialized reasoning mode Gemini 3 Deep Think to help solve modern science, research, and engineering challenges – pushing the frontier of intelligence,” Google DeepMind said on X.

On ARC-AGI-2, a challenging benchmark that emphasizes abstract reasoning, adaptability, and core intelligence rather than memorized patterns, Gemini 3 Deep Think achieves a verified score of 84.6%. That is well ahead of the other models compared, including Google's own Gemini 3 Pro Preview at 31.1%, Claude Opus 4.6 (Thinking Max) at 68.8%, and GPT-5.2 (Thinking xhigh) at 52.9%. The result, verified by the ARC Prize Foundation, highlights substantial progress toward saturating this once-formidable test of general intelligence.

Gemini 3 Deep Think’s results were verified by the ARC Prize Foundation. “New SOTA result on ARC-AGI 2,” it posted on X. “Gemini 3 Deep Think (2/26) Semi Private Eval – ARC-AGI-1: 96.0%, $7.17/task – ARC-AGI-2: 84.6% $13.62/task,” it added.

On the original ARC-AGI-1 benchmark, Gemini 3 Deep Think did even better, scoring 96% and all but saturating the benchmark.

In academic reasoning, Gemini 3 Deep Think scores 48.4% on Humanity’s Last Exam (no tools), surpassing Gemini 3 Pro Preview (37.5%), Claude Opus 4.6 (40.0%), and GPT-5.2 (34.5%). This benchmark, often described as one of the toughest evaluations of PhD-level knowledge across disciplines, underscores the model’s potential as a powerful assistant for researchers handling complex, interdisciplinary problems.

For coding and algorithmic prowess, Gemini 3 Deep Think attains an impressive Elo rating of 3455 on Codeforces (no tools), well ahead of Gemini 3 Pro Preview (2512) and Claude Opus 4.6 (2352). This demonstrates elite-level performance in competitive programming, where solving novel, time-constrained algorithmic challenges is essential.

In multimodal understanding, Gemini 3 Deep Think leads on MMMU-Pro with 81.5%, narrowly ahead of Gemini 3 Pro Preview (81.0%) and GPT-5.2 (79.5%), and further ahead of Claude Opus 4.6 (73.9%). This reflects strong capabilities in reasoning across text, images, and other modalities, which is crucial for real-world applications like scientific analysis and engineering design.
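To put the margins in perspective, the reported figures can be tabulated and the gap to the next-best model computed for each benchmark. The short Python sketch below does that using only the numbers quoted in this article; the dictionary layout and the function name lead_over_runner_up are illustrative choices, not anything published by Google DeepMind or the ARC Prize Foundation.

```python
# Scores as quoted in this article. Codeforces is an Elo rating; the rest are
# percentages. Model labels are shortened (e.g. the ARC-AGI-2 entries for
# Claude Opus 4.6 and GPT-5.2 refer to their "Thinking" configurations).
scores = {
    "ARC-AGI-2": {
        "Gemini 3 Deep Think": 84.6,
        "Gemini 3 Pro Preview": 31.1,
        "Claude Opus 4.6": 68.8,
        "GPT-5.2": 52.9,
    },
    "Humanity's Last Exam": {
        "Gemini 3 Deep Think": 48.4,
        "Gemini 3 Pro Preview": 37.5,
        "Claude Opus 4.6": 40.0,
        "GPT-5.2": 34.5,
    },
    "Codeforces (Elo)": {
        "Gemini 3 Deep Think": 3455,
        "Gemini 3 Pro Preview": 2512,
        "Claude Opus 4.6": 2352,
    },
    "MMMU-Pro": {
        "Gemini 3 Deep Think": 81.5,
        "Gemini 3 Pro Preview": 81.0,
        "Claude Opus 4.6": 73.9,
        "GPT-5.2": 79.5,
    },
}


def lead_over_runner_up(results, model="Gemini 3 Deep Think"):
    """Return (runner-up name, margin) for `model` on one benchmark."""
    others = {name: s for name, s in results.items() if name != model}
    runner_up = max(others, key=others.get)
    return runner_up, round(results[model] - others[runner_up], 1)


# Hypothetical helper output: one (runner-up, margin) pair per benchmark.
leads = {bench: lead_over_runner_up(res) for bench, res in scores.items()}

for bench, (runner_up, margin) in leads.items():
    print(f"{bench}: ahead of {runner_up} by {margin}")
```

On these numbers the lead is widest on Codeforces and ARC-AGI-2 and narrowest on MMMU-Pro, where the gap to Gemini 3 Pro Preview is half a percentage point.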

Detailed Gemini 3 Deep Think benchmarks

These results stem from DeepMind’s methodology, which emphasizes enhanced reasoning chains, parallel hypothesis exploration, and inference-time optimizations in Deep Think mode. The mode excels in scenarios requiring deep, iterative thought rather than quick pattern matching.

The rollout targets high-end users and enterprises. Google AI Ultra subscribers can access the upgraded Deep Think directly in the Gemini app. For broader experimentation in research and development, an early access program via Vertex AI is now available, allowing qualified users to integrate the model through the Gemini API.

With Gemini 3 Deep Think, Google DeepMind reinforces its push toward AI systems that not only match but exceed human-level performance in specialized reasoning domains, paving the way for accelerated discovery in science and technology. As benchmarks like ARC-AGI-2 approach saturation, the focus shifts to practical, real-world impact — an area where this release aims to deliver immediate value.