ARC-AGI-3 Released: Gemini 3.1 Pro Takes the Top Score at Just 0.37 Percent

AI models are getting smarter all the time, but it turns out there are still relatively simple tasks they cannot do at all.

The ARC Prize Foundation released ARC-AGI-3 on March 24, 2026 — the third iteration of its benchmark designed to measure the gap between current AI and human-level general intelligence. The headline result is damning for frontier AI: humans solve 100% of the environments; the best AI model scores 0.37%.

What Is the ARC Prize?

The ARC Prize is an annual competition, co-founded by François Chollet and Mike Knoop, that awards prize money to teams that make meaningful progress on the ARC-AGI benchmarks. The 2024 competition offered over $1 million in prizes and drew 1,430 teams. ARC Prize 2026 raises the total prize pool to $2 million, split across an ARC-AGI-3 track and a final-year ARC-AGI-2 track, both hosted on Kaggle. Teams must open-source their solutions to claim prize money.

ARC-AGI-1 and 2: From Durable Benchmarks to Near-Saturation

ARC-AGI-1, introduced in 2019, tested fluid intelligence through grid-based visual puzzles. Each task required inferring a transformation rule from just a few input-output examples — no prior knowledge, no memorization shortcuts. It held up as a meaningful AI benchmark for five years. The first Kaggle competition in 2020 saw the best team manage only ~20% accuracy using brute-force program search.

The real breakthrough came in 2024, when test-time training — the practice of adapting a model at inference time on a specific task — pushed scores to 53.5% on ARC-AGI-1’s private set. OpenAI’s o3 then demonstrated that large reasoning models (LRMs) could exhibit genuine fluid intelligence on this benchmark, not just pattern-matching.

ARC-AGI-2 was introduced in March 2025 with harder, multi-step reasoning tasks. Early on, the best models barely cleared 15%. That gap closed fast: Gemini 3.1 Pro scored 77.1% on ARC-AGI-2 in February 2026, and Gemini 3 Deep Think reached 84.6% — approaching the benchmark’s practical ceiling. ARC-AGI-1, meanwhile, is essentially solved, with Gemini 3.1 Pro scoring 98%.

The benchmark designers also flagged a structural concern: frontier LRMs may be implicitly trained on ARC-AGI data. Evidence emerged from Gemini 3’s reasoning chain, which correctly referenced the integer-to-color mapping used in ARC-AGI tasks without being told what it was — strongly suggesting the benchmark’s data was well-represented in training.

With ARC-AGI-1 and 2 both approaching saturation, a new challenge was needed.

What ARC-AGI-3 Tests — and Why AI Fails at It

ARC-AGI-3 shifts from static puzzles to interactive, turn-based environments. An agent is dropped into a novel environment — a 64×64 color grid — and given no instructions whatsoever. No goal, no rules, no explanation. It must figure everything out: what it can do, what the win condition is, and how to get there efficiently.

The benchmark evaluates four capabilities: exploration (actively gathering information by interacting with the environment), modeling (building a mental map of how the environment works), goal-setting (inferring what to aim for without being told), and planning (executing a strategy and course-correcting as needed).
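The four capabilities above form an agent loop: act, observe, update a world model, and replan, all without being told the goal. A minimal sketch of that loop, using a hypothetical toy environment (ARC-AGI-3's real API and environments differ; `ToyEnv` and its action set are invented for illustration):

```python
class ToyEnv:
    """Hypothetical stand-in: a world where one unknown action wins."""
    ACTIONS = ["up", "down", "left", "right", "interact"]

    def __init__(self, winning_action="interact"):
        self._win = winning_action

    def step(self, action):
        done = action == self._win
        return {"grid_changed": done, "done": done}

def explore(env, max_steps=50):
    world_model = {}  # action -> did it change the grid?
    for step in range(max_steps):
        untried = [a for a in env.ACTIONS if a not in world_model]
        # Exploration: try something new; otherwise exploit what worked.
        action = untried[0] if untried else max(world_model, key=world_model.get)
        obs = env.step(action)
        world_model[action] = obs["grid_changed"]
        if obs["done"]:
            return step + 1  # number of actions taken to clear the level
    return None  # failed to infer the win condition within the budget

print(explore(ToyEnv()))
```

The point of the sketch is structural: the agent is never told the win condition and must recover it from interaction alone, which is exactly the loop current LRMs execute poorly.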

Scoring isn’t binary. ARC-AGI-3 uses a metric called RHAE (Relative Human Action Efficiency), which measures how many actions an AI takes to complete each level relative to the second-best human performance on the same level. The efficiency ratio is squared, so inefficiency is penalized sharply: an AI that takes 10× as many actions as the human baseline scores just 1% for that level, not 10%.
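The squared-efficiency scoring can be sketched as follows. The cap at 1.0 (so an AI more efficient than the human baseline scores 100%) is an assumption based on the article's description, not the official formula:

```python
def rhae_level_score(ai_actions: int, human_actions: int) -> float:
    """Illustrative per-level RHAE: the ratio of the human-baseline
    action count to the AI's action count, squared. Capping the
    ratio at 1.0 is an assumption; the official spec may differ."""
    efficiency = min(human_actions / ai_actions, 1.0)
    return efficiency ** 2

# An AI needing 10x the human baseline's actions scores 1%, not 10%:
print(rhae_level_score(ai_actions=1000, human_actions=100))
```

Squaring the ratio is what makes the penalty nonlinear: a 2× action overhead costs you 75% of the level's score, and a 10× overhead costs 99%.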

Humans clear these environments in a median of 7.4 minutes per session, and the benchmark was human-calibrated so that every included environment was independently solved by at least two participants from the general public. The environments use only Core Knowledge priors — objectness, basic geometry, basic physics — and deliberately avoid language, cultural symbols, or any concept that requires learned knowledge.

This is precisely why current AI struggles. LRMs are powerful at reasoning within domains they’ve been trained on. But ARC-AGI-3 environments are genuinely novel — not just variants of known games — and require an agent to form hypotheses, test them interactively, and revise its world model in real time. That exploratory loop, which humans execute instinctively, is something current models do poorly without being specifically engineered for it.

The benchmark also inverts the public-private data ratio compared to ARC-AGI-2. Only 25 environments are publicly available (down from a 10:1 public-to-private ratio in ARC-AGI-2), while 110 environments are held private — 55 semi-private for API testing, and 55 fully private for the official competition. This makes it much harder to train or overfit toward the benchmark.

The Current Leaderboard

At launch, no frontier model cleared even half a percent on the semi-private set:

Provider     Model                          Score
Google       Gemini 3.1 Pro Preview         0.37%
OpenAI       GPT 5.4 (High)                 0.26%
Anthropic    Opus 4.6 (Max)                 0.25%
xAI          Grok-4.20 (Beta, Reasoning)    0.00%

For context: Gemini 3.1 Pro, the same model that dominates ARC-AGI-2 and tops the Artificial Analysis Intelligence Index, scores 0.37% here. The benchmark’s designers are explicit: this gap is not a matter of prompt tuning or scaffold engineering. In testing, a well-engineered harness targeting specific public environments could push Opus 4.6 to 97.1% on one environment — and 0% on a different one. Harness performance doesn’t generalize, which is why the official leaderboard will only count scores from general-purpose API calls with no task-specific preparation.

The ARC Prize Foundation says ARC-AGI-3 is the only unsaturated general agentic intelligence benchmark as of March 2026. Based on the launch results, it has plenty of runway left.
