OpenAI has smashed the state of the art on the ARC-AGI benchmark, the test behind the ARC Prize, which aims to gauge how close AI models are to artificial general intelligence (AGI).
The company’s latest model, GPT-5.2 Pro (High), achieved a verified score of 54.2% on ARC-AGI-2, the more challenging second iteration of the benchmark, at a cost of $15.72 per task. In comparison, Gemini 3 Deep Think preview had scored 45.1% on the test.
On the original ARC-AGI-1 benchmark, GPT-5.2 Pro (X-High) reached 90.5% accuracy at $11.64 per task—representing a roughly 390-fold efficiency improvement compared to OpenAI’s o3 model from just one year ago, which scored 88% at an estimated $4,500 per task.
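The headline efficiency figure follows directly from the reported per-task costs. A quick back-of-the-envelope check (figures taken from the article; the "390-fold" claim is an approximation):

```python
# Rough efficiency comparison implied by the reported per-task costs.
# Figures are those quoted in the article, not independently verified.
o3_cost_per_task = 4500.00    # o3, estimated cost per task (Dec 2024)
gpt52_cost_per_task = 11.64   # GPT-5.2 Pro (X-High) on ARC-AGI-1

ratio = o3_cost_per_task / gpt52_cost_per_task
print(f"~{ratio:.0f}x cheaper per task")  # ~387x, i.e. roughly 390-fold
```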

The results position OpenAI significantly ahead of competitors, with Google’s Gemini 3 Pro achieving only 31% on ARC-AGI-2, and xAI’s Grok 4 (Refine) reaching approximately 29%.

Understanding the ARC-AGI Prize
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a benchmark designed to measure an AI system’s ability to efficiently acquire new skills—a key component of human-like general intelligence. Unlike traditional AI benchmarks that can be solved through pattern memorization or statistical learning on large datasets, ARC-AGI tests whether models can perform abstract reasoning and generalize to novel problems.
The benchmark presents visual puzzles where AI systems must identify underlying rules and apply them to new situations. Each task involves grid-based patterns where the model must understand the transformation logic from example inputs and outputs, then apply that logic to solve new test cases. This mimics core aspects of human intelligence: the ability to learn from minimal examples and flexibly adapt knowledge to unfamiliar scenarios.
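To make the task format concrete, here is a toy sketch in the spirit of an ARC puzzle (this is an illustrative example, not an actual ARC-AGI task): a solver sees a few input/output grid pairs, must infer the hidden transformation, and then apply it to a held-out test input. In this toy case the hidden rule is simply "mirror the grid left-to-right".

```python
# Toy ARC-style task (illustrative, not from the real benchmark).
# Integers stand for colors; 0 is the background.

def flip_horizontal(grid):
    """Candidate transformation: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

# Example pairs demonstrate the hidden rule on small grids.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0], [0, 4, 0]], [[0, 5, 5], [0, 4, 0]]),
]

# A solver checks a candidate rule against every example pair...
assert all(flip_horizontal(inp) == out for inp, out in train_pairs)

# ...then applies the inferred rule to the unseen test input.
test_input = [[7, 0, 0], [0, 8, 9]]
print(flip_horizontal(test_input))  # [[0, 0, 7], [9, 8, 0]]
```

Real ARC-AGI tasks are far harder: the transformation is unknown, composite, and different for every puzzle, which is why memorization does not help.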
ARC-AGI-2, released as a harder version of the original benchmark, features more complex reasoning challenges designed to push even the most advanced AI systems. The “2025 Grand Prize Efficiency Target” marked on the leaderboards indicates the goal of achieving human-level performance at practical computational costs.
The Efficiency Revolution
Perhaps more remarkable than the raw accuracy improvement is the cost reduction OpenAI has achieved. In December 2024, the company’s o3 model demonstrated strong performance but at prohibitive costs—making it impractical for real-world deployment. The new GPT-5.2 Pro models deliver comparable or superior performance at a fraction of the cost, bringing AGI-like reasoning capabilities closer to practical applications.
The leaderboard reveals a clear performance-cost tradeoff among different model configurations. GPT-5.2 Pro is available in multiple compute tiers: X-High, High, Medium, and Low, with higher tiers achieving better scores but at increased cost. The High configuration appears to offer the best balance for ARC-AGI-2, while the X-High configuration excels on ARC-AGI-1.
Implications for AI Development
The ARC-AGI benchmarks have become crucial metrics in the race toward artificial general intelligence because they resist the typical scaling approaches that have driven recent AI progress. Models cannot simply memorize solutions—they must genuinely understand abstract concepts and reason through novel problems.
OpenAI’s breakthrough suggests that combining advanced reasoning capabilities with efficient inference methods may finally be making AGI-level performance economically viable. However, with even the leading models still failing nearly half of ARC-AGI-2’s challenges, significant gaps remain between current AI capabilities and human-level general intelligence.
The verified results from the ARC Prize organization provide independent confirmation of these advances, offering a standardized measure of progress as companies race to develop increasingly capable AI systems. As the competition intensifies, with major players like OpenAI, Google, and Anthropic all pushing boundaries, the pace of improvement in abstract reasoning capabilities appears to be accelerating dramatically.