GPT-5 might currently be the best AI model in the world, but it seems to have gotten no closer to achieving AGI.
xAI’s Grok 4 remains the best-performing model on the ARC-AGI-2 index, which tracks models’ general intelligence. Grok 4 had earlier blown away the competition with a score of 15.9%. GPT-5, released today, a month later, scored just 9.9%, though that was still enough to make it the second-best performing model on the index.
“Grok 4 is still state-of-the-art on ARC-AGI-2 among frontier models. 15.9% for Grok 4 vs 9.9% for GPT-5,” said Francois Chollet, founder of the ARC Prize. This didn’t go unnoticed by Elon Musk, who has little love for OpenAI or Sam Altman. “Grok 4 beats GPT-5 on ARC-AGI,” he declared.
The ARC Prize is a $1 million+ open competition created to advance progress toward Artificial General Intelligence (AGI). It incentivizes teams to solve the ARC-AGI benchmark, a set of reasoning tasks designed to evaluate how well AI systems can generalize and solve problems they have never seen before, a core aspect of human-like intelligence. The ARC-AGI benchmark, first released in 2019, consists of IQ-test-like puzzles that use colored grids to test abstract reasoning from minimal examples, without requiring prior domain knowledge or language. Performance on the ARC-AGI index is measured as the percentage of correct solutions on a private evaluation set, acting as a rigorous metric and milestone for AGI research.
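To make the format concrete, here is a minimal sketch of what an ARC-style task and its scoring look like. The grid values, the toy `solve` rule, and the task structure below are illustrative assumptions, not the actual ARC-AGI harness or its private evaluation set; the only thing the sketch preserves faithfully is that a prediction counts solely if the output grid matches exactly, and the index reports the percentage solved.

```python
# Illustrative sketch of an ARC-style task: each task gives a few input/output
# grid pairs (colors encoded as small integers) and a test input; a solution
# counts only if the predicted output grid matches the hidden answer exactly.
# The task below and the naive solver are made-up examples, not real ARC data.

from typing import List

Grid = List[List[int]]

def solve(grid: Grid) -> Grid:
    """Toy 'solver': apply the rule inferred from the demos (here, swap colors 1 and 2).
    Real ARC-AGI tasks require inferring a new rule per task from a handful of examples."""
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

# One made-up task: demonstration pairs plus a held-out test pair.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
        {"input": [[2, 2], [1, 1]], "output": [[1, 1], [2, 2]]},
    ],
    "test": [
        {"input": [[0, 1], [2, 0]], "output": [[0, 2], [1, 0]]},
    ],
}

# Scoring mirrors the index: percentage of test outputs reproduced exactly.
attempts = [solve(pair["input"]) for pair in task["test"]]
correct = sum(a == pair["output"] for a, pair in zip(attempts, task["test"]))
print(f"Score: {100 * correct / len(task['test']):.1f}%")  # 100.0% on this toy task
```

On the real private evaluation set, the same all-or-nothing scoring is what produces the headline figures: Grok 4’s 15.9% and GPT-5’s 9.9%.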
While other benchmarks have practical purposes, covering coding, math, science, and a host of other fields, ARC-AGI asks models to solve puzzles that seemingly have no real-life value. These puzzles are easily solved by most humans, but AI models struggle with them, which makes them a good test of how close AI is getting to generalized human intelligence. On the latest version of the test, the best AI model so far, Grok 4, could solve only 15.9 percent of the puzzles. And the fact that GPT-5 can solve only 9.9 percent shows that while GPT-5 might have topped the practical benchmarks, it has likely taken us no closer to achieving AGI.