Artificial Analysis Rejigs Intelligence Index, GPT-5.2 (xHigh) Takes Top Spot

The field of Artificial Intelligence is moving fast, and so are the benchmarks used to measure its progress.

Artificial Analysis has released version 4.0 of its Intelligence Index, a comprehensive benchmark suite designed to assess the capabilities of frontier AI models. The updated index introduces three new evaluations and significantly raises the difficulty bar: top-performing models now score 50 or below, compared with a top score of 73 on the previous version.

OpenAI’s GPT-5.2 with “xhigh” reasoning effort leads the new rankings with a score of 50, followed closely by Anthropic’s Claude Opus 4.5 at 49 and Google’s Gemini 3 Pro at 48. The reshuffled index reflects a deliberate effort to reduce benchmark saturation and better align testing with real-world AI applications, particularly in agentic capabilities.

A More Demanding Assessment

The Intelligence Index v4.0 incorporates 10 evaluations spanning four equally weighted categories: Agents, Coding, Scientific Reasoning, and General Intelligence. The update adds three new benchmarks—GDPval-AA, AA-Omniscience, and CritPT—while removing MMLU-Pro, AIME 2025, and LiveCodeBench to maintain relevance and differentiation across model tiers.
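
Artificial Analysis has not published the aggregation code, but a four-category, equal-weight composite reduces to a simple weighted average. A minimal sketch, assuming each category is already scored on a 0–100 scale (the category sub-scores below are illustrative placeholders, not published figures):

```python
# Minimal sketch of an equal-weight composite across four categories.
# Category sub-scores are made-up placeholders for illustration only.

CATEGORY_WEIGHTS = {
    "agents": 0.25,
    "coding": 0.25,
    "scientific_reasoning": 0.25,
    "general_intelligence": 0.25,
}

def composite_index(category_scores: dict[str, float]) -> float:
    """Average per-category scores (0-100 scale) with equal weights."""
    return sum(CATEGORY_WEIGHTS[name] * score
               for name, score in category_scores.items())

# Hypothetical example for a single model.
print(composite_index({
    "agents": 40.0,
    "coding": 52.0,
    "scientific_reasoning": 36.0,
    "general_intelligence": 60.0,
}))  # -> 47.0
```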

“We like nuance and breakdowns at Artificial Analysis, but when you want a single number, this is the best one,” the organization stated in announcing the update.

The revised methodology addresses a critical challenge in AI evaluation: as models improve rapidly, benchmarks can become saturated, making it difficult to distinguish between frontier systems. By incorporating harder evaluations, the new index maintains its ability to differentiate performance from smaller models through to the most advanced systems.

Real-World Economic Value Takes Center Stage

Perhaps the most significant addition is GDPval-AA, an evaluation based on OpenAI’s GDPval dataset that tests models on economically valuable tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities through Artificial Analysis’s reference agent framework called “Stirrup,” and are evaluated on their ability to produce realistic work products including documents, presentations, diagrams, and spreadsheets.

GPT-5.2 (xhigh) achieved an Elo rating of 1442 on GDPval-AA, with Claude Opus 4.5’s non-thinking variant scoring 1403. Claude Sonnet 4.5 placed at 1259 Elo, demonstrating a clear performance hierarchy among frontier models on practical business tasks.
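
GDPval-AA reports Elo ratings, which come from pairwise comparisons of work products rather than absolute grades. Artificial Analysis has not detailed its exact rating procedure, so the following is only a sketch of a textbook Elo update from a single judged comparison; the K-factor and starting ratings are assumptions:

```python
# Sketch of a standard Elo update from one pairwise comparison.
# The K-factor (32) and starting ratings (1200) are illustrative
# assumptions, not Artificial Analysis's published parameters.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one judged comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Hypothetical example: two models start at 1200; A's work product wins.
print(elo_update(1200.0, 1200.0, a_won=True))  # -> (1216.0, 1184.0)
```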

The emphasis on agentic capabilities reflects the industry’s shift toward AI systems that can autonomously complete complex, multi-step workflows rather than simply responding to prompts.

Testing the Limits of Scientific Reasoning

CritPT, another new addition, presents an even more daunting challenge. Developed by over 50 active physics researchers from 30+ institutions, this benchmark consists of 71 research-level problems in modern physics, designed to simulate entry-level research projects comparable to assignments a principal investigator might give junior graduate students.

Current results reveal significant room for improvement: GPT-5.2 (xhigh) leads with just 11.5% accuracy, followed by Gemini 3 Pro Preview (high) and Claude Opus 4.5 (Thinking). The low scores underscore that even the most advanced AI systems remain far from reliably solving full research-scale scientific challenges.

The Hallucination Problem Persists

The third new benchmark, AA-Omniscience, measures factual recall and hallucination across 6,000 questions covering 42 topics in six domains. The evaluation uses a specialized scoring system that rewards precise knowledge while penalizing hallucinated responses, revealing an important trade-off: high accuracy doesn’t guarantee low hallucination.
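
Artificial Analysis describes the Omniscience Index as rewarding correct answers and penalizing hallucinated ones, with declining to answer treated neutrally. Below is a minimal sketch of that style of scoring, assuming +1 for a correct answer, -1 for an incorrect (hallucinated) answer, and 0 for an abstention; both the payoffs and the hallucination-rate definition are assumptions, not the benchmark's published specification:

```python
# Sketch of an accuracy-minus-hallucination scoring rule.
# The +1 / -1 / 0 payoffs and the hallucination-rate definition are
# assumptions for illustration, not the benchmark's published spec.

def omniscience_style_index(correct: int, incorrect: int, abstained: int) -> float:
    """Index on a -100..100 scale: correct answers add, wrong answers
    subtract, abstentions are neutral."""
    total = correct + incorrect + abstained
    return 100.0 * (correct - incorrect) / total

def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of not-known questions answered wrongly instead of abstained."""
    unknown = incorrect + abstained
    return incorrect / unknown if unknown else 0.0

# Toy example: out of 100 questions, 50 correct, 35 wrong, 15 abstentions.
print(omniscience_style_index(50, 35, 15))  # -> 15.0
print(hallucination_rate(35, 15))           # -> 0.7
```

Under a rule like this, a model that answers aggressively can post high accuracy yet still land a modest index if its wrong answers pile up, which is the trade-off the results below illustrate.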

Gemini 3 Pro Preview (high) topped the Omniscience Index with a score of 13, followed by Claude Opus 4.5 (Thinking) and Gemini 3 Flash (Reasoning) at 10 each. However, the breakdown revealed concerning patterns. While Google’s Gemini models achieved the highest accuracy rates (54% and 51%), they also exhibited the highest hallucination rates (88% and 85%).

Anthropic’s Claude models demonstrated a different profile, with lower accuracy but significantly reduced hallucination. GPT-5.2 (high) achieved 51% accuracy with the second-lowest hallucination rate, suggesting different optimization strategies among the leading AI labs.

Cost Considerations

While the Artificial Analysis Intelligence Index maps how intelligent models are, the cost of running the benchmarks is another useful lens on their practicality. GPT-5.2 (high) sits near the top of the Intelligence Index but requires approximately $2,930 to run the complete evaluation suite, with roughly $2,361 of that coming from reasoning costs. This contrasts sharply with more efficient models such as Gemini 3 Flash and Claude 4.5 Sonnet, which cost under $1,000 to evaluate comprehensively.
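
The published figures are totals, but the underlying arithmetic is simple: evaluation cost is token volume multiplied by per-token price, and for high-reasoning-effort configurations the reasoning (output) tokens tend to dominate. A hedged sketch with made-up token counts and prices, not the actual numbers behind the $2,930 figure:

```python
# Sketch of benchmark-run cost arithmetic: tokens times per-million price.
# All token counts and prices below are made-up placeholders, not the
# actual figures behind the costs quoted above.

def run_cost(input_tokens: int, output_tokens: int,
             usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Total USD cost for one full evaluation run."""
    return ((input_tokens / 1e6) * usd_per_m_input
            + (output_tokens / 1e6) * usd_per_m_output)

# Hypothetical example: 80M input tokens and 150M output/reasoning tokens
# at $1.25 and $10 per million tokens respectively.
total = run_cost(80_000_000, 150_000_000, 1.25, 10.0)
print(f"${total:,.2f}")  # -> $1,600.00
```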

The cost breakdown highlights an emerging tension in AI development: the most capable models often require extensive computational resources, raising questions about practical deployment for cost-sensitive applications.

Industry Implications

The Intelligence Index v4.0 update comes at a critical moment for the AI industry. As models approach human-level performance on many traditional benchmarks, the field needs more discriminating tests to guide development priorities and help enterprises make informed deployment decisions.

The tight clustering at the top, with just two points separating the leading three models, suggests the major AI labs are locked in fierce competition. OpenAI’s narrow lead with GPT-5.2, Anthropic’s strong showing with Claude Opus 4.5, and Google’s competitive positioning with Gemini 3 Pro, particularly on price, indicate that no single company has established a commanding advantage in general intelligence. With more headroom for models to improve, the rejigged index will track how the top models progress through 2026 as these companies jostle for leadership in AI.
