Claude Opus 4.5 Beats GPT-5.1 On Artificial Analysis Intelligence Index, Trails Google’s Gemini 3

Claude Opus 4.5 topped coding and agentic-use benchmarks when it was released, and it has now become the second most capable model on the Artificial Analysis Intelligence Index.

Anthropic’s latest flagship model, Claude Opus 4.5, has secured the runner-up position with a score of 70, tying OpenAI’s GPT-5.1 (high) and trailing only Google’s Gemini 3 Pro, which leads at 73 points. The release marks a significant intelligence upgrade for Anthropic’s model family. Claude Opus 4.5 delivers a 7-point improvement over Claude Sonnet 4.5 and an 11-point jump from its predecessor, Claude Opus 4.1, establishing a new performance ceiling for the company’s offerings.

Pricing Strategy Shifts the Economics

Anthropic has dramatically restructured its pricing model, cutting per-token costs by roughly two-thirds. Claude Opus 4.5 is priced at $5 per million input tokens and $25 per million output tokens, compared to $15/$75 for Claude Opus 4.1. This positions the new model much closer to the mid-tier Claude Sonnet 4.5 ($3/$15 per million tokens) while delivering substantially higher intelligence in thinking mode.

However, the headline price cuts don’t tell the complete story. Claude Opus 4.5 consumed approximately 60% more tokens to complete Artificial Analysis’s Intelligence Index evaluations than Claude Opus 4.1—48 million tokens versus 30 million for its predecessor. Even with this increased token consumption, the actual cost to run the evaluations dropped from $3,100 to $1,500—a meaningful reduction, but a less dramatic one than the per-token pricing suggests.
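The interplay between cheaper tokens and heavier token usage can be checked with quick arithmetic. The sketch below uses only the output-token figures reported above (input tokens account for the remainder of the roughly $3,100 and $1,500 eval totals):

```python
# Back-of-the-envelope output-token cost for the Intelligence Index runs,
# using the figures reported in the article.
OUTPUT_PRICE = {  # $ per million output tokens
    "Claude Opus 4.1": 75.0,
    "Claude Opus 4.5": 25.0,
}
OUTPUT_TOKENS_M = {  # millions of output tokens used on the eval
    "Claude Opus 4.1": 30,
    "Claude Opus 4.5": 48,
}

for model, price in OUTPUT_PRICE.items():
    cost = OUTPUT_TOKENS_M[model] * price
    print(f"{model}: ~${cost:,.0f} in output tokens alone")

# Per-token output pricing fell ~67%, but token usage rose 60%,
# so the net saving on output spend is smaller than the sticker cut.
net_saving = 1 - (48 * 25.0) / (30 * 75.0)
print(f"Net output-cost reduction: {net_saving:.0%}")  # ~47%
```

This is why the realized eval cost fell by roughly half rather than by the two-thirds implied by the price list.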

Despite this, Claude Opus 4.5 remains among the more expensive models to operate at scale, costing more to run the Intelligence Index than Gemini 3 Pro (high), GPT-5.1 (high), and Claude Sonnet 4.5 (Thinking), though it edges out Grok 4 (Reasoning) on cost efficiency.

Token Efficiency as a Competitive Advantage

A key differentiator for the Claude models remains that they are substantially more token-efficient than nearly all other reasoning models. Claude Opus 4.5 delivers significantly higher intelligence without a large increase in output tokens, in contrast to other model families that lean on more reasoning at inference time (i.e., more output tokens). On the Output Tokens Used in Artificial Analysis Intelligence Index vs Intelligence Index chart, Claude Opus 4.5 (Thinking) sits on the Pareto frontier.

The model used 48 million output tokens to complete the Intelligence Index evaluations—substantially fewer than Gemini 3 Pro (high) at 92 million, GPT-5.1 (high) at 81 million, and Grok 4 (Reasoning) at 120 million tokens. This output token efficiency contributes to Claude Opus 4.5 (in Thinking mode) offering a better tradeoff between intelligence and cost to run the Artificial Analysis Intelligence Index than Claude Opus 4.1 (Thinking) and Grok 4 (Reasoning).

While Claude Opus 4.5 is significantly more token-efficient than nearly all other reasoning models, it did use approximately 60% more tokens than Claude Opus 4.1. Combined with its relatively high per-token pricing, this makes Claude Opus 4.5 among the most expensive models to run the Artificial Analysis Intelligence Index, at roughly $1,500.

Strongest Performance in Coding and Agentic Tasks

The intelligence gains are most pronounced in areas critical to enterprise AI applications. Compared to Claude Sonnet 4.5 (Thinking), Claude Opus 4.5 shows substantial improvements in coding and agentic tasks: a 16 percentage point increase on LiveCodeBench, 11 points on Terminal-Bench Hard, 12 points on τ²-Bench Telecom, 8 points on AA-LCR, and 11 points on Humanity’s Last Exam.

Claude Opus 4.5 achieves Anthropic’s highest scores across all 10 benchmarks in the Intelligence Index. Notably, it earned the top score among all models on Terminal-Bench Hard at 44%, and tied with Gemini 3 Pro on MMLU-Pro at 90%. On CritPt, a frontier physics evaluation designed to test research assistant capabilities, Claude Opus 4.5 scored 5%, trailing only Gemini 3 Pro (9%) and tying GPT-5.1 (high).

Knowledge and Hallucination Performance

Claude Opus 4.5 (Thinking) takes the #2 spot on the Artificial Analysis Omniscience Index, a new benchmark measuring knowledge and hallucination across domains. It places second on both the Omniscience Index (the lead metric, which deducts points for incorrect answers) and Omniscience Accuracy (percentage correct), pairing high accuracy with a low hallucination rate relative to peer models.

The model achieved a score of 10 on the Omniscience Index, behind only Gemini 3 Pro Preview at 13, and ahead of Claude Opus 4.1 (Thinking) at 5 and GPT-5.1 (high) at 2. It posted the second-highest accuracy at 43% while maintaining the fourth-lowest hallucination rate at 58%, trailing only Claude Haiku (Thinking) at 26%, Claude Sonnet 4.5 (Thinking) at 48%, and GPT-5.1 (high).

Anthropic emphasized that Claude Opus 4.5 demonstrates lower hallucination rates than select other frontier models including Grok 4 and Gemini 3 Pro, reinforcing the company’s focus on AI safety.

Non-Reasoning Mode Leads the Pack

In non-reasoning mode, Claude Opus 4.5 scores 60 on the Artificial Analysis Intelligence Index, making it the most intelligent non-reasoning model available. It surpasses Qwen3 Max (55), Kimi K2 0905 (50), and Claude Sonnet 4.5 (50) in standard inference mode.

Availability and Technical Specifications

Claude Opus 4.5 features a 200,000-token context window and supports up to 64,000 output tokens. The model is available through Anthropic’s API, Google Vertex AI, Amazon Bedrock, and Microsoft Azure. It’s also accessible via the Claude app and Claude Code, Anthropic’s command-line tool for agentic coding tasks.
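For developers, access through Anthropic’s API looks like any other Claude model call. The sketch below uses Anthropic’s official Python SDK; the model identifier string is an assumption—confirm the exact id in Anthropic’s model listing before use:

```python
# Hedged sketch: calling Claude Opus 4.5 via Anthropic's Python SDK
# (`pip install anthropic`). The model id below is an assumption,
# not confirmed by the article.
import os


def summarize(text: str) -> str:
    import anthropic  # imported lazily so the sketch parses without the SDK

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    msg = client.messages.create(
        model="claude-opus-4-5",  # assumed id; 200k context, up to 64k output
        max_tokens=1024,          # well under the 64,000-token output cap
        messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
    )
    return msg.content[0].text


if __name__ == "__main__" and os.environ.get("ANTHROPIC_API_KEY"):
    print(summarize("Claude Opus 4.5 scored 70 on the Intelligence Index."))
```

The same request shape works through Vertex AI, Bedrock, and Azure via their respective Anthropic integrations, though each platform has its own client and authentication flow.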

The release positions Anthropic competitively in the increasingly crowded frontier model landscape, where Google, OpenAI, and emerging players like xAI are racing to deliver more capable AI systems. With its combination of strong performance on practical tasks, token efficiency, and reduced pricing, Claude Opus 4.5 represents Anthropic’s bid to capture enterprise users who need powerful AI capabilities at more predictable costs.
