AI is progressing at such a pace that it’s hard to keep up.
A week after Google released its benchmark-smashing Gemini 3.0 Pro model, Anthropic has released Claude Opus 4.5. The new flagship from the AI safety-focused company delivers what Anthropic calls “frontier performance” across coding, agentic workflows, and complex reasoning tasks, setting new standards on several key enterprise benchmarks.

Leading the Pack in Agentic Coding
Claude Opus 4.5 achieves an impressive 80.9% on SWE-bench Verified, the industry’s premier benchmark for real-world software engineering tasks. This surpasses both Google’s Gemini 3.0 Pro (76.2%) and Anthropic’s own Claude Sonnet 4.5 (77.2%). More significantly, it outperforms specialized coding models such as OpenAI’s Codex-Max, which scored 77.9% on the same benchmark.

The model’s dominance extends to terminal-based coding tasks, where it achieves 59.3% on Terminal-bench 2.0—substantially ahead of Gemini 3.0 Pro’s 54.2% and Sonnet 4.5’s 50.0%. This suggests Opus 4.5 excels at the kind of autonomous, multi-step coding workflows that enterprises increasingly rely on.
Agentic Tool Use: A New Frontier
Perhaps most striking is Opus 4.5’s performance in agentic tool use scenarios. On the τ²-bench evaluation, which tests how well models can orchestrate complex workflows using multiple tools, Opus 4.5 demonstrates strong capabilities across different domains.
In retail scenarios, the model scores 88.9%, ahead of Gemini 3.0 Pro’s 85.3% and with greater consistency. On telecom tasks, Opus 4.5 achieves 98.2%, essentially perfect performance that effectively matches the 98.0% scores of both Sonnet 4.5 and Gemini 3.0 Pro.
Anthropic emphasizes that Opus 4.5 reaches peak performance in just four iterations when refining its approach to complex tasks, while competing models require up to ten attempts to achieve similar quality.
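For readers unfamiliar with the mechanics, agentic tool use means the model is handed machine-readable tool definitions and decides on its own when to invoke them. Below is a minimal sketch of that pattern using Anthropic’s Python SDK; the get_order_status tool is a hypothetical retail example, and the model identifier is an assumption that should be checked against Anthropic’s current model list.

    # pip install anthropic
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Declare a tool the model may call; Claude decides if and when to invoke it.
    tools = [{
        "name": "get_order_status",  # hypothetical tool for illustration
        "description": "Look up the shipping status of a customer order by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    }]

    response = client.messages.create(
        model="claude-opus-4-5",  # assumed identifier; verify before use
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "Where is order A1234?"}],
    )

    # If the model chose to call the tool, the response contains a tool_use block.
    for block in response.content:
        if block.type == "tool_use":
            print(block.name, block.input)

In a full agent loop, the caller executes the requested tool and returns the output in a tool_result message, letting the model chain as many calls as the task requires; benchmarks like τ²-bench score how reliably that loop completes realistic tasks.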
Enterprise-Grade Reasoning
Claude Opus 4.5 scores 87.0% on GPQA Diamond, a benchmark testing graduate-level reasoning across physics, chemistry, and biology. While this trails Gemini 3.0 Pro’s leading 91.9%, it represents strong performance on problems requiring deep domain expertise.
The model also achieves 80.7% on MMMU, a multimodal understanding benchmark that combines visual and textual reasoning, ahead of Sonnet 4.5’s 77.8% and competitive with Gemini 3.0 Pro’s performance.
On multilingual question answering (MMMLU), Opus 4.5 scores 90.8%, placing it just behind Gemini 3.0 Pro (91.8%) but ahead of most competitors in understanding and responding across multiple languages.
Scaled Tool Use and Computer Control
Anthropic reports that Opus 4.5 achieves 62.3% on MCP Atlas, a benchmark for scaled tool use with multiple simultaneous tools and complex workflows. This significantly outpaces Sonnet 4.5 (43.8%) and Opus 4.1 (40.9%), though Gemini 3.0 Pro scores aren’t available for direct comparison.
In computer use scenarios—where the model controls a desktop environment to complete tasks—Opus 4.5 scores 66.3% on OSWorld, ahead of Sonnet 4.5’s 61.4% and well clear of Opus 4.1’s 44.4%.
Novel Problem Solving
On ARC-AGI-2, which tests novel problem-solving abilities that can’t be memorized from training data, Opus 4.5 achieves 37.6%. On this particularly challenging benchmark, Sonnet 4.5 scores just 13.6%, Gemini 3.0 Pro reaches 31.1%, and GPT-5.1 trails at 17.6%.
Pricing and Availability
Claude Opus 4.5 is available immediately in the Claude apps, in Claude Code, and via the Anthropic API, with developer pricing set at $5 per million input tokens and $25 per million output tokens. The model is also offered on major cloud platforms, including Amazon Bedrock and Google Cloud’s Vertex AI.
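For developers, access follows Anthropic’s standard Messages API. Here is a minimal sketch using the official Python SDK, again assuming the model alias below matches Anthropic’s current model list; the cost estimate at the end simply applies the quoted per-token prices to the usage figures the API returns.

    # pip install anthropic
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-opus-4-5",  # assumed identifier; verify before use
        max_tokens=1024,
        messages=[{"role": "user", "content": "Summarize the tradeoffs of optimistic locking."}],
    )
    print(response.content[0].text)

    # Estimate the cost of this call at $5/M input and $25/M output tokens.
    usage = response.usage
    cost = usage.input_tokens / 1e6 * 5 + usage.output_tokens / 1e6 * 25
    print(f"Approximate cost: ${cost:.4f}")

At those rates, a call that consumes 2,000 input tokens and produces 800 output tokens would cost roughly $0.03 (2,000 × $5/1M + 800 × $25/1M).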
The AI Arms Race Intensifies
The rapid succession of releases (GPT-5.1 in early November, Gemini 3.0 Pro in mid-November, and now Claude Opus 4.5) underscores the breakneck pace of AI development. Each company is leapfrogging the others in specific domains: Anthropic claims particular strength in agentic coding and tool use, Google leads in pure reasoning and in image- and video-generation benchmarks, and the competitive landscape shifts weekly. But with Anthropic once again releasing a state-of-the-art coding model, the company appears set to maintain its dominance in enterprise and business-focused use cases for the time being.