The merry-go-round of frontier labs besting one another with new model releases continues in 2026.
Anthropic announced Claude Opus 4.6 on Thursday, just three months after releasing its predecessor, Opus 4.5. The new flagship model demonstrates substantial performance gains across nearly every major AI benchmark, positioning it ahead of both Google’s Gemini 3 Pro and OpenAI’s GPT-5.2 in most categories. Anthropic launched Opus 4.6 with an Apple-style video aimed at consumers, much like its recent Super Bowl ad, rather than at its traditional enterprise user base.
Dominant Performance Across Key Benchmarks
Claude Opus 4.6 establishes itself as the industry leader in several critical areas. On Terminal-Bench 2.0, which evaluates agentic terminal coding capabilities, the model achieved 65.4%, narrowly edging out GPT-5.2’s 64.7% and well clear of Opus 4.5’s 59.8% and Gemini 3 Pro’s 56.2%. This benchmark tests how well AI models can handle real-world software engineering tasks in a command-line environment.
The model’s coding prowess extends to other metrics as well. On SWE-bench Verified, which measures verified software engineering capabilities, Opus 4.6 scored 80.8%, effectively matching Opus 4.5’s 80.9% and slightly edging out GPT-5.2’s 80.0%. Gemini 3 Pro trailed on this benchmark at 76.2%.

Breakthrough in Computer Use and Tool Integration
Perhaps most impressive is Opus 4.6’s performance on OSWorld, the agentic computer use benchmark. The model achieved 72.7%, a significant jump from Opus 4.5’s 66.3% and well ahead of Sonnet 4.5’s 61.4%. Neither Gemini 3 Pro nor GPT-5.2 has reported a score on this benchmark.
For agentic tool use, Opus 4.6 demonstrated strong capabilities across both retail and telecom domains. On the τ2-bench retail evaluation, it scored 91.9%, ahead of Opus 4.5’s 88.9% and well clear of GPT-5.2’s 82.0%. On the telecom evaluation, Opus 4.6 achieved an exceptional 99.3%, matching or exceeding the competing models.
Mixed Results on Scaled Tool Use
The MCP Atlas benchmark for scaled tool use showed more nuanced results. Opus 4.6 scored 59.5%, a decrease from Opus 4.5’s 62.3%. GPT-5.2 led this benchmark with 60.6%, while Gemini 3 Pro achieved 54.1% and Sonnet 4.5 lagged at 43.8%.
Search and Research Capabilities
Anthropic positioned Opus 4.6 as state-of-the-art for deep research and information retrieval. On BrowseComp, which measures agentic search capabilities, the model scored 84.0%—significantly higher than Opus 4.5’s 67.8% and Sonnet 4.5’s 43.9%. Gemini 3 Pro, using its Deep Research variant, achieved 59.2%, while GPT-5.2 Pro scored 77.9%.
Reasoning and Problem-Solving
The new model showed substantial improvements in multidisciplinary reasoning. On Humanity’s Last Exam, which tests complex reasoning without tool access, Opus 4.6 achieved 40.0%, compared to Opus 4.5’s 30.8% and Sonnet 4.5’s 17.7%. When tools were enabled, scores improved dramatically: Opus 4.6 reached 53.1%, while Gemini 3 Pro hit 45.8% and GPT-5.2 Pro achieved 50.0%.
The standout performance came on ARC-AGI-2, a novel problem-solving benchmark designed to test reasoning on tasks that are easy for humans but difficult for AI systems. Opus 4.6 scored an impressive 68.8%, far exceeding Opus 4.5’s 37.6%, Gemini 3 Pro’s 45.1%, and GPT-5.2 Pro’s 54.2%. That is a roughly 83% relative improvement over its predecessor (68.8 versus 37.6) and suggests meaningful progress toward more general reasoning capabilities.
Financial Analysis and Knowledge Work
For enterprise applications, Opus 4.6 demonstrated strong performance on the Finance Agent benchmark, achieving 60.7%, ahead of Opus 4.5’s 55.9% and Sonnet 4.5’s 54.2%, and significantly better than Gemini 3 Pro’s 44.1%. OpenAI’s GPT-5.1 scored 56.6%.
On GDPval-AA Elo, which evaluates economically valuable knowledge work across finance, legal, and other domains, Opus 4.6 scored 1606—substantially higher than Opus 4.5’s 1416, Sonnet 4.5’s 1277, and GPT-5.2’s 1462. Gemini 3 Pro trailed at 1195.
Advanced Reasoning Tasks
Graduate-level reasoning, as measured by GPQA Diamond, showed Opus 4.6 at 91.3%, slightly behind Gemini 3 Pro’s 91.9% and GPT-5.2 Pro’s 93.2%, but ahead of Opus 4.5’s 87.0%.
Visual reasoning benchmarks revealed competitive but not dominant performance. On MMMU Pro without tools, Opus 4.6 scored 73.9% compared to Gemini 3 Pro’s leading 81.0% and GPT-5.2’s 79.5%. With tools enabled, Opus 4.6 achieved 77.3%, while GPT-5.2 reached 80.4%.
Multilingual Capabilities
The model maintained strong multilingual performance on MMMLU, scoring 91.1%, just behind Gemini 3 Pro’s 91.8% but ahead of Opus 4.5’s 90.8%, Sonnet 4.5’s 89.5%, and GPT-5.2’s 89.6%.
Context Window and New Features
Beyond benchmark performance, Opus 4.6 introduces a one-million-token context window in beta—a first for Anthropic’s Opus family. This expanded capacity enables the model to work with entire codebases, large document sets, and complex enterprise workflows in a single session.
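For developers, tapping the expanded window should look much like any other Messages API call through the Anthropic Python SDK. The sketch below is a minimal illustration rather than official sample code: the beta flag name and the file path are assumptions, and access to the 1M-token beta may be gated, so check Anthropic’s documentation for the current identifier.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Concatenate a large corpus (e.g., an entire repository) into a single prompt.
with open("codebase_dump.txt") as f:  # hypothetical file for illustration
    codebase = f.read()

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    # Hypothetical beta flag -- the exact identifier for the 1M-token context
    # window is not confirmed here; consult Anthropic's docs.
    betas=["context-1m"],
    messages=[{
        "role": "user",
        "content": f"Here is our codebase:\n\n{codebase}\n\n"
                   "Summarize the main modules and flag any likely bugs.",
    }],
)

print(response.content[0].text)
```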
The model also powers new “agent teams” functionality in Claude Code, allowing multiple AI agents to work on different aspects of a task in parallel. Additionally, Anthropic announced Claude in PowerPoint as a research preview, extending the model’s reach into productivity applications alongside the previously released Claude in Excel.
Pricing and Availability
Claude Opus 4.6 is available immediately via claude.ai, the Claude API, and major cloud platforms. Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens. Developers can access the model using the identifier “claude-opus-4-6” via the API.
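At those rates, per-request costs are straightforward to estimate. The back-of-the-envelope Python sketch below uses only the published prices; the token counts are illustrative, and real bills will depend on caching, batching, and actual usage.

```python
# Published Opus 4.6 rates: $5 per million input tokens, $25 per million output tokens.
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a single request at the listed rates."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    )

# Example: a 200,000-token prompt (roughly a mid-sized codebase) with a 4,000-token reply.
print(f"${estimate_cost(200_000, 4_000):.2f}")  # -> $1.10
```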
Market Context
The release comes amid intense competition in the AI model space. OpenAI recently introduced new platforms for AI agents, while Google continues to enhance its Gemini family. Anthropic’s rapid iteration—releasing Opus 4.6 just three months after Opus 4.5—signals the accelerating pace of AI development.
The model’s strong showing on practical benchmarks like Terminal-Bench and GDPval-AA suggests Anthropic is prioritizing real-world business applications over pure academic performance. With approximately 80% of Anthropic’s business coming from enterprise customers, this strategic focus appears well-aligned with market demand. As the frontier labs continue their competitive sprint, the question remains whether rapid model releases will translate into sustained business value—or simply accelerate the commoditization of AI capabilities across the industry.