Anthropic Releases Claude Sonnet 4.6, Approaching Opus 4.6 On Many Benchmarks At A Lower Price Point

Gemini 3 Flash recently closed in on Gemini 3 Pro across many benchmarks, and Anthropic now appears to have done the same with its Sonnet 4.6 model.

Anthropic has released Claude Sonnet 4.6. The model represents a significant step forward for the company’s mid-tier lineup, delivering performance that rivals its flagship Opus 4.6 across several key benchmarks — and doing so at a price point that makes it far more practical for everyday enterprise and developer use.

Claude Sonnet 4.6 Benchmarks

Anthropic describes Sonnet 4.6 as a “full upgrade” across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. The model also introduces a 1 million token context window in beta — a substantial leap that enables it to process entire codebases, lengthy legal documents, or complex multi-document research tasks in a single pass. It has a reliable knowledge cutoff of August 2025.
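To make the 1M-token beta concrete, here is a minimal sketch of sending an entire codebase in a single request via the Anthropic Python SDK. The model id and beta flag strings are illustrative assumptions, not confirmed identifiers from the announcement; consult Anthropic's documentation for the exact values.

```python
# Hypothetical sketch: one-pass codebase review using the 1M-token context beta.
# The model id and beta flag below are assumptions for illustration only.
import pathlib
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Concatenate a repository's Python files into one prompt
# (assumed to fit within the 1M-token window).
codebase = "\n\n".join(
    f"# {path}\n{path.read_text(errors='ignore')}"
    for path in pathlib.Path("my_project").rglob("*.py")
)

response = client.beta.messages.create(
    model="claude-sonnet-4-6",            # assumed model id
    betas=["context-1m-2025-08-07"],      # assumed beta flag for the 1M window
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"Review this codebase and summarize its architecture:\n\n{codebase}",
    }],
)
print(response.content[0].text)
```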

On the SWE-bench Verified benchmark for agentic coding, Sonnet 4.6 scores 79.6%, closing in on Opus 4.6’s 80.8% and edging past GPT-5.2’s 80.0%. For agentic financial analysis (Finance Agent v1.1), Sonnet 4.6 leads the pack outright at 63.3%, ahead of Opus 4.6 at 60.1% and GPT-5.2 at 59.0%. In office tasks measured by GDPval-AA Elo, Sonnet 4.6 scores 1633 — higher than every competitor listed, including Opus 4.6 at 1606 and GPT-5.2 at 1462.

Claude Sonnet 4.6 benchmarks

Computer Use: A Breakthrough Trajectory

Perhaps the most striking improvement is in computer use. On the OSWorld-Verified benchmark, Sonnet 4.6 scores 72.5%, up from Sonnet 4.5’s 61.4% and dramatically ahead of GPT-5.2’s 38.2%. To put that trajectory in context: Claude Sonnet 3.5 scored just 14.9% on the same benchmark in October 2024. Within roughly 16 months, the score has nearly quintupled.

Anthropic says early users are already seeing human-level performance on tasks like completing complex spreadsheets and navigating multi-step web forms — capabilities that have significant implications for enterprise automation workflows.

Where Opus 4.6 Still Leads

Opus 4.6 maintains its edge in a handful of high-complexity domains. On agentic terminal coding (Terminal-Bench 2.0), Opus scores 65.4% versus Sonnet’s 59.1%. For agentic search on BrowseComp, Opus leads at 84.0% compared to Sonnet’s 74.7%. On novel problem-solving (ARC-AGI-2), Opus scores 68.8% to Sonnet’s 58.3%. And for graduate-level reasoning on GPQA Diamond, Opus edges ahead at 91.3% versus Sonnet’s 89.9%.

These gaps suggest Opus 4.6 remains the right choice for cutting-edge research, deep reasoning tasks, and scenarios where top-tier accuracy is non-negotiable. But for the majority of production workloads, the delta has shrunk considerably.

Claude Sonnet 4.6 Pricing and Availability

Sonnet 4.6 is priced at $3 per million input tokens and $15 per million output tokens — unchanged from Sonnet 4.5. By comparison, Opus 4.6 is positioned at a premium tier, making Sonnet 4.6’s near-parity on many benchmarks a compelling value proposition for cost-sensitive deployments.
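A quick back-of-the-envelope calculation shows what those rates mean in practice. The workload figures below (context size, output size, request volume) are illustrative assumptions, not numbers from the announcement; only the per-million-token prices come from the article.

```python
# Cost estimate at Sonnet 4.6's listed rates: $3 / 1M input tokens,
# $15 / 1M output tokens. Workload numbers are illustrative.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single API request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: an agentic coding request sending 200k tokens of context
# and producing 8k tokens of output, run 1,000 times per day.
per_request = request_cost(200_000, 8_000)   # $0.60 input + $0.12 output = $0.72
print(f"per request: ${per_request:.2f}, per day (1,000 runs): ${per_request * 1_000:,.2f}")
```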

The model is available immediately across all Anthropic plans, including the free tier, which has been upgraded to Sonnet 4.6 by default and now includes file creation, connectors, skills, and compaction. Sonnet 4.6 is also accessible via Claude Code, the Anthropic API, Cowork, and all major cloud platforms.

What It Means for the Market

The release reinforces an accelerating trend across the AI industry: the gap between frontier and near-frontier models is narrowing fast. Developers and enterprises that previously required flagship models for complex agentic workflows may now find Sonnet 4.6 sufficient — at a fraction of the cost. For Anthropic, that broadens the addressable market for its most capable mid-tier offering and increases competitive pressure on rivals whose premium models no longer hold as clear an advantage.

With Sonnet 4.6 now the default model on the free tier, Anthropic is also making a clear play for developer mindshare at the earliest stages of product development — a long-standing strategy in the platform wars that has historically paid dividends.
