Anthropic Releases Claude Sonnet 4.5, Beats GPT-5 And Gemini 2.5 Pro On Coding Benchmarks

The frontier AI labs continue to outdo each other with their new model releases.

Anthropic has launched Claude Sonnet 4.5, positioning it as “the best coding model in the world”. Claude Sonnet 4.5’s benchmark results surpass competing models from OpenAI and Google across multiple dimensions of software development and agentic workflows, with more modest gains in other areas.

Claude Sonnet 4.5 benchmark results

Dominance in Agentic Coding

The new model demonstrates exceptional performance in agentic coding, scoring 77.2% on the SWE-bench Verified benchmark and rising to 82.0% when parallel test-time compute is enabled. That puts it well ahead of GPT-5 Codex (74.5%) and Gemini 2.5 Pro (67.2%), and slightly ahead of Claude Sonnet 4 (80.2%).
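Anthropic has not published how its parallel test-time compute works; a common interpretation is best-of-n sampling, where several candidate solutions are generated in parallel and a scorer picks the strongest. A minimal sketch under that assumption (the function names and toy scorer are illustrative, not Anthropic's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def solve_with_parallel_compute(generate, score, n=4):
    """Sample n candidate solutions in parallel, return the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, range(n)))
    return max(candidates, key=score)

# Toy stand-ins: each "candidate" is just a number, and the scorer
# prefers larger values. In practice, generate() would be a model call
# and score() a check against the task's test suite.
best = solve_with_parallel_compute(generate=lambda i: i * i, score=lambda c: c, n=4)
print(best)  # 9
```

The extra compute buys robustness: one weak sample no longer decides the outcome, which is consistent with the jump from 77.2% to 82.0% reported above.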

Perhaps more impressive is Claude Sonnet 4.5’s performance on Terminal-Bench, which measures agentic terminal coding capabilities. The model achieved a 50.0% success rate, substantially ahead of Claude Opus 4.1 (46.5%), Claude Sonnet 4 (36.4%), GPT-5 (43.8%), and Gemini 2.5 Pro (25.3%). This benchmark is particularly relevant as it tests the model’s ability to navigate command-line interfaces and execute complex development tasks autonomously.

Leading Agentic Tool Use

Claude Sonnet 4.5 sets new standards in agentic tool use across the tau-bench evaluations. In the Retail domain, it achieved 86.2%, closely matching Claude Opus 4.1’s 86.8% while outperforming GPT-5 (81.1%) and Claude Sonnet 4 (83.8%).

The model’s superiority becomes more pronounced in specialized domains. For Airline tasks, Claude Sonnet 4.5 scored 70.0%, significantly ahead of its competitors, which all clustered around 63%. Most notably, in the Telecom domain, Claude Sonnet 4.5 achieved a remarkable 98.0% success rate, dramatically outperforming Claude Opus 4.1 (71.5%), GPT-5 (56.7%), and Claude Sonnet 4 (49.6%).

Computer Use Leadership

In the OSWorld benchmark measuring computer use capabilities, Claude Sonnet 4.5 achieved 61.4%, substantially outperforming Claude Opus 4.1 (44.4%) and Claude Sonnet 4 (42.2%). This metric is critical for evaluating models’ ability to interact with desktop environments, navigate interfaces, and complete tasks that require understanding visual layouts and executing precise actions.

Strong Performance Across Core Capabilities

Beyond coding, Claude Sonnet 4.5 demonstrates competitive performance across diverse benchmarks. On the AIME 2025 high school math competition, the model achieved a perfect 100% score when using Python and 87.0% without tools, outperforming Claude Opus 4.1 (78.0%) and Claude Sonnet 4 (70.5%), and approaching GPT-5’s 99.6% (Python) and 94.6% (no tools).

In graduate-level reasoning measured by GPQA Diamond, Claude Sonnet 4.5 scored 83.4%, positioning it competitively against GPT-5 (85.7%) and Gemini 2.5 Pro (86.4%), while surpassing Claude Opus 4.1 (81.0%) and Claude Sonnet 4 (76.1%).

The model achieved 89.1% on MMMLU multilingual Q&A, matching the performance tier of its competitors, and scored 77.8% on the MMMU (validation) visual reasoning benchmark, demonstrating strong multimodal capabilities.

For financial analysis tasks using the Finance Agent benchmark, Claude Sonnet 4.5 achieved 55.3%, outperforming all tested competitors including GPT-5 (46.9%) and Gemini 2.5 Pro (29.4%).

Enhanced Developer Tools and API Capabilities

Alongside the model release, Anthropic announced significant upgrades to Claude Code, including a refreshed terminal interface and a new VS Code extension that brings Claude directly into developers’ IDEs. A new checkpoints feature enables developers to confidently execute large tasks with the ability to instantly roll back to previous states.

The company has also expanded Claude’s data analysis capabilities, allowing the model to analyze data, create files, and visualize insights. This feature is now available in preview on all paid plans.

For API users, Anthropic introduced two critical new capabilities for building agents that handle long-running tasks: context editing to automatically clear stale context and a memory tool to store and consult information outside the context window. These features address common challenges developers face when building agentic applications that need to maintain state across extended interactions.
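The idea behind these two features can be illustrated without the API itself. The sketch below is a toy model of the concept only: a store that persists outside the (simulated) context window, plus a trim step that discards stale turns. The class and function names are invented for illustration and are not Anthropic's API.

```python
class MemoryStore:
    """Toy key-value store standing in for a memory tool: state the agent
    writes here survives even after the conversation context is trimmed."""
    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key, default=None):
        return self._data.get(key, default)

def trim_context(messages, keep_last=2):
    """Toy context editing: drop stale turns, keep only the most recent ones."""
    return messages[-keep_last:]

memory = MemoryStore()
memory.write("user_goal", "migrate repo to TypeScript")

context = [f"turn {i}" for i in range(10)]
context = trim_context(context, keep_last=2)

print(context)                   # ['turn 8', 'turn 9']
print(memory.read("user_goal"))  # 'migrate repo to TypeScript'
```

The key property is the separation of concerns: trimming keeps the context window small over long-running tasks, while durable facts live in the store and can be consulted at any turn.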

The Claude for Chrome extension, previously available only to waitlist participants, has now been rolled out to everyone who joined the waitlist last month.

Availability and Pricing

Claude Sonnet 4.5 is available immediately on the Claude Developer Platform, with native integrations on Amazon Bedrock and Google Cloud’s Vertex AI. Anthropic has maintained the same pricing as Claude Sonnet 4, making the performance improvements available without additional cost.

Claude Sonnet 4.5 shows that Anthropic is doubling down on its lead in coding. The biggest gains are on coding benchmarks, while improvements on other benchmarks, such as those in math and science, are relatively modest. Anthropic's coding focus has paid off: the company now has a $5 billion ARR, most of which comes from professional coders and enterprise customers. GPT-5 and Codex had threatened to poach some of these users from Anthropic with their rapid releases, but with Sonnet 4.5, Anthropic is showing that it still has what it takes to remain one of the most popular coding models on the market.
