OpenAI Releases GPT-5.4 Pro And GPT-5.4 Thinking, Beats Opus 4.6 And Gemini 3.1 Pro On Many Benchmarks

The frontier AI labs continue to churn out new models that appear to leapfrog their competitors.

GPT-5.4 Thinking and GPT-5.4 Pro are rolling out now in ChatGPT. OpenAI’s latest release consolidates the best of its recent model work into a single general-purpose flagship. GPT-5.4 incorporates the coding strengths of GPT-5.3-Codex while adding first-class support for computer use, extended context, and professional knowledge work tasks — all in a package the company says is more token-efficient than GPT-5.2. That efficiency claim matters: OpenAI prices GPT-5.4 at $2.50 per million input tokens versus $1.75 for GPT-5.2, but argues the model completes tasks with significantly fewer tokens, reducing real-world costs.
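As a back-of-the-envelope check on that argument (illustrative arithmetic only, using the input prices quoted above): at $2.50 versus $1.75 per million input tokens, GPT-5.4 needs to complete the same task with roughly 30% fewer input tokens just to break even.

```python
# Break-even on input-token cost alone, using the quoted prices.
# Illustrative only: real savings also depend on output tokens and on
# how many fewer tokens GPT-5.4 actually uses per task.
price_gpt_5_2 = 1.75  # USD per million input tokens
price_gpt_5_4 = 2.50  # USD per million input tokens

# GPT-5.4 matches GPT-5.2's per-task cost when it uses this fraction of the tokens.
break_even_ratio = price_gpt_5_2 / price_gpt_5_4  # 0.70
print(f"Break-even at {1 - break_even_ratio:.0%} fewer input tokens")  # -> 30%
```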

“You can steer it mid-response, and it supports 1m tokens of context,” OpenAI CEO Sam Altman said on X.

It is the latest volley in a fast-moving frontier model race that has seen Google, Anthropic, and OpenAI exchange the top spot on leading benchmarks almost month by month.

GPT-5.4 Benchmarks: How It Stacks Up

According to OpenAI’s own benchmarks, GPT-5.4 Thinking outperforms Claude Opus 4.6 and Gemini 3.1 Pro across several key categories. On OSWorld-Verified, which measures a model’s ability to navigate a desktop environment using mouse and keyboard inputs, GPT-5.4 achieved 75.0% — ahead of Claude Opus 4.6’s 72.7% and surpassing human performance at 72.4%. On the GDPval knowledge-work benchmark, which tests performance across 44 professional occupations, GPT-5.4 reached 83.0%, outpacing Opus 4.6’s 78.0%. For agentic web browsing on BrowseComp, GPT-5.4 Pro reached 89.3%, eclipsing Gemini 3.1 Pro’s 85.9%. In software engineering on SWE-Bench Pro, GPT-5.4 scored 57.7%, against Gemini 3.1 Pro’s 54.2%.

GPT-5.4 Pro and GPT-5.4 Thinking benchmarks

Gemini 3.1 Pro had topped most benchmarks since its release two weeks ago, but GPT-5.4’s arrival already calls several of those standings into question. The two are effectively tied on GPQA Diamond (94.4% for GPT-5.4 Pro versus 94.3% for Gemini 3.1 Pro), and Gemini remains notably cheaper to run at scale.

Claude Opus 4.6, released by Anthropic last month, had itself been a strong challenger, claiming the top spot on the Artificial Analysis Intelligence Index at launch. GPT-5.4 now challenges those gains, particularly in agentic and professional work categories.

Computer Use: A Major Step Forward

Perhaps the most significant new capability in GPT-5.4 is native computer use. OpenAI describes it as “the first general-purpose model” it has released with this capability built in, rather than as a separate specialized system. The model can operate desktop environments via screenshots, issue mouse and keyboard commands, and complete tasks across applications — all through the standard API.
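A minimal sketch of the observe-act loop such an agent runs is below: capture a screenshot, send it with the task, execute whatever action the model returns, and repeat. The tool identifier ("computer_use"), the response fields, and the take_screenshot/execute_action helpers are illustrative assumptions, not confirmed details of the GPT-5.4 API.

```python
# Hedged sketch of a computer-use agent loop against the Responses API.
# Tool name, action schema, and the two helper stubs are assumptions.
import base64
from openai import OpenAI

client = OpenAI()

def take_screenshot() -> bytes:
    """Capture the current desktop as PNG bytes (stub for illustration)."""
    raise NotImplementedError

def execute_action(action) -> None:
    """Apply a mouse/keyboard action to the desktop (stub for illustration)."""
    raise NotImplementedError

task = "Open the quarterly report and export it as a PDF"
done = False
while not done:
    screenshot_b64 = base64.b64encode(take_screenshot()).decode()
    response = client.responses.create(
        model="gpt-5.4",                 # model name from the announcement
        tools=[{
            "type": "computer_use",      # assumed tool identifier
            "display_width": 1920,
            "display_height": 1080,
        }],
        input=[{
            "role": "user",
            "content": [
                {"type": "input_text", "text": task},
                {"type": "input_image",
                 "image_url": f"data:image/png;base64,{screenshot_b64}"},
            ],
        }],
    )
    # The model either asks for another action or finishes with a text answer.
    actions = [item for item in response.output if item.type == "computer_call"]
    if not actions:
        done = True
    for call in actions:
        execute_action(call.action)
```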

The benchmark results are striking. On OSWorld-Verified, GPT-5.4’s 75.0% success rate far exceeds GPT-5.2’s 47.3% — a 28-point jump in a single generation. Developers building agents for web or desktop automation now have a much stronger general-purpose option, without needing to route to a specialized model.

Professional and Knowledge Work

OpenAI put particular focus on spreadsheet, presentation, and document creation. On an internal benchmark of investment banking spreadsheet modeling tasks, GPT-5.4 scored 87.3%, compared to 68.4% for GPT-5.2 — a 19-point gain. On presentation quality, human raters preferred GPT-5.4 outputs 68% of the time over GPT-5.2, citing stronger aesthetics and image use.

The model also reduces hallucinations: OpenAI says individual factual claims are 33% less likely to be false compared to GPT-5.2, and full responses are 18% less likely to contain any errors. The company is positioning GPT-5.4 as a credible tool for high-stakes professional work in legal, financial, and enterprise contexts.

Tool Use and Agentic Workflows

GPT-5.4 introduces “tool search” in the API, a feature that lets the model look up tool definitions on demand rather than loading all of them into the prompt upfront. In testing with 36 MCP servers, this reduced total token usage by 47% while maintaining the same accuracy. For developers building complex agents on top of large tool ecosystems, this is a meaningful cost reduction.
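A hedged sketch of what that might look like from the developer’s side: the request points at MCP servers instead of inlining every tool schema, and an opt-in flag asks the model to fetch definitions only when needed. The server URLs are placeholders, and tool_search is an assumed parameter name rather than a documented one.

```python
# Sketch of on-demand tool lookup with MCP servers via the Responses API.
# The tool_search flag is an assumption, passed as an undocumented field.
from openai import OpenAI

client = OpenAI()

mcp_servers = [
    {"type": "mcp", "server_label": "crm", "server_url": "https://example.com/crm-mcp"},
    {"type": "mcp", "server_label": "billing", "server_url": "https://example.com/billing-mcp"},
    # ...dozens more servers whose tool definitions are no longer sent upfront
]

response = client.responses.create(
    model="gpt-5.4",
    tools=mcp_servers,
    input="Find overdue invoices for Acme Corp and draft a reminder email.",
    extra_body={"tool_search": True},  # assumed opt-in flag for on-demand lookup
)
print(response.output_text)
```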

On Toolathlon, an agentic tool-use benchmark, GPT-5.4 reached 54.6%, ahead of GPT-5.3-Codex at 51.9% and the Sonnet-class counterpart to Claude Opus 4.6 at 44.8%.

Availability and Pricing

GPT-5.4 is available in the API as gpt-5.4, priced at $2.50 per million input tokens and $15 per million output tokens. GPT-5.4 Pro, targeting the most demanding tasks, is available as gpt-5.4-pro at $30 per million input tokens and $180 per million output tokens. Batch and Flex pricing are available at half the standard rate; priority processing at 2x.
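For a rough sense of what those rates mean per request, here is a simple estimator using the listed prices; batch and Flex would halve the figures, and priority processing would double them.

```python
# Per-request cost estimate from the published rates (USD per million tokens).
PRICES = {
    "gpt-5.4":     {"input": 2.50, "output": 15.0},
    "gpt-5.4-pro": {"input": 30.0, "output": 180.0},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 50k-token prompt with a 5k-token response.
print(f"gpt-5.4:     ${estimate_cost('gpt-5.4', 50_000, 5_000):.2f}")      # $0.20
print(f"gpt-5.4-pro: ${estimate_cost('gpt-5.4-pro', 50_000, 5_000):.2f}")  # $2.40
```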

In ChatGPT, GPT-5.4 Thinking is rolling out to Plus, Team, and Pro subscribers starting today, replacing GPT-5.2 Thinking as the default. GPT-5.2 Thinking will remain available under Legacy Models for three months before retiring on June 5, 2026. Enterprise and Edu customers can enable early access through admin settings.

The Bigger Picture

The pace of iteration across the frontier labs has become almost dizzying. As we covered last month, OpenAI launched GPT-5.3-Codex within minutes of Anthropic’s Opus 4.6 release, immediately topping its Terminal-Bench 2.0 score. The pattern is becoming a feature of the industry: no lead holds for long.

GPT-5.4 is a notable release because it is the first time OpenAI has rolled a specialized coding model’s strengths back into its general-purpose flagship — reflecting a bet that the market wants one capable model rather than a portfolio of specialists. Whether that holds as Anthropic and Google respond with their next releases remains to be seen. But for now, OpenAI has reclaimed the top of several key leaderboards — at least until the next announcement.

Posted in AI