Claude Mythos hasn’t been released to the public, but its reported abilities appear to far surpass those of any publicly available model.
Anthropic’s system card for Mythos Preview includes a full capability evaluation table comparing the unreleased model against Claude Opus 4.6, Anthropic’s current public flagship, and against GPT-5.4 and Gemini 3.1 Pro, currently the two strongest publicly available models from rival labs. The results are not close on most dimensions.

Coding
On SWE-bench Verified, the premier real-world software engineering benchmark, Mythos Preview scores 93.9%. Gemini 3.1 Pro scores 80.6%. GPT-5.4 has no reported score in the table. Opus 4.6 scores 80.8%, which was itself the top score among public models when that model launched.
On SWE-bench Pro — the harder tier designed for production-grade tasks — Mythos scores 77.8% against GPT-5.4’s 57.7% and Gemini 3.1 Pro’s 54.2%. That’s a 20-point lead over the best publicly available model. GPT-5.4’s SWE-bench Pro score of 57.7% was itself considered a strong result when the model launched in early March.
Terminal-Bench 2.0, which tests autonomous multi-step terminal coding, shows Mythos at 82% against GPT-5.4’s 75.1% and Gemini 3.1 Pro’s 68.5%. Anthropic notes that OpenAI used a specialized harness for their Terminal-Bench score, making direct comparison imprecise — but Mythos leads regardless of the caveat.
The SWE-bench Multimodal result is the most dramatic: 59% for Mythos against 27.1% for Opus 4.6, with no scores reported for GPT-5.4 or Gemini 3.1 Pro. On SWE-bench Multilingual, Mythos scores 87.3% against Opus 4.6’s 77.8%, again with competitors absent from the table.
Reasoning
GPQA Diamond, the graduate-level scientific reasoning benchmark, shows Mythos at 94.5%, edging Gemini 3.1 Pro’s 94.3% and GPT-5.4’s 92.8%. The differences here are narrow — this benchmark is now approaching saturation at the top — but Mythos leads.
The most eye-catching reasoning result is USAMO, the US Math Olympiad benchmark. Mythos Preview scores 97.6%. GPT-5.4 scores 95.2%. Gemini 3.1 Pro scores 74.4%. Opus 4.6 scores 42.3%. This is a benchmark where GPT-5.4 had recently dominated, with its 95.2% score considered a landmark result. Mythos clears it by 2.4 points.
GraphWalks BFS 256K-1M, which tests long-context reasoning over complex graph structures, shows Mythos at 80% against GPT-5.4’s 21.4% and Opus 4.6’s 38.7%. Gemini 3.1 Pro has no reported score. A near four-to-one lead over GPT-5.4 on a long-context reasoning task suggests Mythos has made qualitative gains in how it handles very large contexts.
On Humanity’s Last Exam without tools — raw reasoning, no search — Mythos scores 56.8% against GPT-5.4’s 39.8% and Gemini 3.1 Pro’s 44.4%. With tools, all three rise: Mythos 64.7%, GPT-5.4 52.1%, Gemini 3.1 Pro 51.4%.
CharXiv Reasoning, a scientific figure interpretation benchmark, shows Mythos at 86.1% without tools and 93.2% with tools, against Opus 4.6’s 61.5% and 78.9% respectively. GPT-5.4 and Gemini 3.1 Pro have no reported scores here.
Computer Use
On OSWorld, which measures autonomous desktop navigation via mouse and keyboard, Mythos scores 79.6% against GPT-5.4’s 75.0% and Opus 4.6’s 72.7%. Gemini 3.1 Pro has no reported score. This is notable because computer use was one of GPT-5.4’s headline capabilities at launch; it was the first general-purpose model OpenAI released with native computer use built in.
The Caveat Worth Keeping In Mind
Mythos Preview is not publicly available. Its benchmark numbers come from Anthropic’s own system card, using Anthropic’s own evaluation configurations — adaptive thinking at max effort, averaged over five trials. Competitor figures are drawn from each lab’s own published system cards and leaderboards, which use different configurations and harnesses. These are not apples-to-apples comparisons, and the Mythos numbers represent the model at maximum effort, not a default deployment setting.
More to the point: there is no particular reason to believe that Google and OpenAI lack comparable private models of their own. All three frontier labs invest heavily in model development, and the public release cadence of any lab reflects product and safety timelines as much as raw capability. Google had itself kept powerful models internal for years before competitive pressure forced its hand. OpenAI tests unreleased models under codenames on public arenas before deciding whether to launch them. The gap between what a lab can do and what it chooses to deploy is a known feature of this industry.
What Mythos Preview’s system card demonstrates is that Anthropic has a model that substantially outperforms the current public frontier on coding, mathematical reasoning, and agentic tasks. Whether that lead reflects a durable capability advantage, or simply an earlier moment in a staggered release cycle that GPT-5.5 and Gemini 3.2 will close quickly, is a question the next few months will answer.