Anthropic is maintaining its lead in coding models, and how.
Claude Mythos Preview — the unreleased frontier model at the center of Anthropic’s Project Glasswing cybersecurity initiative — posts benchmark numbers that make the current generation of public models look like an earlier era. Across agentic coding, scientific reasoning, and computer use, Mythos Preview doesn’t just beat Opus 4.6; it laps it on several key tests.
Coding: The Numbers That Matter
On SWE-bench Pro, the hardest tier of the industry's most-watched software engineering benchmark, Mythos Preview scores 77.8% against Opus 4.6's 53.4%. That's a 24.4-point gap on a test designed to be difficult. For context, when Gemini 3.1 Pro was released, GPT-5.3-Codex led SWE-bench Pro at 56.8%, a score Mythos Preview now exceeds by 21 points.

On SWE-bench Verified, the broader real-world software engineering test, Mythos hits 93.9% against Opus 4.6's 80.8%. On SWE-bench Multilingual, which tests coding ability across programming languages, Mythos scores 87.3% against 77.8% for Opus.
Terminal-Bench 2.0, which measures autonomous multi-step terminal coding — the kind of agentic work that Chinese models like Minimax M2.5 have been pushing hard to match — shows Mythos at 82.0% against Opus 4.6’s 65.4%.
The SWE-bench Multimodal result is the most striking: 59.0% for Mythos versus 27.1% for Opus 4.6, more than double. The benchmark (scored here using an internal implementation) tests a model's ability to understand visual context alongside code, which matters increasingly as AI agents are asked to work directly with GUIs and other interfaces.
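The point gaps and the "more than double" claim are straight arithmetic on the reported scores. For readers who want to check them, here is a minimal sketch; the score table is simply hard-coded from the figures quoted in this piece, not pulled from any benchmark API.

```python
# Scores (percent) as reported in this article: (Mythos Preview, Opus 4.6).
coding_scores = {
    "SWE-bench Pro":          (77.8, 53.4),
    "SWE-bench Verified":     (93.9, 80.8),
    "SWE-bench Multilingual": (87.3, 77.8),
    "Terminal-Bench 2.0":     (82.0, 65.4),
    "SWE-bench Multimodal":   (59.0, 27.1),
}

for bench, (mythos, opus) in coding_scores.items():
    gap = mythos - opus    # absolute gap in percentage points
    ratio = mythos / opus  # relative multiple
    print(f"{bench:24s}  gap = {gap:5.1f} pts   ratio = {ratio:.2f}x")

# Two lines of this output correspond to claims made above:
#   SWE-bench Pro         gap =  24.4 pts   ratio = 1.46x
#   SWE-bench Multimodal  gap =  31.9 pts   ratio = 2.18x  (the "more than double")
```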
Reasoning: A Clear Step Change
Mythos Preview scores 94.6% on GPQA Diamond, the graduate-level scientific reasoning benchmark spanning physics, chemistry, and biology. Opus 4.6 scores 91.3%. These numbers look close, but GPQA Diamond is designed so that marginal gains at the top require substantially greater capability. On this test Claude Opus 4.6 actually trailed Google's Gemini 3.1 Pro (91.3% vs 94.3%); Mythos Preview's 94.6% now edges past both.
On Humanity’s Last Exam — the benchmark designed to be unsolvable by current AI — Mythos Preview without tools scores 56.8% (Opus 4.6: 40.0%). With tools enabled, Mythos hits 64.7% against Opus 4.6’s 53.1%. The without-tools number is the more meaningful one: it’s a test of raw reasoning, not search-augmented retrieval. Anthropic notes that Mythos still performs well at low effort on HLE, which they flag as a possible sign of some memorization — worth keeping in mind when reading those numbers.
Benchmarks like Humanity’s Last Exam were created specifically because reasoning models were making earlier tests irrelevant. A 56.8% score without tools is still remarkable.
Agentic Search and Computer Use
BrowseComp, which tests complex multi-step web research, shows Mythos at 86.9% against Opus 4.6’s 83.7% — a smaller gap, but notable because Anthropic says Mythos achieves this while using 4.9x fewer tokens. That’s not just smarter; it’s meaningfully more efficient.
OSWorld-Verified, a computer use benchmark where the AI must navigate real desktop interfaces autonomously, shows Mythos at 79.6% against 72.7% for Opus 4.6.
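As with the coding results, the deltas in these sections are simple arithmetic on the reported figures. The sketch below restates them, along with what the quoted 4.9x token figure implies; all numbers are taken from this piece, and the token-budget line is just that ratio expressed as a percentage.

```python
# Scores (percent) reported above for reasoning, agentic search, and computer use:
# (Mythos Preview, Opus 4.6).
scores = {
    "GPQA Diamond":      (94.6, 91.3),
    "HLE (no tools)":    (56.8, 40.0),
    "HLE (with tools)":  (64.7, 53.1),
    "BrowseComp":        (86.9, 83.7),
    "OSWorld-Verified":  (79.6, 72.7),
}

for bench, (mythos, opus) in scores.items():
    print(f"{bench:18s}  Mythos {mythos:4.1f}  Opus 4.6 {opus:4.1f}  gap {mythos - opus:+.1f} pts")

# The BrowseComp efficiency claim: matching-or-better accuracy at 4.9x fewer tokens
# means Mythos spends roughly 1/4.9, or about 20%, of the tokens Opus 4.6 uses.
token_ratio = 4.9
print(f"BrowseComp token budget relative to Opus 4.6: ~{100 / token_ratio:.0f}%")
```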
What It All Means
Mythos Preview is not a public model. Anthropic has restricted it to a closed group of security partners and enterprise organizations, citing its dual-use cybersecurity capabilities. But the benchmark profile reveals something broader: the gap between Mythos and the current public frontier is large enough that it represents a qualitative shift, not an incremental one.
Claude Opus 4.6 was already the benchmark leader in most categories when it launched in February 2026, with Claude Sonnet 4.6 in second place on the Artificial Analysis Intelligence Index. Mythos Preview — if released — would reset those leaderboards entirely. On SWE-bench Verified alone, its 93.9% would sit more than 13 points above any publicly available model.
The broader context is competitive pressure from all sides. Chinese open-source models like Z.ai’s GLM-5 have been closing the gap with closed US models on SWE-bench Verified. Mythos Preview suggests Anthropic is not standing still — and that the internal capability gap between what labs have and what they release publicly is wider than most observers assume.