Independent analyses of Claude Mythos are confirming a step change in the model's capabilities relative to the rest of the field.
METR, the AI safety evaluation organization, has published results showing that Claude Mythos Preview achieves a 50%-time-horizon of at least 16 hours on its software task benchmark — the upper boundary of what the organization can currently measure. The figure is the human-expert completion time at which the model's success rate is still 50%. At 16+ hours, Mythos has pushed past the ceiling of METR's existing evaluation infrastructure.

At The Edge Of The Measuring Stick
The 95% confidence interval for Mythos runs from 8.5 hours to 55 hours — a wide band that reflects a fundamental constraint: METR’s current task suite includes only five tasks estimated at 16 hours or longer, out of 228 total. That thin coverage makes precise quantification at this range unstable. METR has been explicit that it does not consider these measurements reliable enough for exact comparisons or extrapolations, and is not highlighting a specific hour figure as a headline claim for models above the 16-hour threshold.
What the data does show clearly is the trajectory. GPT-4o, released in mid-2024, had a 50%-time-horizon of roughly 7 minutes. Sonnet 3.7 reached around 2 hours. OpenAI’s o3 pushed further. Claude Opus 4.6 and GPT-5.2 (high) cluster around 5–6 hours. Mythos Preview lands above all of them, past the point where METR’s current tools can give a firm answer.
The doubling time for task-completion horizons across frontier models, based on METR's data from January 2024 through February 2026, is approximately 105 days, which compounds to more than a tenfold increase per year. Mythos sits at the leading edge of that curve.
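For readers who want to check the compounding, the conversion from doubling time to annual growth is straightforward arithmetic. The short sketch below is purely illustrative; the 105-day figure comes from METR's published trend data, not from this script.

```python
# Convert a doubling time (in days) into an annual growth multiple.
# Illustrative arithmetic only; not part of METR's published analysis.

DOUBLING_TIME_DAYS = 105
DAYS_PER_YEAR = 365

doublings_per_year = DAYS_PER_YEAR / DOUBLING_TIME_DAYS   # ~3.5 doublings per year
annual_multiple = 2 ** doublings_per_year                 # ~11x growth

print(f"{doublings_per_year:.2f} doublings/year -> {annual_multiple:.1f}x per year")
# 3.48 doublings/year -> 11.1x per year
```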
What The Time Horizon Actually Measures
METR’s time horizon metric is frequently misread. It does not measure how long an AI spends on a task — models typically complete tasks significantly faster than humans, because they write code in fewer iterations and require less lookup time. The metric measures task difficulty, expressed as the human completion time at which the model’s success rate hits 50%.
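To make the definition concrete, here is a minimal sketch of how a 50% horizon can be read off a fitted success curve. The task times and outcomes below are invented for illustration, and the generic logistic regression on log task length stands in for METR's actual estimation pipeline.

```python
# Minimal sketch: estimate a 50%-time-horizon from (human task time, model success)
# pairs by fitting a logistic curve against log2(task length).
# All data is invented for illustration; this is not METR's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical tasks: human completion time in minutes, and whether the model succeeded.
human_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960, 1920])
model_success = np.array([1, 1, 1,  1,  1,  1,   0,   1,   0,   0])

# Fit success probability as a logistic function of log2(human completion time).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_success)

# The 50% horizon is the task length at which predicted success crosses 0.5,
# i.e. where the logistic's linear term equals zero.
b0, b1 = clf.intercept_[0], clf.coef_[0][0]
horizon_minutes = 2 ** (-b0 / b1)
print(f"Estimated 50% time horizon: {horizon_minutes / 60:.1f} hours")
```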
A 16-hour time horizon does not mean Mythos can automate all 16-hour jobs. METR’s task suite is weighted toward software engineering, machine learning, and cybersecurity — domains where Mythos has already demonstrated dramatic benchmark leads over publicly available models. Performance across other domains varies. Real-world work also involves stakeholder communication, tacit organizational knowledge, and success criteria that can’t be scored algorithmically — none of which METR’s tasks capture.
The metric is best read as a measure of autonomous, well-specified, self-contained technical work that an AI can complete reliably — a meaningful but deliberately bounded slice of what economic productivity actually looks like.
Independent Confirmation Of A Capability Shift
The METR evaluation was conducted in March 2026 during a limited assessment window, primarily to inform Anthropic’s risk assessment process ahead of broader deployment decisions. The timing aligns with Anthropic’s Project Glasswing rollout, which has restricted Mythos access to a vetted group of security and enterprise partners rather than releasing the model publicly.
The METR numbers track with what other evaluations have shown. On SWE-bench Verified, the primary real-world software engineering benchmark, Mythos scores 93.9% — more than 13 points above any publicly available model. On SWE-bench Pro, the harder production-grade tier, it leads GPT-5.4 by 20 points. On BioMysteryBench, it solved 30% of expert-authored bioinformatics problems that human scientists could not answer at all.
Mozilla’s Firefox team offered perhaps the most concrete real-world signal: using Mythos Preview, they fixed 423 security bugs in April 2026 alone — compared to a prior monthly average of 17 to 31. The model identified decades-old vulnerabilities requiring multi-component reasoning across large codebases, including a 20-year-old XSLT bug and a race condition enabling potential sandbox escape.
The Measurement Problem Is Now The Story
METR’s most significant finding may be less about Mythos specifically and more about the state of AI evaluation infrastructure. The organization is candid that its current task suite was not designed for models at this capability level. Five tasks above 16 hours is not enough to draw a reliable logistic curve. METR has said it believes the suite can still distinguish Mythos from publicly available models — but precise quantification at this range requires new tasks, new human baselines, and substantially more evaluation capacity.
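One way to see why sparse coverage at the top end destabilizes the estimate is to resample the task set and refit the curve. The sketch below reuses the same kind of invented data as the earlier example, with only a couple of very long tasks, and shows how widely the fitted horizon swings across bootstrap resamples. It illustrates the statistical issue, not METR's uncertainty methodology; the spread is driven entirely by how thinly the long tasks are sampled.

```python
# Illustration of why few long tasks -> unstable horizon estimates.
# Bootstrap-resample an invented task suite and watch the fitted 50% horizon swing.
# Not METR's methodology; purely a statistical sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Invented suite: many short and medium tasks, only a few very long ones.
human_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960, 1920, 3840])
model_success = np.array([1, 1, 1,  1,  1,  1,   1,   1,   1,   0,    0])

def fit_horizon_hours(times, successes):
    X = np.log2(times).reshape(-1, 1)
    clf = LogisticRegression().fit(X, successes)
    return 2 ** (-clf.intercept_[0] / clf.coef_[0][0]) / 60

horizons = []
for _ in range(200):
    idx = rng.integers(0, len(human_minutes), len(human_minutes))
    t, s = human_minutes[idx], model_success[idx]
    if len(set(s)) < 2:   # a resample needs both successes and failures to fit
        continue
    horizons.append(fit_horizon_hours(t, s))

print(f"Bootstrap horizon estimates span {min(horizons):.0f} to {max(horizons):.0f} hours")
```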
That gap between model capability and evaluation infrastructure is consequential. Safety assessments, deployment decisions, and regulatory frameworks all depend on the ability to measure what these systems can actually do. When a model outgrows the ruler, the priority shifts from reading the number to building a longer one.
METR has indicated it is actively working to expand its task suite to cover longer time horizons. Until that work is complete, the honest answer to “how capable is Mythos?” is: more capable than we can currently measure with confidence.