Exponential Progress: Claude Opus 4.6 Has a 50% Time Horizon of 14.5 Hours on the METR Time Horizons Benchmark

Elon Musk has said we have entered the singularity, and Sam Altman has said that AI is progressing faster than he expected. Some of those observations are now showing up in benchmark data.

New measurements from METR, a nonprofit focused on evaluating AI capabilities, show that Claude Opus 4.6, Anthropic’s latest flagship model, can complete tasks that would take a skilled human expert nearly 15 hours to finish, succeeding about half the time. That number, extraordinary on its own, becomes even more striking when you see how quickly the industry got there.

What the benchmark actually measures

METR’s “time horizon” benchmark is a straightforward but revealing test: give an AI agent complex, self-contained tasks drawn from software engineering, machine learning, and cybersecurity, and see how difficult a task it can complete successfully. Difficulty is measured not in abstract points but in human time: specifically, how long a skilled human professional, with no prior context on the task, would need to finish the same job.

The “50%-time horizon” is the task length at which a given AI succeeds roughly half the time. Think of it like a high jump bar. If a model has a 2-hour time horizon, that means it clears tasks a human expert would need 2 hours to complete about 50% of the time. The tasks are designed to be messy and realistic enough to require genuine problem-solving — iteratively debugging complex systems, training machine learning models, implementing technical protocols — rather than rote pattern matching.
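
To make that definition concrete, here is a minimal sketch in Python of how a 50% time horizon can be estimated: record whether a model succeeded on tasks of various human-completion lengths, fit a logistic curve to those records, and solve for the length at which the predicted success rate crosses 50%. The run data and modeling details below are illustrative assumptions, not METR’s actual code or dataset.

```python
# Illustrative sketch of estimating a 50% time horizon.
# Assumption: success probability falls off logistically with the log of the
# task's human completion time. The run data is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (human minutes needed for the task, did the AI agent succeed?)
runs = [
    (10, 1), (30, 1), (60, 1), (120, 1), (240, 1), (480, 1),
    (720, 1), (870, 1), (870, 0), (1200, 0), (2400, 0), (4800, 0),
]
X = np.log2([minutes for minutes, _ in runs]).reshape(-1, 1)
y = np.array([succeeded for _, succeeded in runs])

fit = LogisticRegression().fit(X, y)

# The 50% horizon is where the fitted log-odds cross zero:
# intercept + coef * log2(minutes) = 0
horizon_minutes = 2 ** (-fit.intercept_[0] / fit.coef_[0][0])
print(f"Estimated 50% time horizon: {horizon_minutes / 60:.1f} hours")
```

Working in log space reflects the intuition that the step from a 1-hour task to a 2-hour task is comparable to the step from a 4-hour task to an 8-hour one.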

Claude Opus 4.6 on METR: the numbers

METR’s latest measurements put Claude Opus 4.6 at a 50%-time horizon of approximately 14.5 hours, meaning it succeeds roughly half the time on tasks that would take a seasoned human professional more than a full workday. For context, tasks at this level include things like implementing a complex network protocol from scratch while working from multiple technical specifications simultaneously.

The confidence interval on this measurement is wide — 6 hours to 98 hours — and METR is candid about why: their current task suite is nearly saturated, meaning Claude Opus 4.6 is bumping up against the ceiling of what the benchmark was designed to measure. The researchers themselves caution that “near-saturation can have unintuitive consequences for the time-horizon estimates,” and that they are actively developing updated methods to better track where state-of-the-art models actually stand. In other words, the benchmark may be underselling the model’s true capability.

Why the graph matters as much as the number

Perhaps the most remarkable part of this story isn’t the 14.5-hour figure itself — it’s how fast the industry arrived there. The METR chart plots time horizon against model release date, and the trendline is not linear. It is exponential.

Back in mid-2024, frontier models like GPT-4o had time horizons measured in single-digit minutes. By early 2025, models were clearing tasks in the range of 15 to 30 minutes. By late 2025, that had jumped to several hours. Now, in early 2026, Claude Opus 4.6 sits at 14.5 hours. METR’s fitted trend line shows a doubling time of approximately 123 days based on data from 2023 onward — meaning AI task-completion capability has been roughly doubling every four months.
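
As a back-of-the-envelope illustration of what a 123-day doubling time implies, the Python sketch below simply extrapolates the 14.5-hour figure forward on the assumption that the trend continues unchanged. It is arithmetic, not a forecast, and the helper function is hypothetical rather than anything METR publishes.

```python
# Back-of-the-envelope extrapolation: what a fixed 123-day doubling time would
# imply if the trend simply continued. Illustrative arithmetic, not a forecast.
def projected_horizon_hours(current_hours: float, days_ahead: float,
                            doubling_days: float = 123.0) -> float:
    """Time horizon after `days_ahead` days under steady exponential growth."""
    return current_hours * 2 ** (days_ahead / doubling_days)

for days in (123, 246, 365):
    print(f"{days} days out: ~{projected_horizon_hours(14.5, days):.0f} hours")
```

Under that assumption, the horizon passes roughly 29 hours after four months and roughly 113 hours within a year.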

On a linear scale, this trajectory would look like a curve bending almost straight up. On the logarithmic scale METR uses in its chart, it looks like a straight line, which is precisely the point: exponential growth plotted on a log scale produces a straight line, and that line has been remarkably consistent. The R² value of the exponential fit is 0.93, indicating that the trend is not noise. It is a pattern.
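
For readers who want to see those mechanics, the short sketch below generates synthetic points scattered around an assumed exponential trend, fits a straight line to log2(time horizon) versus release date, and reads the doubling time and R² off that fit. The synthetic data stands in for METR’s measurements purely to show why the log-scale plot comes out as a line.

```python
# Why exponential growth plots as a straight line on a log scale: fit a line to
# log2(time horizon) versus release date. The slope is doublings per day, so
# 1/slope is the doubling time. Synthetic data, not METR's measurements.
import numpy as np

rng = np.random.default_rng(0)
true_doubling_days = 123.0

days = np.arange(0, 1100, 100)  # hypothetical release dates, in days
log2_horizon = days / true_doubling_days + rng.normal(0, 0.4, days.size)

slope, intercept = np.polyfit(days, log2_horizon, 1)  # straight-line fit
doubling_days = 1.0 / slope

pred = slope * days + intercept
r2 = 1 - np.sum((log2_horizon - pred) ** 2) / np.sum((log2_horizon - log2_horizon.mean()) ** 2)

print(f"recovered doubling time: {doubling_days:.0f} days, R^2: {r2:.2f}")
```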

What this does and doesn’t mean

METR is careful to note what the time horizon does not imply. A 14.5-hour time horizon doesn’t mean AI can now do 14.5 hours of the kind of work an experienced professional does as part of their established daily role, with years of organizational context, relationships, and institutional knowledge. The benchmark measures performance on well-specified, self-contained tasks given to someone — or something — with no prior context. Think of it as what a highly skilled contractor, handed a clearly written brief, could accomplish.

Most real jobs are also not composed purely of software engineering or machine learning tasks, and many involve human judgment calls, ambiguous goals, and collaboration that algorithmic scoring can’t fully capture. The researchers note that AI performance drops substantially when tasks are evaluated holistically rather than by clear automated criteria.

Still, the direction of travel is unambiguous, and the pace is faster than almost anyone predicted just two years ago. Whether or not one uses the word “singularity,” the data suggests that the gap between what AI agents can do autonomously and what human professionals can do is narrowing at a pace that warrants serious attention from business leaders, policymakers, and anyone whose work involves long, complex cognitive tasks.
