Claude Fable 5 Tops New Artificial Analysis Intelligence Index v4.1 With Score Of 60, Claude Opus 4.8 Placed Second

Anthropic is dominating the newly rejigged Artificial Analysis Intelligence Index. Artificial Analysis has rolled out version 4.1 of its Intelligence Index, and the top of the leaderboard belongs to a model nobody outside Anthropic can currently use. Claude Fable 5, running with an Opus 4.8 fallback, scores 60 and leads the pack by four clear points, but the model was pulled offline worldwide after US export control directives barred foreign access on national security grounds. That leaves Claude Opus 4.8 (max) as the most capable model anyone can actually call through an API right now, scoring 56 and sitting just one point above OpenAI’s GPT-5.5 (xhigh) at 55.

The gap between Opus 4.8 and GPT-5.5 is well within the index’s usual margin of error, but Anthropic also takes fifth place, with Claude Sonnet 4.6 (max) at 47. That puts it just behind Google’s Gemini 3.5 Flash, which surprised many by landing fourth at 50 despite its “Flash” branding usually implying a lighter, cheaper model rather than a frontier contender.

What Changed In v4.1

Artificial Analysis rebuilt the index around agentic workloads this round. Terminal-Bench Hard was swapped for Terminal-Bench 2.1, τ²-Bench Telecom became τ³-Bench Banking, and GDPval-AA was upgraded to a v2 that re-baselines its Elo scale to human performance at 1000, brings in a rotating panel of frontier-model judges, and stretches the turn limit from 100 to 250 so agents get more room to work through longer tasks. IFBench was dropped entirely for having saturated to the point where it stopped separating frontier models from each other.

The bigger structural addition is three new per-task metrics: cost, time, and output tokens, all calculated by running the full Intelligence Index suite and dividing by task count. Artificial Analysis is also now reporting cached input tokens separately, which changes how cheap or expensive a model actually looks once caching discounts are factored in.
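The per-task arithmetic described above can be sketched in a few lines. This is only an illustration of "run the full suite, then divide by task count": the `RunTotals` type, its field names, and the totals below are hypothetical and do not come from Artificial Analysis.

```python
from dataclasses import dataclass


@dataclass
class RunTotals:
    """Hypothetical totals from one full pass over the index suite."""
    total_cost_usd: float
    total_minutes: float
    total_output_tokens: int
    task_count: int


def per_task_metrics(run: RunTotals) -> dict[str, float]:
    """Divide each suite-wide total by the number of tasks in the suite."""
    if run.task_count <= 0:
        raise ValueError("task_count must be positive")
    return {
        "cost_usd": run.total_cost_usd / run.task_count,
        "minutes": run.total_minutes / run.task_count,
        "output_tokens": run.total_output_tokens / run.task_count,
    }


# Made-up totals, chosen only to make the division easy to follow.
example = RunTotals(
    total_cost_usd=500.0,
    total_minutes=1000.0,
    total_output_tokens=4_000_000,
    task_count=400,
)
metrics = per_task_metrics(example)
print(metrics)
```

Because every metric shares the same denominator, a model that spends heavily on a handful of long tasks will still show a high per-task average even if most of its tasks are cheap.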

Open Weights And The Rest Of The Field

Among open weights models, DeepSeek V4 Pro (max) and MiniMax-M3 share the lead at 44, narrowly ahead of Kimi K2.6 at 43 and MiMo-V2.5-Pro at 42. Gemini 3.1 Pro Preview and Qwen3.7 Max are tied at 46, just below Sonnet 4.6. Further down, models like Grok 4.3 (high), Nemotron 3 Ultra, and MiniMax-M2.7 cluster around 38, while smaller and budget-tier releases such as gpt-oss-20B and Nova 2.0 Pro Preview sit at the bottom of the chart.

The Cost Of Being Smartest

Running the smartest model in the world isn’t cheap. Claude Opus 4.8 (max) is the most expensive model currently available to use, at $1.78 per Intelligence Index task, and Claude Fable 5 would cost $3.25 per task if anyone outside Anthropic could access it. GPT-5.5 (xhigh) scores within a point of Opus 4.8 while costing a little over half as much, at $0.99 per task. That pricing gap is likely to matter more to enterprise buyers than the single-point intelligence difference. On the other end, DeepSeek V4 Pro (max) stands out as a clear efficiency outlier, running the index at a fraction of what the leading proprietary models charge while still matching MiniMax-M3 for the top open weights score.
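To put the price gap in concrete terms, here is a quick calculation using only the per-task costs quoted above. The `relative_cost` helper is a hypothetical name used for illustration, not something Artificial Analysis publishes.

```python
# Cost per Intelligence Index task in USD, as quoted in this article.
cost_per_task = {
    "Claude Fable 5": 3.25,
    "Claude Opus 4.8 (max)": 1.78,
    "GPT-5.5 (xhigh)": 0.99,
}


def relative_cost(model: str, baseline: str) -> float:
    """Return the model's per-task cost as a multiple of the baseline's."""
    return cost_per_task[model] / cost_per_task[baseline]


gpt_vs_opus = relative_cost("GPT-5.5 (xhigh)", "Claude Opus 4.8 (max)")
fable_vs_opus = relative_cost("Claude Fable 5", "Claude Opus 4.8 (max)")

print(f"GPT-5.5 (xhigh) costs {gpt_vs_opus:.0%} of Opus 4.8 (max) per task")
print(f"Claude Fable 5 would cost {fable_vs_opus:.2f}x Opus 4.8 (max) per task")
```

On these figures GPT-5.5 (xhigh) comes in at about 56% of the Opus 4.8 (max) price, while Fable 5 would cost roughly 1.8 times as much as Opus 4.8 (max).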

Time And Reasoning Trade-offs

Time per task tells a related story. Grok 4.3 (high) finishes an average task in 1.5 minutes, the fastest among leading models, while Claude Sonnet 4.6 (max) takes 13.5 minutes, roughly nine times as long. Sonnet 4.6 actually runs slower than the larger Opus 4.8 here, simply because it burns through more output tokens working the same problems. Gemini 3.1 Pro Preview is the real standout on efficiency, scoring 46 while needing just 1.6 minutes per task, which puts it within a point of Sonnet 4.6’s score at a fraction of the inference time.

GDPval-AA v2 carries the heaviest weight in the index at 20%, ahead of Terminal-Bench 2.1 at 16% and τ³-Bench Banking at 14%. Claude Fable 5 leads this evaluation by a wide margin at 1818 Elo, with Opus 4.8 180 points behind at 1638 and GPT-5.5 (xhigh) at 1531, extending the lead Anthropic already held on the same benchmark when Opus 4.8 first launched.
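For a rough sense of what those Elo gaps mean, the sketch below converts rating differences into expected head-to-head win rates. It assumes the conventional Elo logistic curve with a 400-point scale. The article does not say whether GDPval-AA v2 uses exactly this formula, so treat the output as a ballpark, not as Artificial Analysis's own numbers.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    The 400-point scale is the conventional Elo default, assumed here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# GDPval-AA v2 Elo ratings as quoted in this article.
elo = {
    "Claude Fable 5": 1818,
    "Claude Opus 4.8": 1638,
    "GPT-5.5 (xhigh)": 1531,
}

fable_over_opus = expected_win_rate(elo["Claude Fable 5"], elo["Claude Opus 4.8"])
opus_over_gpt = expected_win_rate(elo["Claude Opus 4.8"], elo["GPT-5.5 (xhigh)"])

print(f"Fable 5 vs Opus 4.8: {fable_over_opus:.1%}")
print(f"Opus 4.8 vs GPT-5.5 (xhigh): {opus_over_gpt:.1%}")
```

Under that assumption, the 180-point gap between Fable 5 and Opus 4.8 works out to an expected win rate of about 74% in a pairwise comparison, and the 107-point gap between Opus 4.8 and GPT-5.5 (xhigh) to about 65%.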

What makes this round of results a little awkward for Anthropic is that its best work is currently locked away. The company built a model that beats everything else on the market by a comfortable margin, then had to take it offline globally within hours of a government directive, leaving Opus 4.8 to carry the flag instead. Given how often Anthropic, OpenAI, and Google have swapped the top spot over the past few index revisions, that one-point gap between Opus 4.8 and GPT-5.5 isn’t likely to hold for long either way.