DeepSeek V4 Pro Becomes Second-Highest Rated Open Model On Artificial Analysis Index With Score Of 52

After a few months in the wilderness, DeepSeek is back among the top open models in the world.

Artificial Analysis has published its evaluation of DeepSeek’s newly released V4 models, and the results are strong. DeepSeek V4 Pro (Max) scores 52 on the Artificial Analysis Intelligence Index v4.0 — a composite of 10 evaluations spanning knowledge, reasoning, coding, and agentic tasks — making it the #2 open-weights reasoning model behind only Kimi K2.6 (54). V4 Flash (Max) scores 47, below V4 Pro but ahead of DeepSeek V3.2 (42).

The score is a 10-point jump over V3.2 — a significant single-generation gain. For context, DeepSeek R1 and R1-0528 had previously held the top open-weights reasoning position, before a wave of competing releases from Chinese labs pushed the company down the rankings.

A Major Architecture Upgrade

V4 is DeepSeek’s first new architecture since V3, and the scale-up is substantial. V4 Pro comes in at 1.6 trillion total parameters with 49 billion active — more than double V3’s 671B total / 37B active. V4 Flash, at 284B total / 13B active, is much smaller but sits strongly on the Intelligence vs Size frontier, benchmarking near MiniMax-M2.7 (47). Both models have a 1M token context window, an 8x expansion over V3.2’s 128K, and both are released under the MIT license.

Leading Open Weights on Agentic Tasks

The most compelling result for enterprise and developer use cases is V4 Pro’s performance on GDPval-AA — Artificial Analysis’s benchmark for real-world agentic work tasks. V4 Pro (Max) scores 1554, ahead of GLM-5.1 (1535), MiniMax-M2.7 (1514), Kimi K2.6 (1484), and GLM-5 (1402). This makes it the leading open-weights model on agentic real-world tasks at launch.

The agentic lead matters more than most benchmark wins. As AI deployment shifts toward autonomous agents handling complex multi-step tasks, performance on GDPval-AA is a closer proxy for production value than traditional reasoning benchmarks.

The Hallucination Problem

One significant caveat in Artificial Analysis’s findings: both V4 models have very high hallucination rates. V4 Pro hallucinates 94% of the time when it doesn’t know an answer — meaning it nearly always responds anyway rather than abstaining. V4 Flash is worse at 96%. On the AA-Omniscience index, V4 Pro scores -10 (an 11-point improvement over V3.2’s -21), while V4 Flash scores -23, broadly in line with V3.2.

The improvement in accuracy is real, but the near-total failure to abstain is a meaningful risk for production deployments where confident wrong answers are worse than acknowledged uncertainty. Contrast this with GLM-5, which achieved a 56 percentage-point reduction in hallucination rate through more frequent abstention — a design choice DeepSeek has not yet made.
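To see why a high answer rate can drag a knowledge score negative, here is a back-of-envelope sketch. It assumes (the article does not spell this out) a scoring rule along the lines of AA-Omniscience’s: correct answers score +1, incorrect answers −1, abstentions 0, averaged over all questions. The accuracy figures are illustrative, not DeepSeek’s actual numbers.

```python
# Sketch of how modest accuracy plus a high answer rate yields a
# negative knowledge score. Scoring rule assumed: +1 correct,
# -1 incorrect, 0 abstain, averaged over all questions.

def knowledge_score(p_correct: float, p_answer_when_unsure: float) -> float:
    """p_correct: fraction of questions the model actually knows.
    p_answer_when_unsure: fraction of unknown questions it answers
    anyway (the hallucination rate); those answers score as incorrect."""
    p_unknown = 1.0 - p_correct
    incorrect = p_unknown * p_answer_when_unsure
    return round((p_correct - incorrect) * 100, 1)

# Hypothetical model that knows 45% of answers but guesses on 94%
# of the rest: the penalties outweigh the correct answers.
print(knowledge_score(0.45, 0.94))   # lands below zero
# Same accuracy, but abstaining on 90% of unknowns, stays positive.
print(knowledge_score(0.45, 0.10))
```

The sketch makes the trade-off concrete: at a 94% answer-when-unsure rate, abstention — not raw accuracy — is the cheapest lever for lifting the score.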

Cost: Cheap vs. Frontier, Expensive vs. Open Weights

The pricing picture is more complicated than DeepSeek’s headline per-token rates suggest. V4 Pro costs $1,071 to run the full Artificial Analysis Intelligence Index — less than a quarter the cost of Claude Opus 4.7 ($4,811), but notably more expensive than several open-weights peers: Kimi K2.6 ($948), GLM-5.1 ($544), DeepSeek V3.2 ($71), and gpt-oss-120B ($67). V4 Flash comes in at $113.

The driver is token volume. V4 Pro generates 190 million output tokens to complete the index benchmarks — among the highest of any model tested. V4 Flash is even more token-intensive at 240 million output tokens, despite being the cheaper model per token. Reasoning models that think at length before answering push up output token counts, which offsets the benefit of low per-token pricing. Developers building with these models should factor in expected reasoning depth — the per-token rates are low, but the token counts can be high.
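The arithmetic is easy to check from the article’s own figures. Dividing each model’s total benchmark cost by its output-token volume gives an implied blended rate — a rough upper bound on the output price, since input tokens are lumped into the numerator:

```python
# Back-of-envelope implied rates from the figures above:
# total Intelligence Index cost / output tokens generated.
# Input-token costs are folded in, so this slightly overstates
# the true per-output-token price for each model.
costs_usd = {"V4 Pro": 1071, "V4 Flash": 113}
output_tokens_millions = {"V4 Pro": 190, "V4 Flash": 240}

for model, cost in costs_usd.items():
    rate = cost / output_tokens_millions[model]
    print(f"{model}: ~${rate:.2f} per million output tokens")
```

V4 Flash’s implied rate works out to roughly a tenth of V4 Pro’s, yet its total bill is only about an order of magnitude lower than Pro’s because it emits even more tokens — exactly the dynamic the paragraph above describes.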

Where DeepSeek Stands in the Broader Race

The open-weights leaderboard has become intensely competitive. Chinese labs now dominate the top of the open-weights rankings — Kimi K2.6 (54) holds the top spot, GLM-5.1 (51) sits third, and MiniMax-M2.7 (50) is fourth. DeepSeek’s V4 Pro entry at 52 slots into a cluster of Chinese models that have collectively pushed Western open-weights offerings — including OpenAI’s gpt-oss-120B (33) — well down the rankings.

DeepSeek’s original R1 was the model that first forced the world to take Chinese AI seriously. The period since then has seen the lab go through a quieter stretch as Kimi, GLM, and Qwen pulled ahead. V4 Pro’s return to the top two suggests DeepSeek hasn’t ceded ground so much as been temporarily overtaken in a field that’s moving fast in every direction. The company’s own assessment — that it trails frontier closed-source labs by 3-6 months — remains the honest framing. But among open-weights models, it’s back near the top.

Posted in AI