DeepSeek revealed two highly capable V4 models today, and it has even published its own estimate of how far it trails the top US labs.
In its technical report accompanying the V4 release, DeepSeek states that V4-Pro-Max “demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks” but falls “marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months.” The acknowledgment is unusually candid — most labs either avoid direct comparisons or frame everything as a win.

What The Gap Actually Looks Like
Three to six months is a long time in AI development. At the current pace of frontier model releases, a 3-6 month lag means DeepSeek is competitive with models that were considered state-of-the-art late last year or early this year: it is operating at the leading edge of what was possible six months ago, and closing the gap.
The benchmark numbers support this framing. On coding, V4-Pro-Max leads the field: its Codeforces rating of 3206 is above both GPT-5.4 (3168) and Gemini-3.1-Pro (3052), and its LiveCodeBench score of 93.5 beats all comers. On math, V4-Pro-Max is similarly dominant, posting strong results on IMOAnswerBench (89.8) and HMMT 2026 (95.2). The gap shows up most visibly on HLE (Humanity’s Last Exam), where V4-Pro-Max scores 37.7 to Gemini’s 44.4, and on factual knowledge retrieval, where its SimpleQA-Verified score of 57.9 versus Gemini’s 75.6 is a material difference. DeepSeek acknowledges it “trails Gemini-3.1-Pro on rich world knowledge.”
For agentic tasks, the emerging frontier of practical AI deployment, the gap narrows considerably. SWE-bench Verified puts V4-Pro-Max at 80.6, tied with Gemini-3.1-Pro (80.6) and within a fraction of Claude Opus 4.6 (80.8). It’s ahead of both on Toolathlon (51.8 vs 47.2 and 48.8 respectively), though GPT-5.4 leads at 54.6.
V4-Flash Benchmarks Against Yesterday’s Frontier
DeepSeek’s technical report also notes that V4-Flash-Max “achieves comparable performance to GPT-5.2 and Gemini-3.0-Pro” — meaning the Flash model, its cheaper and faster option, is approximately on par with what the top US labs were shipping roughly six months ago. For developers building production systems, that’s a meaningful data point: the budget-tier model in DeepSeek’s lineup is operationally equivalent to what would have been a frontier closed-source model not long ago, at a fraction of the price.
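As a rough sketch of what that trade-off looks like in practice, the snippet below points the standard OpenAI-compatible Python client at DeepSeek’s API. The model identifier deepseek-v4-flash is a hypothetical placeholder, since the report does not specify an API name.

```python
# Sketch: calling a DeepSeek model through the OpenAI-compatible client.
# The model name "deepseek-v4-flash" is a hypothetical placeholder; check
# DeepSeek's documentation for the real identifier and current pricing.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # issued from the DeepSeek platform
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",            # hypothetical V4-Flash identifier
    messages=[{"role": "user", "content": "Summarize this changelog in three bullets."}],
)
print(response.choices[0].message.content)
```

Because the API surface matches what most teams already use with closed-source models, swapping in a cheaper tier is close to a one-line configuration change, which is what makes the price-performance comparison operational rather than theoretical.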
Why This Matters Beyond Benchmarks
DeepSeek’s journey has been one of rapid, consistent iteration. From V3, which first caught researchers’ attention, to R1, which topped the US App Store, the company has compressed the capability gap with each successive release. If V4 trails the frontier by 3-6 months, and V3 trailed by more, the trend line is pointing in one direction.
This matters especially because DeepSeek is operating under hardware constraints that US labs are not. Reports last year suggested that DeepSeek has around 50,000 H100 GPUs: an impressive fleet, but well below the compute budgets of OpenAI and Google. Achieving near-frontier performance under export control restrictions is the real story behind the benchmark numbers. It also explains DeepSeek’s architectural focus on efficiency: DSA (DeepSeek Sparse Attention) and token-wise compression aren’t just clever engineering; they’re adaptations to a constrained compute environment.
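To make the efficiency point concrete, the sketch below shows the general idea behind sparse attention: each query attends only to its top-k highest-scoring keys instead of the full sequence. This is a minimal illustration of the technique under our own assumptions, not DeepSeek’s actual DSA implementation; the function name and tensor shapes are ours.

```python
# Minimal sketch of top-k sparse attention (illustrative, not DeepSeek's DSA).
# Each query keeps only its top_k highest-scoring keys; the rest are masked
# out before the softmax so they receive zero attention weight.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / d**0.5        # (B, H, L, L)

    # Find the k-th largest score per query row and mask everything below it.
    top_k = min(top_k, scores.size(-1))
    kth = scores.topk(top_k, dim=-1).values[..., -1:]  # (B, H, L, 1)
    scores = scores.masked_fill(scores < kth, float("-inf"))

    return F.softmax(scores, dim=-1) @ v               # (B, H, L, head_dim)

q = k = v = torch.randn(1, 8, 1024, 64)
print(topk_sparse_attention(q, k, v).shape)            # torch.Size([1, 8, 1024, 64])
```

Note that this naive version still computes the full score matrix before masking, so it demonstrates the selection logic rather than the savings; a production design would select keys with a cheap indexing pass first, so the expensive attention step only runs over the chosen subset.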
The broader Chinese AI ecosystem adds context here. Moonshot AI’s Kimi K2 and Alibaba’s Qwen have both pushed the open-source frontier in recent months, and Chinese models have taken a growing share of the open-source market at the expense of Western alternatives. DeepSeek’s V4 release is part of a broader competitive surge, not an isolated event.
The Honest Assessment Problem
There’s a strategic angle to DeepSeek publishing a 3-6 month gap estimate. It sets expectations low enough to be easily beaten: if V5 closes to 1-2 months behind, that’s a narrative win. It also pre-empts criticism: by acknowledging the shortfall itself, DeepSeek frames the comparison on its own terms rather than letting benchmark aggregators define the story. And for enterprise buyers evaluating DeepSeek against closed-source alternatives, “3-6 months behind the frontier, at a fraction of the cost” may be a perfectly acceptable trade-off, particularly for coding and agentic tasks where V4-Pro-Max is already competitive or ahead.
The question is whether the gap will hold. NVIDIA CEO Jensen Huang has noted that Chinese AI labs are “the world’s leading open model companies.” If that’s true now, while trailing by 3-6 months, the implications of closing that gap — or eliminating it — are significant for the entire AI industry.