DeepSeek Releases DeepSeek V4-Pro & V4-Flash, Delivers GPT-5.4 & Opus 4.6-Level Performance At A Fraction Of The Price

DeepSeek first drew the world’s attention to China’s capabilities in AI before being overtaken by other Chinese labs, but the company has shown that it’s still very much in the game.

The Chinese AI lab has released two new models — DeepSeek V4-Pro and DeepSeek V4-Flash — continuing the company’s pattern of pairing strong capabilities with aggressive pricing. DeepSeek first made headlines when its R1 model matched OpenAI’s o1 at roughly 90% lower API costs, rattling markets and briefly crashing NVIDIA’s stock. The V4 release suggests the company is doubling down on that same formula.

DeepSeek V4 Models Under The Hood

The two models are meaningfully different in scale. DeepSeek V4-Pro is a 1.6 trillion parameter MoE (Mixture of Experts) model with 49B active parameters, pre-trained on 33 trillion tokens. DeepSeek V4-Flash is considerably lighter: 284B total parameters, 13B active, trained on 32T tokens. Both share a 1M context window, are open-sourced, and are available via API and web/app (V4-Pro in “Expert Mode,” V4-Flash in “Instant Mode”).

The active parameter count is the number that matters most for inference cost and speed. At 13B active params, V4-Flash is in the same ballpark as many mid-range models but benefits from a much larger pool of specialized expert layers — giving it performance that, according to DeepSeek, closely approaches V4-Pro on a wide range of tasks while being significantly faster and cheaper to run.
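As a rough intuition for why active parameters dominate inference cost, here is a minimal, generic mixture-of-experts layer in numpy: a router picks the top-k experts per token, so only a small slice of the total weight pool is exercised on each forward pass. The sizes and routing scheme are toy illustrations, not DeepSeek's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only; nothing here matches DeepSeek's real config.
d_model, n_experts, top_k = 64, 16, 2

# Each expert is a small two-layer FFN. In a real MoE, the expert pool holds
# most of the "total" parameters, while only top_k experts run per token.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top_k experts; only those weights are used."""
    logits = x @ router                               # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]     # chosen expert indices
    picked = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(picked - picked.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)             # softmax over chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                       # per-token dispatch, clarity over speed
        for s in range(top_k):
            w_in, w_out = experts[top[t, s]]
            h = np.maximum(x[t] @ w_in, 0.0)          # ReLU FFN expert
            out[t] += gates[t, s] * (h @ w_out)
    return out

y = moe_layer(rng.standard_normal((8, d_model)))      # (8, 64)
```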

The architectural highlight is what DeepSeek calls DSA — DeepSeek Sparse Attention — combined with token-wise compression, which the company says makes 1M-context inference practical at scale. That’s a meaningful claim: long-context inference is typically expensive enough that most providers either cap it or charge a premium. DeepSeek is making it the default.
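DeepSeek hasn't detailed DSA's internals here, but the core intuition behind sparse attention is straightforward: each query attends to a fixed budget of keys rather than all of them, cutting the quadratic cost of long contexts. Below is a minimal numpy sketch of generic top-k sparse attention, not DeepSeek's actual kernel; a production implementation would also avoid materializing the full score matrix, which this toy version computes for clarity.

```python
import numpy as np

def topk_sparse_attention(q, k, v, keep=64):
    """Each query attends only to its `keep` highest-scoring keys.

    Full attention touches all n keys per query (O(n^2) overall); capping
    the per-query budget is one generic way long-context cost is reduced.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale                      # toy version: full score matrix
    keep = min(keep, scores.shape[-1])
    idx = np.argsort(scores, axis=-1)[:, -keep:]    # top-`keep` keys per query
    picked = np.take_along_axis(scores, idx, axis=-1)
    w = np.exp(picked - picked.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                   # softmax over surviving keys
    return np.einsum("qk,qkd->qd", w, v[idx])       # weighted sum of selected values

rng = np.random.default_rng(1)
n, d = 1024, 32
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = topk_sparse_attention(q, k, v, keep=64)       # (1024, 32)
```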

Benchmark Performance: Where V4-Pro Stands

DeepSeek’s benchmark release is detailed and invites direct comparison against Claude Opus 4.6, GPT-5.4, and Gemini-3.1-Pro — all frontier closed-source models. The results are strong, with important nuance.

Knowledge & Reasoning

On MMLU-Pro, V4-Pro scores 87.5, matching GPT-5.4 exactly while trailing Gemini-3.1-Pro (91.0) and Claude Opus 4.6 (89.1). On LiveCodeBench, V4-Pro leads the pack at 93.5, ahead of Gemini (91.7) and Claude (88.8). Codeforces rating, a real-world competitive programming measure, puts V4-Pro at 3206, ahead of GPT-5.4 (3168) and Gemini (3052), with no score reported for Claude. On the Apex Shortlist (Pass@1), V4-Pro scores 90.2, beating Claude (85.9) and GPT-5.4 (78.1), though Gemini comes close at 89.1.

The HLE benchmark (Humanity’s Last Exam) is where V4-Pro falls behind: it scores 37.7, below GPT-5.4 (39.8) and Claude (40.0), and well behind Gemini (44.4). Gemini-3.1-Pro also leads on SimpleQA-Verified (75.6 vs V4-Pro’s 57.9), suggesting it retains an edge on factual world knowledge retrieval. DeepSeek acknowledges this directly: V4-Pro leads all open models but trails Gemini-3.1-Pro on rich world knowledge.

On math benchmarks, V4-Pro is world-class. IMOAnswerBench (89.8) is one of its strongest results relative to peers: well ahead of Claude (75.3) and Gemini (81.0), though GPT-5.4 edges ahead at 91.4. On HMMT 2026, Claude (96.2) and GPT-5.4 (97.7) both finish ahead of V4-Pro (95.2), though by narrow margins.

Agentic Capabilities

This is where DeepSeek’s pitch gets interesting. SWE-Verified (real software engineering tasks) shows V4-Pro at 80.6, within a fraction of a point of Claude (80.8) and matching Gemini (80.6); GPT-5.4 has no reported score. On Terminal Bench 2.0, V4-Pro (67.9) beats Claude (65.4) and is competitive with Gemini (68.5), though GPT-5.4 leads at 75.1. Toolathlon (51.8) puts V4-Pro ahead of Claude (47.2) and Gemini (48.8) but behind GPT-5.4 (54.6).

On MCPAtlas Public, V4-Pro (73.6) is essentially tied with Claude (73.8), while GPT-5.4 (67.2) and Gemini (69.2) trail. On BrowseComp, V4-Pro (83.4) is near-identical to Claude (83.7), with Gemini leading at 85.9.

What V4-Flash Sacrifices

V4-Flash holds up remarkably well. On MMLU-Pro it scores 86.2 versus V4-Pro’s 87.5. LiveCodeBench: 91.6 vs 93.5. Codeforces: 3052 vs 3206. SWE-Verified: 79.0 vs 80.6. The gap is consistent but not dramatic — roughly 1–3 percentage points across most benchmarks. The bigger drop-offs are on Terminal Bench 2.0 (56.9 vs 67.9) and SimpleQA-Verified (34.1 vs 57.9), suggesting V4-Flash is weaker on tasks requiring detailed factual recall and complex multi-step tool use. For most developer use cases, though, V4-Flash is a serious model — not a stripped-down fallback.

The Bigger Picture on Benchmarks

The headline takeaway: V4-Pro is genuinely competitive with GPT-5.4 and Claude Opus 4.6 across most categories, and beats both on coding benchmarks. It trails Gemini-3.1-Pro on general knowledge and HLE, and trails GPT-5.4 on a handful of agentic tasks. For an open-source model available via API at a fraction of closed-source prices, this is a meaningful achievement. DeepSeek also notes V4-Pro is integrated with Claude Code, OpenClaw, and OpenCode — a signal that the company is serious about agentic deployment, not just benchmark positioning.

DeepSeek V4-Pro And DeepSeek V4-Flash Pricing

Both V4-Pro and V4-Flash share a 1M token context window and a maximum output of 384K tokens. They support thinking and non-thinking modes, JSON output, Tool Calls, and Chat Prefix Completion (Beta). FIM Completion (Beta) is available in non-thinking mode only.
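DeepSeek’s API has historically been OpenAI-compatible, so calling the new models would plausibly look like the sketch below. The V4 model identifier is a placeholder assumption, and the exact flags for thinking mode, FIM, and prefix completion should be taken from DeepSeek’s documentation.

```python
# Minimal sketch of a request against DeepSeek's OpenAI-compatible endpoint.
# "deepseek-v4-flash" is a hypothetical model id, not a confirmed name.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",            # placeholder; check DeepSeek's docs
    messages=[
        {"role": "system", "content": "Reply in JSON."},
        {"role": "user", "content": "List three uses for a 1M-token context."},
    ],
    response_format={"type": "json_object"},  # JSON output mode
)
print(resp.choices[0].message.content)
```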

The differentiation is primarily in pricing:

|                    | V4-Flash           | V4-Pro             |
| ------------------ | ------------------ | ------------------ |
| Input (Cache Hit)  | $0.028 / 1M tokens | $0.145 / 1M tokens |
| Input (Cache Miss) | $0.14 / 1M tokens  | $1.74 / 1M tokens  |
| Output             | $0.28 / 1M tokens  | $3.48 / 1M tokens  |

V4-Flash is positioned as the cost-efficient option for high-volume, latency-sensitive use cases, while V4-Pro targets developers who need more reasoning depth and can absorb higher per-token costs.

In comparison, OpenAI’s GPT-5.4 costs $2.50 per 1M input tokens and $15.00 per 1M output tokens, while Claude Opus 4.6 costs $5 per 1M input tokens and $25 per 1M output tokens. On those numbers, DeepSeek, at least on benchmarks, delivers similar performance with output tokens roughly 77% cheaper than GPT-5.4 and 86% cheaper than Opus 4.6, and input tokens about 30% and 65% cheaper, respectively.
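To make that concrete, here is a quick back-of-the-envelope script using the prices quoted above; the 20K-input / 2K-output request shape is a hypothetical agentic workload chosen for illustration, not a published figure.

```python
# Cost comparison at the per-1M-token prices quoted above
# (cache-miss input rates for the DeepSeek models).
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "deepseek-v4-pro":   (1.74, 3.48),
    "deepseek-v4-flash": (0.14, 0.28),
    "gpt-5.4":           (2.50, 15.00),
    "claude-opus-4.6":   (5.00, 25.00),
}

def request_cost(model: str, in_tok: int, out_tok: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    p_in, p_out = PRICES[model]
    return (in_tok * p_in + out_tok * p_out) / 1_000_000

for model in PRICES:
    print(f"{model:18s} ${request_cost(model, 20_000, 2_000):.4f}")
# deepseek-v4-pro    $0.0418   (~48% below GPT-5.4, ~72% below Opus 4.6)
# deepseek-v4-flash  $0.0034
# gpt-5.4            $0.0800
# claude-opus-4.6    $0.1500
# Output-heavy workloads widen the gap further, since the output-price
# discount (77-86%) is larger than the input-price discount (30-65%).
```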

Context: DeepSeek’s Trajectory

DeepSeek is no overnight story. The company, founded in 2023 by Hangzhou-based hedge fund High-Flyer, has been releasing progressively stronger models since its early days. Its V3 series drew serious attention from researchers, and R1 made the company a household name in AI circles — briefly topping the US App Store ahead of ChatGPT.

Since then, however, other Chinese labs have moved fast. Qwen from Alibaba has been consistently competitive, and DeepSeek’s open-weights models have trailed Qwen on at least some benchmarks. The V4 release, with its dual-model approach and sharp pricing, looks like a deliberate effort to reclaim relevance — both for developers choosing APIs and for enterprises evaluating model providers.

The V4 pricing is notable when benchmarked against the broader market. At $3.48 per million output tokens, V4-Pro is still a fraction of what frontier closed-source models from OpenAI and Anthropic charge. V4-Flash at $0.28 per million output tokens is firmly in budget-tier territory, competitive with the cheapest options available.

What It Means for the Market

DeepSeek’s moves consistently force the rest of the industry to respond on price. When R1 priced its outputs at $2.19 per million tokens against OpenAI’s $60, the effect was immediate: OpenAI opened its advanced models to free-tier users shortly after. A similar dynamic could play out here, particularly for the enterprise and developer segment where token costs directly affect product economics.

The dual-model structure also mirrors what US labs like Anthropic (Opus/Sonnet/Haiku) and Google (Pro/Flash) have long offered — a capable flagship paired with a cheaper, faster option. DeepSeek is now playing the same product-line game, which suggests the company is thinking beyond benchmark wins and toward sustained commercial adoption.

Whether V4-Pro’s reasoning depth justifies the price gap over V4-Flash will depend on real-world developer feedback as much as on benchmark results, and DeepSeek has never been shy about publishing the latter.
