NVIDIA isn’t only making the chips that power the AI revolution; it’s also making some world-class AI models.
At Jensen Huang’s Computex keynote, NVIDIA announced the release of Nemotron 3 Ultra, the final and largest member of the Nemotron 3 family, which debuted in December 2025 with the Nano variant. With 550 billion total parameters, of which 55 billion (10%) are active per token (90% sparsity), it is the most intelligent open weights model released by a US lab to date.
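The sparsity figure follows directly from the two parameter counts quoted above. A quick check in Python, where the variable names are illustrative only:

```python
# Parameter counts as quoted in the article, in billions.
total_params_b = 550
active_params_b = 55

# Share of parameters used for each token, and the complementary sparsity.
active_fraction = active_params_b / total_params_b
sparsity = 1 - active_fraction

print(f"active per token: {active_fraction:.0%}, sparsity: {sparsity:.0%}")
# prints "active per token: 10%, sparsity: 90%"
```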
Leading the US Open Weights Pack
According to an evaluation conducted by Artificial Analysis in partnership with NVIDIA, Nemotron 3 Ultra scores 48 on the Artificial Analysis Intelligence Index — well ahead of the next strongest US open weights models: Gemma 4 31B (39), Nemotron 3 Super (36), and gpt-oss-120b (33).
That said, it still trails the Chinese-led open weights frontier. Kimi K2.6 leads at 54 on the index, followed by GLM-5.1 at 51 and MiniMax-M2.7 at 49. The US-China gap in open weights intelligence remains real, but Nemotron 3 Ultra narrows it significantly: the best US open weights model now sits 6 points behind Kimi K2.6, down from 15 points when Gemma 4 31B was the US leader.
Speed Is Where It Pulls Away
The more striking advantage is inference speed. On a pre-release DeepInfra endpoint, Nemotron 3 Ultra served over 300 tokens per second — roughly 3–6x faster than peer models from Chinese labs like DeepSeek and Moonshot’s Kimi, which typically run at 50–100 tokens per second in the market today. Even gpt-oss-120b, which matches Ultra’s speed tier, scores a full 15 points lower on intelligence.
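
The 3–6x range comes from dividing Ultra's throughput by the two ends of the peer range. The sketch below reproduces that arithmetic and, as a rough illustration, turns it into wall-clock time for a single response. The 2,000-token response length is a hypothetical figure chosen for the example, not a number from the evaluation.

```python
# Throughput figures quoted in the article, in tokens per second.
ultra_tps = 300               # "over 300" on the pre-release DeepInfra endpoint
peer_slow_tps, peer_fast_tps = 50, 100   # typical range quoted for peer models

speedup_low = ultra_tps / peer_fast_tps    # versus the fastest peers
speedup_high = ultra_tps / peer_slow_tps   # versus the slowest peers
print(f"speedup: {speedup_low:.0f}x to {speedup_high:.0f}x")

# Hypothetical response length, used only to make the difference tangible.
response_tokens = 2000
for name, tps in [("Ultra", ultra_tps),
                  ("peer, fast end", peer_fast_tps),
                  ("peer, slow end", peer_slow_tps)]:
    print(f"{name}: {response_tokens / tps:.1f} s")
```

For agentic workloads that chain many model calls, that per-call difference is paid again on every step.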

This places Nemotron 3 Ultra squarely in what Artificial Analysis calls the “most attractive quadrant”, where high intelligence meets high output speed, a quadrant no other model currently occupies.
Benchmark Performance
NVIDIA’s own slides show a mixed but competitive picture across tasks:
| Benchmark | Nemotron 3 Ultra | GLM 5.1 | Kimi K2.6 | Qwen3.5 |
|---|---|---|---|---|
| Agent Productivity | 91% | 84% | 91% | 89% |
| Long-Horizon Planning | 33% | 40% | 29% | 30% |
| Coding | 54% | 64% | 67% | 53% |
| Instruction Following | 82% | 77% | 74% | 78% |
| Professional Work Tasks | 56% | 46% | 56% | 53% |
| Long Context | 95% | N/A | N/A | 90% |
Ultra leads outright on instruction following and long context, and ties Kimi K2.6 for first on agent productivity and professional work tasks. It trails both GLM 5.1 and Kimi K2.6 on coding, and trails GLM 5.1 alone on long-horizon planning, where it still edges out Kimi K2.6 and Qwen3.5.
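The per-benchmark leaders can also be read off the table programmatically. A small sketch, with scores transcribed from the table, `None` standing in for N/A, and ties reported as shared leads (the `leaders` helper is an illustrative name):

```python
# Scores (percent) transcribed from the table; None marks an N/A entry.
MODELS = ["Nemotron 3 Ultra", "GLM 5.1", "Kimi K2.6", "Qwen3.5"]
TABLE = {
    "Agent Productivity":      [91, 84, 91, 89],
    "Long-Horizon Planning":   [33, 40, 29, 30],
    "Coding":                  [54, 64, 67, 53],
    "Instruction Following":   [82, 77, 74, 78],
    "Professional Work Tasks": [56, 46, 56, 53],
    "Long Context":            [95, None, None, 90],
}

def leaders(row):
    """Return the model(s) sharing the highest reported score in a row."""
    reported = [(m, s) for m, s in zip(MODELS, row) if s is not None]
    best = max(s for _, s in reported)
    return [m for m, s in reported if s == best]

for benchmark, row in TABLE.items():
    print(f"{benchmark}: {', '.join(leaders(row))}")
```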
Architecture Built for Speed
Under the hood, Nemotron 3 Ultra runs a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture — the same design philosophy introduced in the Nano and Super variants, but scaled massively. Key technologies include:
- LatentMoE: Compresses tokens into a low-rank latent space before routing, enabling 4x as many expert specialists at the same inference cost.
- Multi-Token Prediction (MTP): Predicts multiple future tokens in a single forward pass, improving chain-of-thought coherence and enabling built-in speculative decoding.
- 1M Token Context Window: Mamba-2 layers provide linear-time complexity over sequence length, making long-document and agentic workloads practical.
- NVFP4 Quantization: The model will be made available in NVFP4 precision (in addition to BF16 weights) for higher inference performance on Blackwell GPUs.
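
To make the LatentMoE bullet more concrete, here is a toy NumPy sketch of the general idea of routing in a compressed space: a token is down-projected to a smaller latent vector, a router picks the top-k experts from that latent, the chosen experts operate at the latent width, and the result is projected back up. Every size, weight matrix, and function name below is hypothetical, and each expert is reduced to a single linear map for brevity. This is not NVIDIA's implementation, only an illustration of why smaller per-expert matrices leave room for more experts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d_model, d_latent, n_experts, top_k = 64, 16, 8, 2

W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)      # compress
W_router = rng.normal(size=(d_latent, n_experts)) / np.sqrt(d_latent)
W_experts = rng.normal(size=(n_experts, d_latent, d_latent)) / np.sqrt(d_latent)
W_up = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)       # expand

def latent_moe_layer(x):
    """Route one token vector x (shape: d_model) through top-k latent experts."""
    z = x @ W_down                              # latent token, shape (d_latent,)
    logits = z @ W_router                       # one score per expert
    chosen = np.argsort(logits)[-top_k:]        # indices of the k largest scores
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                    # softmax over the chosen experts
    mixed = sum(w * (z @ W_experts[e]) for w, e in zip(weights, chosen))
    return mixed @ W_up, chosen                 # back to model width

x = rng.normal(size=d_model)
y, chosen = latent_moe_layer(x)
print(y.shape, len(chosen))

# Per-expert weight count at latent width versus full model width.
print(d_latent * d_latent, "vs", d_model * d_model)
```

In this toy setup each expert holds 256 weights instead of 4,096, which is the kind of saving that can be spent on a larger pool of experts. The actual ratios in Nemotron 3 Ultra are not stated here beyond the 4x figure above.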
Open Weights, But Not Quite Yet
Nemotron 3 Ultra is currently being released as a pretrained base checkpoint; it has not undergone instruction tuning or post-training alignment. NVIDIA describes it as the best possible starting point for fine-tuning on domain data, reinforcement learning post-training, or custom instruction pipelines. A fully post-trained version is expected to follow.
Nemotron 3 Super, the 120B-parameter middle sibling, is already fully released on Hugging Face with weights, datasets, and training recipes, and is available on Amazon Bedrock. Ultra is expected to follow the same open release trajectory.
The Bigger Picture
NVIDIA’s move into frontier model development is no longer a side story. The company that supplies the hardware running virtually every major AI system is now competing at the model layer too — and doing it openly. For enterprises building agentic systems, a model that combines near-frontier intelligence with 300+ tokens-per-second throughput changes the economics of deployment.
The Chinese-led open weights frontier is still ahead on raw intelligence. But NVIDIA’s Nemotron 3 Ultra makes a compelling case that the gap is closeable — and that speed, openness, and hardware-model co-optimization may matter just as much as benchmark scores.