Video App Zoom Tops Humanity’s Last Exam Benchmark in Surprise Result, Beating Gemini 3 Pro

Topping AI benchmarks is usually thought to be the preserve of the top four AI frontier labs, but a surprising new name has emerged atop the Humanity’s Last Exam leaderboard.

Zoom, the video conferencing platform, has announced it achieved a state-of-the-art score of 48.1% on the Humanity’s Last Exam (HLE) full-set benchmark, surpassing Google’s Gemini 3 Pro with tools, which previously held the top position at 45.8%. The 2.3 percentage point improvement marks a significant achievement for a company better known for video calls than AI research.

Understanding Humanity’s Last Exam

The Humanity’s Last Exam benchmark represents one of the most rigorous tests in artificial intelligence, designed to evaluate models across diverse domains requiring expert-level knowledge and sophisticated reasoning. Unlike simpler benchmarks that may rely on pattern matching, HLE demands genuine understanding, multi-step reasoning, and the ability to synthesize information across complex, interconnected problems.

Developed by subject-matter experts globally, the benchmark has become a crucial metric for measuring AI’s progress toward human-level performance on challenging intellectual tasks. The relatively low scores—even the top performer barely crosses 48%—underscore just how difficult these problems are.

Zoom’s Federated AI Approach

The company’s success stems from what it calls a “federated AI approach,” which combines multiple language models rather than relying on a single system. According to Xuedong Huang, Zoom’s Chief Technology Officer and a former Technical Fellow at Microsoft, this strategy leverages the unique strengths of different models while introducing novel architectural innovations.

At the core of Zoom’s system is an “explore-verify-federate” strategy, an agentic workflow that balances exploratory reasoning with rigorous verification. Rather than generating extensive reasoning traces, the method strategically identifies and pursues the most informative reasoning paths.

The federated framework orchestrates diverse models to generate, challenge, and refine reasoning through what Zoom describes as dialectical collaboration. This enables each model to contribute its distinctive strengths, while a comprehensive verification phase integrates the complete context to determine the most accurate solution.
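Zoom has not published implementation details, but the pattern it describes resembles multi-model generation with cross-verification. The sketch below is a minimal, hypothetical illustration of an explore-verify-federate loop; the `Model` wrapper, the `solve` and `judge` callables, and the scoring rule are assumptions made for illustration, not Zoom’s actual code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    # Thin wrapper around an LLM call; both callables are stubs here.
    name: str
    solve: Callable[[str], str]          # question -> proposed answer
    judge: Callable[[str, str], float]   # (question, answer) -> support in [0, 1]

def explore_verify_federate(question: str, models: list[Model]) -> str:
    # Explore: each model proposes a candidate answer.
    candidates = [(m, m.solve(question)) for m in models]

    # Verify: every *other* model scores each candidate, so models
    # challenge one another rather than grading their own work.
    def support(author: Model, answer: str) -> float:
        critics = [m for m in models if m.name != author.name]
        return sum(c.judge(question, answer) for c in critics) / len(critics)

    # Federate: keep the answer with the strongest cross-model support.
    return max(candidates, key=lambda pair: support(*pair))[1]

# Toy usage with stub models standing in for real LLM calls.
if __name__ == "__main__":
    def stub(ans: str) -> Model:
        return Model(
            name=ans,
            solve=lambda q, a=ans: a,
            judge=lambda q, a: 1.0 if a == "42" else 0.3,
        )
    print(explore_verify_federate("What is 6 x 7?", [stub("42"), stub("41")]))  # -> 42
```

Letting each candidate be scored only by the other models is one way to realize the “challenge and refine” dynamic Zoom describes; a real system would replace the stubs with actual LLM calls and likely iterate the loop rather than stopping after a single pass.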

Zoom’s proprietary “Z-scorer” system selects or refines outputs from various models—including the company’s own small language models alongside advanced open-source and closed-source options—for optimal performance.
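The Z-scorer itself is proprietary and undisclosed. One plausible reading, sketched below, is a ranking layer that scores every model’s output under a common metric and either selects a confident winner or routes the best draft back for refinement; the `z_score` function, the 0.8 threshold, and the `refine` pass are purely illustrative assumptions.

```python
def z_score(question: str, answer: str) -> float:
    """Stand-in quality estimate. A production system would more
    plausibly use a trained verifier or reward model here."""
    return 1.0 if answer.strip() else 0.0

def select_or_refine(question: str, outputs: dict[str, str],
                     threshold: float = 0.8) -> str:
    # Score each model's output, from small in-house models to larger
    # open- and closed-source ones, under one comparable metric.
    scored = {name: z_score(question, ans) for name, ans in outputs.items()}
    best = max(scored, key=scored.get)
    if scored[best] >= threshold:
        return outputs[best]                  # select: confident winner
    return refine(question, outputs[best])    # refine: send the best draft back

def refine(question: str, draft: str) -> str:
    """Placeholder refinement pass (e.g., ask a model to revise the draft)."""
    return draft
```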

Scaffolding Gains

Scaffolding third-party models with internal models, workflows, and tweaks appears to be an emerging technique that is delivering impressive benchmark results. Earlier this week, Poetiq demonstrated state-of-the-art performance on the ARC-AGI-2 benchmark using third-party LLMs, delivering better results than Gemini 3 Pro at half the price.

Real-World Applications

Huang emphasized that the breakthrough has immediate practical implications for Zoom users, including more accurate meeting summaries and action item extraction, enhanced cross-platform information retrieval and synthesis, and improved handling of complex, multi-step business processes through agentic workflow automation.

The company positioned its achievement as part of a collaborative rather than competitive vision for AI development. “The future of AI lies not in isolation, but in intelligent orchestration,” Huang wrote in announcing the results.

While the top AI labs continue to push the boundaries with their frontier models, Zoom’s success demonstrates that innovative architectural approaches—combining multiple models in sophisticated ways—can potentially rival or exceed the performance of any single system, even those from the industry’s most well-resourced research organizations.
