Google Gemini continues to dominate benchmarks, including ones that weren't revealed as part of its model release earlier this week.
The company’s Gemini 3 Pro Preview has achieved the highest scores on FrontierMath, a challenging benchmark designed to test AI systems on expert-level mathematics problems that typically take human specialists hours or even days to solve. The benchmark, developed by Epoch AI with support from OpenAI, consists of several hundred unpublished mathematics problems ranging from undergraduate to research-level difficulty.
On the benchmark’s most difficult Tier 4 problems—which represent research-level mathematics—Gemini 3 Pro Preview achieved an accuracy of 18.8%, correctly solving 9 out of 48 problems. This puts it significantly ahead of OpenAI’s best-performing models, with GPT-5.1 (high), GPT-5 (high), and GPT-5 Pro all tied at 12.5% accuracy with 6 correct answers each. Google’s own Gemini 2.5 Deep Think followed with 10.4% accuracy.

Gemini’s lead carried over to the Tier 1-3 problems, which cover undergraduate through early graduate level mathematics. Gemini 3 Pro Preview scored 37.6% accuracy, solving 109 out of 290 problems, ahead of GPT-5 (high) at 32.4% and GPT-5.1 (high) at 31.0%. Gemini 2.5 Deep Think achieved 29.0% on these problems.

Notably, Anthropic’s Claude models appeared further down both leaderboards. Claude Sonnet 4.5 and Claude Opus 4.1, both with extended thinking, each scored 4.2% on Tier 4 problems. On the easier Tier 1-3 problems, neither Claude model appeared in the top 10.
The FrontierMath results underscore Google DeepMind’s strength in mathematical reasoning, an area the company has historically prioritized with projects like AlphaGeometry and AlphaProof. The benchmark’s use of unpublished problems helps ensure that models haven’t simply memorized solutions during training, making it a more rigorous test of genuine mathematical reasoning capabilities.
These results come just days after Google’s Gemini 3 announcement, suggesting the company may have held back some of its most impressive benchmark performances during the initial reveal. Gemini 3 has also topped a new physics benchmark that tested AI models on questions designed by professional researchers. With strong showings on both physics and math benchmarks, Google could be well positioned to pursue scientific breakthroughs that not only advance research but also help automate AI development, creating a flywheel that could dramatically accelerate progress in the coming years.