Gemini 3 continues to dominate the competition, even on some brand-new benchmarks.
Google’s latest Gemini 3 model has claimed the top spot on CritPt (Complex Research using Integrated Thinking – Physics Test), a newly released physics research benchmark so challenging that even the leading AI system achieves only 9.1% accuracy.
The benchmark, developed by more than 60 researchers from over 30 institutions, including Argonne National Laboratory and the University of Illinois Urbana-Champaign, is one of the most difficult evaluations yet created for language models. Several prominent AI systems scored 0%, failing to solve a single problem, among them two of Anthropic’s Claude models (including Claude Haiku 4.5), Meta’s Llama 4, and an Nvidia model.
Google’s Gemini 3 Pro Preview nearly doubled the score of its closest competitor, OpenAI’s GPT-5.1, which trailed at 4.9%. The gap widens further down the leaderboard: Grok 4.1 Fast scored 2.9%, while Gemini 2.5 Pro and Kimi K2 Thinking both achieved 2.6%.

The benchmark tests models on graduate-level physics research problems spanning 11 subdomains, from condensed matter and quantum physics to astrophysics and biophysics. Each of the 70 challenges in the test set is designed to represent a standalone project suitable for a capable junior PhD student, requiring deep understanding and reasoning on frontier physics problems that don’t appear in publicly available materials.
What makes CritPt particularly noteworthy is its real-world applicability. Questions and answers were written and verified by experts active in their subfields, including postdoctoral researchers and physics professors. The evaluation doesn’t allow tool use, forcing models to rely purely on their reasoning capabilities.
The benchmark developers, some of whom previously worked on leading evaluations such as SciCode and SWE-bench, designed CritPt to reflect genuine research-assistant capabilities. Many models failed to solve even a single problem despite being given five attempts, underscoring the substantial gap between current AI systems and true research-level physics expertise. And while Gemini 3 leads the pack at 9.1%, as Google DeepMind CEO Demis Hassabis recently noted, current AI still appears to be quite some way from true AGI.