Google’s Gemini 3 has topped most conventional AI benchmarks, but it is also beginning to outperform human trainees on some unusual, and just as critical, ones.
Gemini 3.0 Pro has become the first generalist artificial intelligence model to outperform radiology trainees on a challenging diagnostic benchmark, according to research published today by the Centre for Responsible Autonomous Systems in Healthcare (CRASH Lab) at Ashoka University. The model achieved 51% accuracy on the Radiology’s Last Exam (RadLE v1) benchmark, surpassing the 45% scored by radiology trainees, though it still trails board-certified radiologists, who reached 83%.

The achievement represents a notable leap forward in AI’s medical reasoning capabilities. In evaluations conducted just two months earlier in September 2025, every major frontier model—including GPT-5, Gemini 2.5 Pro, OpenAI’s o3, and Claude Opus 4.1—had performed below trainee level on the same benchmark.
“For the first time in our evaluations, a generalist AI model has crossed radiology-trainee level performance,” the research team noted in their findings released November 20, 2025.
A Benchmark Designed to Challenge
The RadLE v1 benchmark isn’t a typical medical test. Developed by the CRASH Lab at Ashoka University’s Koita Centre for Digital Health, it consists of 50 deliberately difficult cases spanning CT scans, MRIs, and radiographs: the kind of complex, multi-system diagnostic puzzles that routinely challenge even experienced radiologists. The dataset is intentionally “spectrum biased” toward difficult diagnoses, reflecting real-world diagnostic challenges rather than easy wins.
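To make the structure of such a benchmark concrete, here is a minimal sketch of how a RadLE-style case might be represented in code. It is purely illustrative: the field names and the single ground-truth label per case are assumptions for this example, not the lab’s published data format.

```python
from dataclasses import dataclass


@dataclass
class RadiologyCase:
    """One hypothetical benchmark case: an imaging study plus an expert reference diagnosis.

    Illustrative only -- the actual RadLE v1 schema is not described in this article.
    """
    case_id: str
    modality: str            # e.g. "CT", "MRI", or "radiograph"
    image_paths: list[str]   # one or more images making up the study
    ground_truth: str        # reference diagnosis, e.g. "acute appendicitis"


# A "spectrum biased" benchmark is simply a collection of such cases skewed
# toward difficult, multi-system diagnoses; RadLE v1 contains 50 of them.
```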
What makes the Gemini 3.0 Pro results particularly striking isn’t just the raw numbers, but the qualitative improvement in reasoning that the researchers observed. In one acute appendicitis case that stumped earlier models, including GPT-5, the new Gemini model demonstrated markedly more sophisticated diagnostic thinking.
Where GPT-5 had jumped between multiple unrelated diagnoses—initially suggesting intussusception, then pivoting to Crohn’s disease, and finally settling on the wrong answer—Gemini 3.0 Pro followed a structured, radiologist-like approach. It correctly identified the appendix anatomically, described specific imaging features like wall enhancement and periappendiceal fat stranding, systematically ruled out diagnostic mimics, and arrived at the correct diagnosis of acute appendicitis with confidence.
“The reasoning progressed in stable, sequential steps rather than jumping between diagnoses,” the researchers wrote, highlighting how Gemini 3.0 displayed “focused, anatomically grounded reasoning” compared to the uncertain pattern-matching of its predecessors.
The Performance Gap Remains
Despite the milestone, the research team emphasized that significant limitations remain. At 51-57% accuracy (the model scored slightly higher when tested via API with extended reasoning), Gemini 3.0 Pro still answers correctly on, at best, about two-thirds as many cases as board-certified radiologists.
This performance gap has important implications for deployment. “We are still far from readiness for deployment, autonomy or diagnostic replacement,” the researchers cautioned, noting that while progress is accelerating faster than many expected, AI systems aren’t yet approaching the reliability needed for independent clinical decision-making.
The benchmark results also reveal a hierarchy of performance that places the technology in context. Board-certified radiologists led with 83% accuracy, followed by Gemini 3.0 Pro at 51%, radiology trainees at 45%, and earlier models trailing further behind: GPT-5 in thinking mode at 30%, Gemini 2.5 Pro at 29%, OpenAI’s o3 at 23%, and Grok 4 at just 12%. Claude Opus 4.1 registered the lowest score at 1%.
Implications for Medical AI
The advancement comes at a time of intense interest in AI’s potential to augment healthcare delivery, particularly in specialties facing workforce shortages. Radiology has long been viewed as a domain ripe for AI assistance given its reliance on pattern recognition in imaging data.
Google has not yet widely released Gemini 3.0 Pro to the public, though the model is available in preview on Google AI Studio. The CRASH Lab team tested the model both through the web interface and via the API with high-thinking mode enabled, finding consistent results across three separate runs that averaged 57% accuracy.
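As a rough sketch of what the repeated-run evaluation described above might look like, the code below scores a model over the case set and averages accuracy across three runs, reusing the hypothetical RadiologyCase structure from the earlier sketch. The ask_gemini function is a placeholder standing in for the team’s actual API call with high-thinking mode enabled; its name, the prompt it implies, and the naive string-matching scorer are all assumptions for illustration.

```python
from statistics import mean


def ask_gemini(case: RadiologyCase) -> str:
    """Placeholder for a real API call to Gemini 3.0 Pro with high-thinking mode.

    In practice this would send the case's images and a diagnostic prompt to the
    model and return its final diagnosis as free text. The exact prompt and API
    parameters used by the CRASH Lab team are not described here.
    """
    raise NotImplementedError


def is_correct(prediction: str, ground_truth: str) -> bool:
    """Naive string match; the real benchmark relies on expert adjudication."""
    return ground_truth.lower() in prediction.lower()


def run_once(cases: list[RadiologyCase]) -> float:
    """Accuracy of a single pass over all benchmark cases."""
    correct = sum(is_correct(ask_gemini(c), c.ground_truth) for c in cases)
    return correct / len(cases)


def evaluate(cases: list[RadiologyCase], runs: int = 3) -> float:
    """Average accuracy over repeated runs, mirroring the three-run API protocol."""
    return mean(run_once(cases) for _ in range(runs))
```

Under this framing, a reported figure like the 57% average would correspond to evaluate returning roughly 0.57 over the 50-case set.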
The researchers plan to expand their work with RadLE v2, incorporating larger datasets and more granular scoring of both diagnostic accuracy and reasoning quality. Their stated goal remains “transparent benchmarking to independently measure progress of multimodal reasoning capabilities of AI models in radiology.”
For now, Gemini 3.0 Pro’s performance suggests that AI is moving from a tool that might assist with routine cases to one that could match junior clinicians on complex diagnostic challenges, though with the critical caveat that it still requires expert oversight and remains far from replacing human judgment in clinical settings.