Models are being built for all manner of specific use cases, but they often fail to beat frontier models from the top labs.
A new peer-reviewed paper published in Nature Medicine puts this dynamic on record for healthcare AI. Researchers led by Krithik Vishwanath ran a three-stage evaluation comparing two clinical AI tools — OpenEvidence and UpToDate Expert AI — against Gemini 3.1 Pro, GPT-5.2, and Claude Opus 4.6. The results are fairly lopsided.

Across all three evaluation stages, the frontier general-purpose models finished ahead of the clinical tools. The first stage tested 500 MedQA questions on medical knowledge: Gemini 3.1 Pro led at 97.4%, GPT-5.2 came in at 94.2%, and Claude Opus 4.6 scored 90.2%. The clinical tools, OpenEvidence and UpToDate Expert AI, scored 89.6% and 88.4% respectively — trailing all three frontier models, though not by catastrophic margins.
The gap widened in the second stage, which used 500 HealthBench items designed to measure alignment with clinicians. GPT-5.2 topped that evaluation at 88%, Gemini 3.1 Pro scored 79.3%, and Claude Opus 4.6 scored 77%. OpenEvidence and UpToDate Expert AI scored 62.6% and 61.3%. That’s a meaningful distance from the frontier model scores: roughly 14 to 27 percentage points behind, depending on which model and which tool you’re comparing.
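The size of that gap can be checked directly from the HealthBench scores quoted above. The following is a minimal sketch, not anything from the paper itself; the `gaps` helper and the variable names are illustrative assumptions.

```python
# HealthBench scores (percent) as quoted in this article.
frontier = {"GPT-5.2": 88.0, "Gemini 3.1 Pro": 79.3, "Claude Opus 4.6": 77.0}
clinical = {"OpenEvidence": 62.6, "UpToDate Expert AI": 61.3}


def gaps(frontier_scores, clinical_scores):
    """Return {(frontier_model, clinical_tool): gap in percentage points}.

    Hypothetical helper for illustration; gaps are rounded to one decimal
    place to avoid floating-point noise in the subtraction.
    """
    return {
        (f_name, c_name): round(f_score - c_score, 1)
        for f_name, f_score in frontier_scores.items()
        for c_name, c_score in clinical_scores.items()
    }


healthbench_gaps = gaps(frontier, clinical)
smallest = min(healthbench_gaps.values())  # closest frontier/clinical pair
largest = max(healthbench_gaps.values())   # most distant pair

print(f"Gap range: {smallest} to {largest} percentage points")
```

The smallest gap is between Claude Opus 4.6 and OpenEvidence, and the largest is between GPT-5.2 and UpToDate Expert AI.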
The third evaluation is arguably the most useful one. The researchers built what they called a Real Clinical Queries (RCQ) benchmark: 100 de-identified queries pulled from actual physician interactions with a general-purpose language model in a live clinical setting. Twelve US clinicians reviewed the outputs in a randomized, blinded setup, producing 1,800 model-question annotations. On aggregate RCQ ratings (on a 1–4 scale), Gemini 3.1 Pro, GPT-5.2, and Claude Opus 4.6 all clustered near the top at 3.62, 3.54, and 3.52. OpenEvidence, UpToDate Expert AI, and Google AI Overview scored 3.24, 3.17, and 3.27. The specialized clinical tools landed in the same performance tier as Google Search’s AI Overview, and in fact slightly below it, which is not where a product charging for clinical-grade AI wants to be positioned.
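The RCQ figures also allow a quick sanity check on the annotation count. This sketch assumes that all six systems answered every query and that annotations were spread evenly across model-question pairs; the article does not state either, so treat the per-pair figure as an inference. All names in the code are illustrative.

```python
# Figures as quoted in this article.
NUM_QUERIES = 100
TOTAL_ANNOTATIONS = 1800

# Aggregate RCQ ratings on the 1-4 scale.
rcq_ratings = {
    "Gemini 3.1 Pro": 3.62,
    "GPT-5.2": 3.54,
    "Claude Opus 4.6": 3.52,
    "OpenEvidence": 3.24,
    "UpToDate Expert AI": 3.17,
    "Google AI Overview": 3.27,
}

# Assumption: every system answered every query.
pairs = NUM_QUERIES * len(rcq_ratings)

# Assumption: annotations were distributed evenly across pairs.
ratings_per_pair = TOTAL_ANNOTATIONS / pairs

# Systems ordered from highest to lowest aggregate rating.
ranked = sorted(rcq_ratings, key=rcq_ratings.get, reverse=True)

print(f"{pairs} model-question pairs, {ratings_per_pair:.0f} ratings per pair")
for name in ranked:
    print(f"{rcq_ratings[name]:.2f}  {name}")
```

Under those assumptions, each answer would have been rated by three clinicians, and Google AI Overview ranks fourth, ahead of both clinical tools.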
What makes this study more credible than the average benchmark paper is its methodology. The RCQ queries came from real clinical practice, not synthetic datasets, and evaluation was done by practicing clinicians under blinded conditions. That’s a harder test to game than a multiple-choice medical knowledge exam, and the clinical tools didn’t do better when conditions got more realistic.
The finding points to something that’s been visible in the broader benchmark landscape for a while: frontier models trained on vast general corpora have developed capabilities that domain-specific tools, built on the same underlying models but fine-tuned and constrained for a vertical, haven’t matched. OpenEvidence and UpToDate Expert AI are both LLM-based products — they aren’t fundamentally different architectures. The assumption that wrapping a frontier model in clinical guardrails and medical knowledge bases would yield better clinical performance than the base model turns out to be harder to validate than it looks.
The paper’s authors are direct about the implications. Specialized clinical AI tools are entering medical practice without sufficient independent evaluation, and this study is a case for requiring that evaluation before deployment. A clinician relying on UpToDate Expert AI or OpenEvidence in place of a frontier general model is, according to this data, getting a worse answer on average.
That’s a significant finding for the healthcare AI market, which has attracted substantial investment on the premise that clinical specialization adds value over general-purpose alternatives. The AI benchmark landscape has repeatedly shown that vertical specialization doesn’t automatically confer performance advantages — and now there’s a Nature Medicine paper making the same argument with clinical data.