Anthropic’s Models Solved 30% Of Bioinformatics Problems That Stumped Human Scientists On New BioMysteryBench Eval

Apart from math and code, AI models are now making their presence felt in the sciences as well.

Anthropic has published results from BioMysteryBench, an internally developed bioinformatics benchmark comprising 99 expert-authored questions drawn from real-world datasets. The results show that its latest models not only match trained scientists on most tasks; they also solve a meaningful slice of problems that panels of five domain experts could not answer at all.

What BioMysteryBench Tests

The benchmark tasks Claude with analyzing real biological data — whole genome sequencing, single-cell RNA-seq, ChIP-seq, metagenomics, proteomics, metabolomics — and answering specific questions about it. Examples include identifying which human organ a cell-type dataset was derived from, determining which gene was knocked out in experimental samples, and inferring parentage from whole-genome sequences.

Crucially, every question is grounded in a verifiable, objective answer — not a scientist’s subjective interpretation. A question like “What viral species is the patient infected with, based on RNA-seq data?” has a definitive answer validated by a PCR assay. This makes evaluation clean, even when the analysis required to get there is anything but.
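
Anthropic hasn’t published its grading harness, but with a single verified ground truth per question, scoring can reduce to normalized answer matching. Here is a minimal Python sketch of that idea; the function names and normalization rules are illustrative assumptions, not the benchmark’s actual code:

```python
# Minimal sketch of ground-truth grading, assuming each benchmark item
# carries one experimentally validated answer. All names here are
# hypothetical, not Anthropic's actual harness.
def normalize(answer: str) -> str:
    """Lowercase and trim whitespace and trailing periods so that
    'Influenza A virus.' and 'influenza A virus' compare equal."""
    return answer.strip().strip(".").lower()

def grade(model_answer: str, ground_truth: str) -> bool:
    """Exact match after normalization; the analysis path is not graded."""
    return normalize(model_answer) == normalize(ground_truth)

print(grade("Influenza A virus.", "influenza A virus"))  # True
```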

Claude runs inside a container with canonical bioinformatics tools, the ability to install additional packages via pip and conda, and access to databases like NCBI and Ensembl. It is graded on its final answer, not on which analytical path it took to get there.
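
For a concrete sense of what that database access looks like, here is a short Python sketch that queries NCBI through Biopython’s Entrez wrapper. The package choice and the accession number are illustrative examples, not confirmed details of the benchmark container:

```python
# Sketch of an NCBI lookup of the kind such a container enables,
# using Biopython (pip install biopython). The accession below is
# an arbitrary public example, not taken from the benchmark.
from Bio import Entrez

Entrez.email = "researcher@example.org"  # NCBI asks for a contact email

# Fetch a nucleotide record as FASTA text and print the header region
handle = Entrez.efetch(db="nucleotide", id="NC_045512.2",
                       rettype="fasta", retmode="text")
print(handle.read()[:200])
handle.close()
```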

The Results

Of the 99 questions, 76 were solved by at least one human expert and classified as “human-solvable.” On this set, Claude Sonnet 4.6 and above performed on par with trained bioinformaticians, solving the majority of questions reliably across multiple attempts.

The more striking result comes from the 23 “human-difficult” problems, the questions that a panel of domain experts could not answer. Claude Mythos Preview, Anthropic’s unreleased frontier model, solved 30% of these; Claude Opus 4.6 and Claude Sonnet 4.6 solved smaller but still meaningful fractions.

How Claude Did It

Anthropic identified two strategies that gave Claude an edge on human-difficult problems.

The first is sheer breadth of knowledge. Claude’s training on hundreds of thousands of papers allows it to synthesize information across mechanisms, ontologies, and meta-analyses on the fly — tasks that would require a human to manually stitch together multiple databases or run a meta-analysis of their own.

The second is a methodological habit that proved particularly valuable on uncertain problems: when Claude wasn’t confident, it tried multiple approaches and defaulted to the answer that several methods converged on. That triangulation strategy helped it break through on problems where a single analytical path would have dead-ended.
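
Anthropic’s write-up describes the habit rather than the code, but the strategy amounts to majority voting across independent pipelines. A hedged Python sketch, with hypothetical placeholder methods:

```python
# Illustrative sketch of the triangulation strategy: run several
# independent analysis methods and keep the answer most of them
# converge on. The method names below are hypothetical placeholders.
from collections import Counter
from typing import Callable

def triangulate(methods: list[Callable[[], str]]) -> tuple[str, int]:
    """Run each method and return the modal answer with its vote count."""
    votes: Counter[str] = Counter()
    for method in methods:
        try:
            votes[method()] += 1
        except Exception:
            continue  # a failed pipeline simply casts no vote
    return votes.most_common(1)[0]

# Usage: three hypothetical viral-ID pipelines converging on one call
answer, support = triangulate([
    lambda: "influenza A virus",  # e.g. k-mer classification
    lambda: "influenza A virus",  # e.g. alignment to reference genomes
    lambda: "influenza B virus",  # e.g. de novo assembly plus BLAST
])
print(answer, support)  # influenza A virus 2
```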

Claude also occasionally took paths that diverged entirely from human methods. On some tasks, human experts used established algorithms or reference databases, while Claude recognized structural or sequence-level patterns directly from the data — a form of intuition that’s historically been difficult to encode in traditional bioinformatics software.

Reliability Is the More Interesting Story

A secondary analysis, conducted by Claude Mythos Preview itself on its predecessors’ performance data, surfaced a nuance that raw accuracy numbers obscure.

On human-solvable problems, Claude Opus 4.6 showed a strongly bimodal pattern: of the problems it got right at all, 86% were solved in at least four of five attempts. It either knew the answer or it didn’t. On human-difficult problems, that reliability collapsed: only 44% of its correct answers came from problems it solved four or more times out of five, while another 44% were one- or two-shot wins, suggesting lucky reasoning paths rather than a reproducible method.
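
That split is straightforward to compute from per-problem attempt logs. A toy Python sketch of the breakdown, using invented data rather than Anthropic’s actual results:

```python
# Toy sketch of the reliability split described above: given five
# graded attempts per problem, divide the solved problems into
# reliable solves (>= 4 of 5 runs) and lucky ones (<= 2 of 5 runs).
def reliability_split(attempts: dict[str, list[bool]]) -> dict[str, float]:
    solved = {p: sum(runs) for p, runs in attempts.items() if any(runs)}
    n = len(solved)
    return {
        "reliable (>=4/5)": sum(c >= 4 for c in solved.values()) / n,
        "lucky (<=2/5)": sum(c <= 2 for c in solved.values()) / n,
    }

# Invented data: problem A is a reliable solve, problem B a lucky one
print(reliability_split({
    "A": [True, True, True, True, False],
    "B": [False, True, False, False, False],
    "C": [False] * 5,  # never solved, excluded from the split
}))
# {'reliable (>=4/5)': 0.5, 'lucky (<=2/5)': 0.5}
```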

Claude Mythos Preview partially addresses this, pushing the reliable-solve rate back up to 94% on human-solvable tasks, but the same brittle pattern holds on human-difficult problems across every model tested. The headline accuracy gap between easy and hard tasks is real, but the reliability gap underneath it says more about where the actual capability frontier sits.

External Validation

Anthropic’s findings are echoed by an independent effort from Genentech and Roche, which released CompBioBench around the same time. Their benchmark, built on 100 computational biology tasks using synthetic and augmented real data, found that Claude Opus 4.6 reached 81% overall and 69% on the hardest questions, reinforcing the picture of frontier models as genuinely useful research collaborators rather than mere code generators.

The Bigger Picture

AI models have been stacking up wins on knowledge-based science benchmarks for a while. GPQA Diamond — which tests graduate-level reasoning in biology, physics, and chemistry without any tools — is now approaching saturation at the top, with Claude Opus 4.7 scoring 94.2% and Gemini 3.1 Pro at 94.3%. Humanity’s Last Exam, designed to be AI-resistant, is being steadily chipped away.

BioMysteryBench is a different kind of signal. It’s not testing what a model knows in isolation — it’s testing whether a model can take raw, messy biological data, figure out what to do with it, and arrive at the right scientific conclusion. The fact that Claude is now solving problems that trained bioinformaticians cannot is a meaningful step toward AI as a genuine research collaborator, not just a capable assistant.