The leading frontier labs are making major moves in coding and math with their latest releases.
Google DeepMind has introduced the AI co-mathematician, a multi-agent workbench designed to collaborate with human researchers on open-ended mathematical problems. Built on the latest Gemini models, the system scored 48% on FrontierMath Tier 4 — a new high among all AI systems evaluated on what Epoch AI describes as a set of problems “designed to surpass Tier 3 in difficulty, with some potentially remaining unsolved by AI for decades.”
Not A Chatbot — A Research Collaborator
The key distinction the paper’s authors are making is architectural. This isn’t a model you query for an answer. It’s a stateful, asynchronous workspace where a hierarchy of agents — coordinated by a top-level “project coordinator” — work in parallel across multiple research workstreams, managing uncertainty, tracking failed hypotheses, and producing LaTeX write-ups complete with margin annotations and provenance notes.
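The paper describes this architecture only at a high level, but the shape will be familiar from other agent frameworks. Below is a minimal, hypothetical Python sketch of the pattern as described: a coordinator fans a question out to parallel workstreams, each attempt passes through a reviewer, and rejected attempts are kept rather than discarded. Every name here (`coordinator`, `prover`, `reviewer`, `Workspace`) is an illustrative stand-in, not a component of the actual system.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Attempt:
    workstream: str
    claim: str
    accepted: bool
    notes: str = ""


@dataclass
class Workspace:
    accepted: list[Attempt] = field(default_factory=list)
    dead_ends: list[Attempt] = field(default_factory=list)  # failed hypotheses kept as first-class records


async def prover(workstream: str, question: str) -> str:
    # Stand-in for a long-running model call that drafts a proof attempt.
    await asyncio.sleep(0)
    return f"[{workstream}] proof sketch for: {question}"


async def reviewer(claim: str) -> tuple[bool, str]:
    # Stand-in for an independent review pass over the drafted attempt.
    await asyncio.sleep(0)
    return True, "no gaps found (toy reviewer)"


async def run_workstream(ws: Workspace, name: str, question: str) -> None:
    claim = await prover(name, question)
    ok, notes = await reviewer(claim)
    record = Attempt(name, claim, ok, notes)
    (ws.accepted if ok else ws.dead_ends).append(record)


async def coordinator(question: str, n_workstreams: int = 3) -> Workspace:
    # Top-level agent: fans the question out to parallel workstreams and
    # collects both accepted results and dead ends in a shared workspace.
    ws = Workspace()
    await asyncio.gather(
        *(run_workstream(ws, f"branch-{i}", question) for i in range(n_workstreams))
    )
    return ws


if __name__ == "__main__":
    result = asyncio.run(coordinator("toy question about group actions"))
    print(len(result.accepted), "accepted,", len(result.dead_ends), "dead ends")
```

The real system layers persistence, literature tools, code execution, and LaTeX write-ups on top of a loop like this; the sketch only shows the fan-out-and-review skeleton.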
The system is explicitly designed around the messiness of real mathematical research: iterative refinement of questions, literature synthesis, computational experimentation, and the preservation of dead ends as first-class outcomes. As the paper puts it, “knowing what does not work is often as important as knowing what does.”
This philosophy mirrors what Google DeepMind has been building toward with systems like AlphaEvolve, which discovered new algorithms for matrix multiplication and cracked Ramsey number puzzles that had stumped researchers for decades. But where AlphaEvolve was an autonomous search engine, the AI co-mathematician is built for sustained human-AI collaboration.
Real Results From Real Mathematicians
The early case studies are striking. Marc Lackenby, a mathematician at Oxford, used the system to resolve an open problem from the Kourovka Notebook (Problem 21.10 in group theory). A reviewer agent spotted a flaw in the AI’s first proof attempt, and Lackenby realized he knew how to fill the gap; the back-and-forth was the point.
Gergely Bérczi used it to obtain claimed proofs for conjectures about Stirling coefficients for symmetric power representations. Semon Rezchikov posed a technical subproblem in Hamiltonian systems and received a key lemma that “withstood careful checking.” Rezchikov noted that other AI systems had failed on the same prompt, and added: “I would rank, aesthetically, its general style of proofs as the best one of any models I’ve gotten to use.”
All three noted something important: the system works best when the mathematician is already familiar with the domain and knows how to steer it.
The Benchmark Numbers
On FrontierMath Tier 4, the AI co-mathematician correctly solved 23 of 48 non-public problems, a 48% accuracy rate. For context, the underlying Gemini 3.1 Pro base model scored 19% on the same benchmark. The authors attribute the gap to the system’s parallel investigation branches, enforced review cycles, literature access tools, and persistent code execution infrastructure.
The chart accompanying this story shows where the system sits on the current FrontierMath leaderboard: ahead of GPT-5.5 Pro at 39.6% and GPT-5.4 Pro at 37.5%, and well ahead of Claude Opus 4.7 and 4.6 at 22.9%. Three of the problems solved had not been cracked by any previously evaluated system.
An important caveat: unlike other systems evaluated with Epoch AI’s standard agentic harness — which imposes hard token limits — the AI co-mathematician ran with no cap on model calls or tokens generated, meaning its inference cost is meaningfully higher. This makes direct comparisons somewhat uneven, though the performance gap at the top is significant regardless.
Why This Matters Beyond Benchmarks
The authors make a deliberate argument that static problem-solving benchmarks are becoming insufficient measures of AI progress in mathematics. Frontier labs are releasing models faster than ever, and raw problem-solving ability — where systems already perform at or above expert human level — is no longer the bottleneck. The harder challenge is orchestration: managing multi-week research arcs, synthesizing niche literature, appropriately disclosing uncertainty, and knowing when to ask a human for help.
The paper explicitly compares this to what coding agents like Claude Code and Google Antigravity have done for software development — providing the scaffolding that lets AI work autonomously over long horizons while staying steerable. The authors argue that mathematics has lacked an equivalent, and the AI co-mathematician is an attempt to provide one.
OpenAI President Greg Brockman has predicted AI could solve a Millennium Prize Problem within two to five years. DeepMind CEO Demis Hassabis has argued that frontier labs with strong math and coding tools are starting to pull away from the rest, precisely because those tools compound. The AI co-mathematician is a direct expression of that thesis.
The Risks The Paper Doesn’t Shy Away From
The authors are candid about failure modes. The review cycle between agents can converge on arguments that remain subtly flawed — what they call “reviewer-pleasing bias” — in which errors become harder to detect rather than being corrected. It can also spiral in the opposite direction, with agents locked in endless disagreement. Early users have learned to spot when a workstream has entered a “death spiral” and down-weight its outputs accordingly.
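There is no published algorithm for catching these spirals; users do it by eye. As a rough illustration of the kind of heuristic involved, here is a hypothetical Python sketch that classifies a workstream from its sequence of reviewer verdicts. The function name, labels, and round budget are assumptions made for illustration, not details from the paper.

```python
def assess_workstream(verdicts: list[bool], max_rounds: int = 6) -> str:
    """Classify a prover/reviewer exchange from its per-round verdicts
    (True = reviewer accepted, False = reviewer rejected)."""
    if verdicts and verdicts[-1]:
        # Accepted on the very first pass: exactly where reviewer-pleasing
        # bias is hardest to rule out, so flag it for human checking.
        return "suspiciously_smooth" if len(verdicts) == 1 else "converged"
    if len(verdicts) >= max_rounds:
        # Still being rejected after the round budget: a likely death spiral,
        # so the workstream's outputs should be down-weighted.
        return "death_spiral"
    return "in_progress"


assert assess_workstream([False, False, True]) == "converged"
assert assess_workstream([False] * 6) == "death_spiral"
assert assess_workstream([True]) == "suspiciously_smooth"
```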
There are also broader concerns about what happens to mathematical publishing when AI can generate a 20-page proof attempt in minutes while human peer reviewers take days. The paper flags this as a systemic risk — not just noise in the literature, but a structural strain on volunteer-driven peer review that the community will need to address deliberately.
The system is currently in limited release. According to Pushmeet Kohli of Google DeepMind, the goal is to develop future products that grant much broader access to this paradigm.