GPT 5.4 (xhigh) Scores 95% On 2026 US Math Olympiad, Gemini 3.1 Pro Second With 74%

AI models are continuing to make rapid strides in math and science.

GPT-5.4, OpenAI’s current flagship, has scored 95.24% on the 2026 USA Math Olympiad (USAMO), according to a new evaluation by MathArena. Gemini 3.1 Pro finished second at 74.4%, followed by Claude Opus 4.6 at 47%, and open-source model Step-3.5-Flash at 44.6%.

The jump is striking in context. A year ago, on USAMO 2025, the same class of models produced solutions riddled with circular arguments, unsupported guesses, and incoherent structure. In 2026, those failure modes are largely gone.

What Changed

The improvement isn’t just in scores — it’s in the nature of the errors. In 2025, models frequently guessed rather than proved. In 2026, the remaining mistakes are subtler: open models occasionally slip back into chain-of-thought reasoning mid-proof without completing the argument, and Opus 4.6 ran out of its 128,000-token budget on 4 of 24 attempts, three of them on a single problem (Problem 2).

GPT-5.4’s only notable error was on Problem 5, where one of its runs incorrectly argued the statement was false and produced an invalid counterexample — a surprising stumble for an otherwise dominant performance.

The Cost Gap

The benchmark also highlights a significant cost disparity. GPT-5.4 (xhigh) cost $5.15 per run. Gemini 3.1 Pro, which has established itself as cost-efficient at the frontier, cost just $2.20. Claude Opus 4.6 was the most expensive at $13.23 — nearly 2.6x the cost of GPT-5.4 — for a score less than half as high. Step-3.5-Flash, the strongest open model, ran at just $0.22.
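To make the efficiency gap concrete, the short sketch below computes dollars spent per percentage point scored, using only the per-run costs and scores reported above; the aggregation is illustrative arithmetic, not a metric MathArena publishes.

```python
# Scores and per-run costs as reported in the MathArena evaluation above.
results = {
    "GPT-5.4 (xhigh)": {"score_pct": 95.24, "cost_usd": 5.15},
    "Gemini 3.1 Pro":  {"score_pct": 74.4,  "cost_usd": 2.20},
    "Claude Opus 4.6": {"score_pct": 47.0,  "cost_usd": 13.23},
    "Step-3.5-Flash":  {"score_pct": 44.6,  "cost_usd": 0.22},
}

for model, r in results.items():
    # Dollars per percentage point scored: lower means more cost-efficient.
    cost_per_point = r["cost_usd"] / r["score_pct"]
    print(f"{model:16s} ${r['cost_usd']:>6.2f}/run  "
          f"{r['score_pct']:5.1f}%  ${cost_per_point:.3f}/point")
```

By this rough measure, Step-3.5-Flash and Gemini 3.1 Pro deliver the most score per dollar, while Opus 4.6 pays the most for the least.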

Grading at Scale

MathArena built a semi-automated grading pipeline for the evaluation, using a jury of three models — GPT-5.4, Gemini 3.1 Pro, and Opus 4.6 — rather than a single judge. The jury approach was designed to counter two documented problems with LLM-based grading: self-bias (models scoring their own outputs more generously) and formatting bias (rewarding verbose or polished-looking solutions).
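The sketch below shows one way a jury-style grader like this could be wired up. It is a hypothetical illustration, not MathArena's actual pipeline: `grade_with_model` stands in for whatever API call each judge model uses, and the median aggregation rule is an assumption chosen because it limits how far any single biased juror can pull the final score.

```python
from statistics import median

# Hypothetical jury of judge models, mirroring the three named in the article.
JURY = ["gpt-5.4", "gemini-3.1-pro", "claude-opus-4.6"]

def grade_with_model(judge: str, problem: str, solution: str) -> float:
    """Placeholder: ask `judge` to score `solution` on the 0-7 USAMO scale."""
    raise NotImplementedError("wire this to the judge model's API")

def jury_score(problem: str, solution: str) -> float:
    # Collect one score per juror, then take the median so that a single
    # judge's self-bias or formatting bias cannot dominate the result.
    scores = [grade_with_model(j, problem, solution) for j in JURY]
    return median(scores)
```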

The pipeline’s accuracy held up well under human review: final scores shifted by at most two points, and only for three solutions. Notably, GPT-5.4 was the most reliable judge, while Gemini 3.1 Pro and Opus 4.6 both significantly inflated scores for their own outputs.

What It Means

The USAMO result adds to a growing body of evidence that frontier AI is closing in on expert-level mathematical reasoning. GPT-5.4’s benchmark dominance has been consistent across categories since its release, and the USAMO score — near-saturation on one of the most rigorous high school math competitions in the world — underscores how rapidly the ceiling has moved. If this pace of progress holds, the scientific breakthroughs that leading AI researchers have been promising may not be far behind.
