Google appears to be back at the top of the AI model pile with the release of its latest model — Gemini 3.1 Pro.
Gemini 3.1 Pro has topped the charts across the majority of industry-standard benchmarks, besting rivals including Anthropic’s Claude Opus 4.6 and OpenAI’s GPT-5.2. The results signal a significant leap forward for Google DeepMind and suggest the intensely competitive AI race is far from settled.
“Gemini 3.1 Pro is here,” said Google CEO Sundar Pichai. “Hitting 77.1% on ARC-AGI-2, it’s a step forward in core reasoning (more than 2x 3 Pro). With a more capable baseline, it’s great for super complex tasks like visualizing difficult concepts, synthesizing data into a single view, or bringing creative projects to life,” he added.

Gemini 3.1 Pro Benchmarks: Dominant Performance
Gemini 3.1 Pro posted leading scores on 13 of the 16 benchmarks evaluated by Google. The model scored 94.3% on GPQA Diamond, a test of expert-level scientific knowledge, outpacing Claude Opus 4.6 (91.3%) and GPT-5.2 (92.4%). On the ARC-AGI-2 abstract reasoning puzzles benchmark — widely considered one of the hardest tests for AI systems — Gemini 3.1 Pro achieved 77.1%, dwarfing the next-closest competitor, Opus 4.6, which scored 68.8%.
In agentic and coding tasks, traditionally an area of strength for rivals, Google's new model also impressed. Gemini 3.1 Pro scored 80.6% on SWE-Bench Verified (agentic coding) and 68.5% on Terminal-Bench 2.0, both first-place finishes. On the APEX-Agents benchmark for long-horizon professional tasks, it posted a commanding 33.5%, nearly double Gemini 3 Pro's 18.4% and well ahead of GPT-5.2's 23.0% and Opus 4.6's 29.8%.
Where Rivals Held Their Ground
The results were not a clean sweep for Google. Claude Sonnet 4.6, in its Thinking (Max) configuration, tied Gemini 3.1 Pro on MRCR v2 long-context performance (both scoring 84.9% on the 128k average test) and led the field on GDPval-AA Elo expert tasks with a score of 1633, compared to Gemini 3.1 Pro's 1317. Claude Opus 4.6 also edged out Gemini 3.1 Pro on Humanity's Last Exam (with tools), scoring 53.1% versus 51.4%.
OpenAI’s GPT-5.3-Codex showed particular strength in terminal coding tasks, leading the Terminal-Bench 2.0 “other best self-reported harness” category with 77.3%, and topped SWE-Bench Pro (Public) with 56.8%, ahead of Gemini 3.1 Pro’s 54.2%. However, GPT-5.3-Codex scores were only reported for a small subset of benchmarks, making direct overall comparison difficult.
Gemini 3.1 Pro Rollout
Starting today, Gemini 3.1 Pro is rolling out across Google’s ecosystem in a staged release. Developers can access the model in preview via the Gemini API in Google AI Studio, Gemini CLI, Google’s agentic development platform Google Antigravity, and Android Studio. Enterprise customers will find it available through Vertex AI and Gemini Enterprise, while consumers can access it directly via the Gemini app and NotebookLM.
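For developers, access goes through the Gemini API's `generateContent` endpoint. As a minimal sketch, the request body might be built as below; note that the model ID `gemini-3.1-pro-preview` is an assumption for illustration, so check Google AI Studio for the actual preview identifier.

```python
import json

# Assumed preview model ID -- confirm the real identifier in Google AI Studio.
MODEL = "gemini-3.1-pro-preview"
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:generateContent"
)

def build_request(prompt: str) -> dict:
    """Build a generateContent request body for the Gemini API."""
    return {"contents": [{"parts": [{"text": prompt}]}]}

body = build_request("Explain long-horizon agentic tasks in one paragraph.")
print(json.dumps(body))
```

In practice the body would be POSTed to `ENDPOINT` with an API key from Google AI Studio; the request shape is the same whether called directly over REST or through Google's client SDKs.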
Implications for the AI Industry
The release of Gemini 3.1 Pro is likely to intensify competition among the big three AI labs — Google DeepMind, Anthropic, and OpenAI — all of which have released powerful flagship models within the past year. Benchmark leadership has become a key marketing differentiator in enterprise AI sales, and Google will no doubt be highlighting these results to customers.
Google's strong showing on agentic benchmarks, including MCP Atlas (69.2%), BrowseComp (85.9%), and τ²-bench Telecom (99.3%), is particularly notable as the industry shifts focus from raw question-answering ability toward AI agents capable of executing complex, multi-step workflows in the real world.
Gemini 3.1 Pro shows that Google is continuing to innovate in the AI models space. Google first caught the imagination of the developer community with Gemini 2.5 Pro last year, and followed it up with Gemini 3 Pro, which crushed the competition on most benchmarks. But after that model's November release, Anthropic shipped Opus 4.5 and OpenAI shipped GPT-5.2, both of which outperformed it. Just three months later, Google has released a model that again puts it on top of the AI pile. Given how the past year has gone, OpenAI and Anthropic will likely answer with better models before long, but for now, Google seems to have the most powerful AI model around with Gemini 3.1 Pro.