Google’s Gemini 3 was already topping most benchmarks, and now it is also topping blind tests by users.
Google’s Gemini 3 Pro model has topped every major leaderboard on LMArena, a platform where users are shown responses from two anonymized AI models and vote for the one they prefer. LMArena aggregates those blind votes into a ranking of the models it tests, and it says Gemini 3 Pro is now the top model across all of its major leaderboards. Gemini 3 has also become the top model on Artificial Analysis’s Intelligence Index, which measures model capabilities across 10 evaluations spanning math, science, and coding.
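Leaderboards of this kind are typically built by fitting an Elo- or Bradley-Terry-style rating model to the accumulated pairwise votes. The sketch below shows the classic online Elo update, a simple approximation of that idea; the model names, votes, and K-factor are hypothetical placeholders that illustrate the mechanism, not LMArena’s actual data or parameters.

```python
# Hedged sketch: turning anonymized pairwise votes into a leaderboard via
# an online Elo update. Names, votes, and K are illustrative placeholders.

K = 4.0  # update step size; kept small because each model sees many votes

def expected_win_prob(r_a: float, r_b: float) -> float:
    """Probability that A beats B implied by the current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the outcome of one blind vote."""
    p_win = expected_win_prob(ratings[winner], ratings[loser])
    delta = K * (1.0 - p_win)  # upsets move ratings more than expected wins
    ratings[winner] += delta
    ratings[loser] -= delta

# Every model starts from the same baseline; votes then separate them.
ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
for winner, loser in [("model-a", "model-b"), ("model-a", "model-c"),
                      ("model-b", "model-c")]:
    record_vote(ratings, winner, loser)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Because anonymization hides which lab produced which response, a model can only climb such a ranking by consistently winning blind comparisons.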
Dominant Performance Across Core Categories
The announcement marks a significant achievement for Google DeepMind, with Gemini 3 Pro securing first place across LMArena’s primary evaluation tracks. The model scored 1501 on the main text Arena leaderboard, ahead of xAI’s grok-4.1-thinking (1484) and Anthropic’s Claude Sonnet 4.5 (1449).

On the WebDev leaderboard, which tests coding capabilities, Gemini 3 Pro scored 1487, outperforming GPT-5-medium and Claude Opus 4.1.

Gemini 3 Pro ranks third on the Arena Expert leaderboard with a score of 1507, trailing Claude Sonnet 4.5 thinking (1510) and grok-4.1-thinking (1509) by just a few points. Within the leaderboard’s statistical uncertainty margins, however, the model is effectively tied for first place.
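Scoreboards of this kind usually publish an uncertainty margin alongside each rating, and when two models’ intervals overlap, the ordering between them is not statistically meaningful. Below is a minimal sketch of that “effective tie” logic, using the scores quoted above with hypothetical plus-or-minus margins, not LMArena’s published intervals.

```python
# Why a 2-3 point gap can be a statistical tie: if two models' score
# intervals overlap, neither can be declared ahead. Scores are the
# leaderboard figures quoted above; the +/- margins are hypothetical.

scores = {
    "claude-sonnet-4.5-thinking": (1510, 8),  # (score, +/- margin)
    "grok-4.1-thinking": (1509, 7),
    "gemini-3-pro": (1507, 6),
}

def overlaps(a: tuple, b: tuple) -> bool:
    """True if two (score, margin) intervals intersect."""
    (score_a, margin_a), (score_b, margin_b) = a, b
    return abs(score_a - score_b) <= margin_a + margin_b

leader = scores["claude-sonnet-4.5-thinking"]
for name, (score, margin) in scores.items():
    tied = overlaps(leader, (score, margin))
    print(f"{name}: {score} +/- {margin}, tied with leader: {tied}")
```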

Massive Improvements Over Gemini 2.5
The performance gains represent a substantial leap over Google’s previous generation. On the WebDev leaderboard, Gemini 3 Pro’s 1487 sits 283 points above Gemini 2.5 Pro’s 1204. The improvements extend across other categories as well, with a 50-point increase on the main text Arena (from 1451 to 1501) and a 70-point jump in vision capabilities (from 1258 to 1328).
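The quoted deltas can be recomputed directly from the cited scores; a quick check, with category labels shortened for illustration:

```python
# Generation-over-generation gains recomputed from the scores cited above.
# Pairs are (Gemini 2.5 Pro, Gemini 3 Pro).
deltas = {
    "WebDev": (1204, 1487),
    "Text Arena": (1451, 1501),
    "Vision": (1258, 1328),
}
for track, (old, new) in deltas.items():
    print(f"{track}: {old} -> {new} (+{new - old})")  # +283, +50, +70
```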
Leading Specialized Categories
Gemini 3 Pro’s dominance extends beyond general performance to specialized evaluation tracks. The model ranks first across all major overview categories tracked by LMArena, including Hard Prompts, Coding, Instruction Following, Creative Writing, Multi-Turn conversations, Longer Query handling, and Mathematical reasoning.
Professional Use Cases
On LMArena’s Occupational Leaderboards, which evaluate model performance for specific professional fields, Gemini 3 Pro leads across nearly all tested domains. The model secured top rankings in Software & IT Services, Writing, Literature & Language, Life, Physical & Social Science, Entertainment, Sports & Media, Mathematical fields, and Legal & Government applications.
Competitive Landscape
The achievement positions Google ahead of formidable competition from OpenAI, Anthropic, and xAI. Gemini 3 Pro’s performance surpasses GPT-5, Claude Sonnet 4.5, and Grok 4.1 across multiple benchmarks, signaling an intensifying race among major AI labs to deliver superior language models.
LMArena’s blind testing methodology provides a particularly valuable signal for real-world performance, as it reflects actual user preferences rather than performance on engineered benchmarks alone. The consistent top rankings across diverse categories suggest Gemini 3 Pro delivers strong general-purpose capabilities that translate to practical applications.