China’s GLM 5.2 Beats All OpenAI, Google Models On GDPval-AA Benchmark For Real-World Tasks

Chinese models aren’t merely six months behind US labs. On at least one benchmark of real-world work, the best of them now outscores everything from OpenAI and Google, and trails only Anthropic.

GLM-5.2, the latest model from Beijing-based Knowledge Atlas (Z.AI), has placed third on GDPval-AA v2, Artificial Analysis’s benchmark for real-world, economically valuable knowledge work. The model scored 1524 Elo on the leaderboard, which is anchored to a human baseline of 1,000. Above it are only Claude Fable 5 (1783) and Claude Opus 4.8 (1615) — two Anthropic models. Every OpenAI and Google model sits below it.

GPT-5.5 at its highest reasoning setting scores 1509. Gemini 3.5 Flash, the best Google model on this leaderboard, lands at 1357. GLM-5.2 is ahead of both: by 15 points over GPT-5.5 and by 167 points over Gemini 3.5 Flash, in a field where the gaps between top models have been narrowing.
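To put those gaps in context, here is a minimal sketch of what an Elo difference implies head to head. It assumes the standard logistic Elo formula with a 400-point scale; the article does not say which scale Artificial Analysis uses, so treat the function and its `scale` parameter as illustrative rather than as the leaderboard's actual method.

```python
# Sketch only: assumes the conventional Elo model (base 10, 400-point scale).
# Artificial Analysis's exact rating method is not stated in this article.

def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head wins for A against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# GDPval-AA v2 scores quoted in this article.
ratings = {
    "GLM-5.2": 1524,
    "GPT-5.5": 1509,
    "Gemini 3.5 Flash": 1357,
    "Human baseline": 1000,
}

glm = ratings["GLM-5.2"]
for name, rating in ratings.items():
    if name == "GLM-5.2":
        continue
    share = expected_score(glm, rating)
    print(f"GLM-5.2 vs {name}: expected score {share:.2f}")
```

Under that assumption, the 15-point lead over GPT-5.5 works out to an expected score of about 0.52, close to a coin flip, while the 167-point lead over Gemini 3.5 Flash corresponds to about 0.72.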

What makes the GDPval-AA result more interesting than a standard benchmark finish is what the benchmark actually measures. These aren’t reasoning puzzles or code challenges in isolation — the tasks are agentic, multi-turn, and designed to mirror actual paid knowledge work. GLM-5.2 averaged roughly 31 turns per task across 1,999 matches. That’s not a model being asked to answer a question; it’s a model being asked to do a job, repeatedly, over a long horizon.

The open weights dimension adds another layer. MiniMax-M3, the next open model on the leaderboard, scores 1408 — 116 points behind GLM-5.2. In a space where open models have historically trailed proprietary systems by a wide margin on capability, that gap from open to open is almost as striking as the gap from GLM-5.2 to the proprietary models below it.

The pattern holds on AA-Briefcase, Artificial Analysis’s separate agentic knowledge work benchmark, which combines rubric pass rate, analytical quality, and presentation into a single Elo score. GLM-5.2 again takes the top spot among open models with 1266, behind Claude Fable 5 (1587) and Claude Opus 4.8 (1356), but ahead of GPT-5.5 at xhigh reasoning (1159). On a benchmark built specifically around the kind of work people are paid to do (research, analysis, structured deliverables), GLM-5.2 is outperforming OpenAI’s best publicly available model.

Artificial Analysis put GLM-5.2 and three frontier models through the same set of real professional briefs: a daily task list for a retail supervisor, an IEC emergency-stop circuit schematic, and a moodboard for an orchestral ballad music video. Each deliverable was rendered exactly as the model produced it. GLM-5.2 held its own against Claude Fable 5, GPT-5.5, and Gemini 3.5 Flash across all three.

This isn’t the first time GLM-5.2 has shown up on a major leaderboard. The model placed fourth on the Artificial Analysis Intelligence Index, scoring 51, behind only Claude Fable 5 (60), Claude Opus 4.8 (56), and GPT-5.5 (55). It also leads open-weights models on the Agentic Index, the same category where GDPval-AA sits. These results aren’t isolated; they’re consistent across every evaluation that Artificial Analysis runs on GLM-5.2.

Knowledge Atlas has been releasing at a pace that most Western labs haven’t matched. GLM-5 launched in February, GLM-5.1 followed in late March, and GLM-5.2 dropped in June: three significant model releases in roughly four months. GLM-5.1 had already topped SWE-Bench Pro ahead of GPT-5.4 and Claude Opus 4.6, making it the first Chinese model to lead that leaderboard. GLM-5.2 followed that with a top-three finish on a different and arguably more practically relevant benchmark.

All of this is happening on Huawei Ascend chips. Knowledge Atlas has been on the US Entity List since January 2025, meaning it has no access to Nvidia hardware. The export control thesis — that restricting chip access would slow Chinese AI development — keeps running into GLM releases.

The price point adds to the picture. GLM-5.2 is priced at $1.40 per million input tokens and $4.40 per million output tokens. Claude Opus 4.8 costs $15 per million input and $75 per million output. With an open model at that price ranking alongside proprietary frontier systems on agentic work benchmarks built around real professional tasks, the gap between Chinese and American AI labs looks considerably different from what the conventional narrative suggested even twelve months ago.
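The price difference is easier to feel on a concrete job. The sketch below uses the list prices quoted above; the workload of 2 million input tokens and 500,000 output tokens is a made-up figure for illustration, not a measurement from either model.

```python
# List prices quoted in this article, in USD per million tokens.
PRICES_USD_PER_MILLION = {
    "GLM-5.2": {"input": 1.40, "output": 4.40},
    "Claude Opus 4.8": {"input": 15.00, "output": 75.00},
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one job at the quoted per-million-token prices."""
    price = PRICES_USD_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Hypothetical workload (an assumption, not data from the benchmark).
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 500_000

glm_cost = job_cost("GLM-5.2", INPUT_TOKENS, OUTPUT_TOKENS)
opus_cost = job_cost("Claude Opus 4.8", INPUT_TOKENS, OUTPUT_TOKENS)
print(f"GLM-5.2:         ${glm_cost:.2f}")
print(f"Claude Opus 4.8: ${opus_cost:.2f}")
print(f"Ratio:           {opus_cost / glm_cost:.1f}x")
```

On that hypothetical mix, the same job costs about $5 on GLM-5.2 and about $67.50 on Claude Opus 4.8. The exact ratio shifts with the balance of input and output tokens, since the output price gap (roughly 17x) is wider than the input gap (roughly 11x).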
