China’s open models aren’t just catching up to frontier labs; at times they’re surpassing them.
Z.ai’s GLM-5.1 has claimed the third spot on Code Arena’s agentic webdev leaderboard — the first open-weight model ever to break into the top three. With a score of 1530, it sits just behind Anthropic’s claude-opus-4-6-thinking (1548) and claude-opus-4-6 (1542), while outpacing claude-sonnet-4-6 (rank 4, 1521), GPT-5.4-high (rank 7, 1457), and Gemini 3.1 Pro Preview (rank 8, 1456).
The result isn’t a minor shuffling. GLM-5.1 represents a +90-point jump over its predecessor GLM-5, and a +100-point lead over Kimi K2.5 Thinking. On a leaderboard where a handful of points separates models, that kind of gap signals a genuine step change.

What Is Code Arena?
Code Arena (arena.ai/code) ranks AI models on agentic webdev tasks using blind human evaluations — users rate outputs without knowing which model produced them. It’s one of the more ecologically valid benchmarks available, since it reflects real developer judgment on real tasks, rather than curated test sets that models might be trained to game.
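The article doesn’t detail the scoring math, but arena-style leaderboards built on blind pairwise votes typically maintain Elo-style ratings, updated one comparison at a time. A minimal sketch under that assumption (the `elo_update` helper, the K-factor of 32, and the ratings used below are illustrative, not Code Arena’s actual parameters):

```python
def elo_update(winner_rating, loser_rating, k=32.0):
    """Apply one Elo-style update after a single blind pairwise vote."""
    # Expected win probability for the winner under the Elo logistic model
    expected = 1.0 / (1.0 + 10 ** ((loser_rating - winner_rating) / 400.0))
    delta = k * (1.0 - expected)  # smaller gain when the favorite wins
    return winner_rating + delta, loser_rating - delta

# Illustrative: a 1521-rated model wins a blind vote against a 1530-rated
# one; because it was the slight underdog, both ratings move by a bit more
# than half of k.
new_winner, new_loser = elo_update(1521, 1530)
```

The practical upshot is why small leaderboard gaps matter: each vote shifts ratings by at most `k` points, so a sustained 90-point lead reflects many consistent human preferences, not a handful of lucky comparisons.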
The Broader Context: China’s Open Model Surge
GLM-5.1’s achievement isn’t an isolated data point. It’s the latest in a sustained push by Chinese AI labs into — and past — the frontier.
Chinese open-source models have been steadily displacing US alternatives as the open model of choice among developers globally. Moonshot AI’s Kimi K2.5, for instance, topped the Artificial Analysis Intelligence Index as the strongest open model, outperforming Claude 4.5 Sonnet, while MiniMax’s M2 placed 5th overall, besting Gemini 2.5 Pro and Claude 4.1 Opus. OpenRouter data shows that Chinese open models have eaten into the market share of every other open alternative, even as the overall open-vs-proprietary split holds steady.
The strategic advantage for Chinese labs is becoming clearer: they release powerful models openly, often at a fraction of the cost of their US counterparts, while US frontier labs increasingly lock their best models behind APIs. Former Google CEO Eric Schmidt has flagged this as a geopolitical risk for US AI influence, particularly in developing markets where cost and accessibility determine adoption.
Why GLM-5.1 On Code Arena Matters
Most benchmark victories for Chinese open models have come on static or automated evaluations — helpful, but limited. Code Arena is different. It’s human preference on agentic coding tasks, the exact use case where developers are making real deployment decisions.
Placing third on that leaderboard — ahead of GPT-5.4 and Gemini 3.1 Pro Preview — means GLM-5.1 isn’t just winning on paper. It’s winning with users. That’s a harder bar to clear and a more meaningful signal for enterprise buyers evaluating coding assistants.
It also raises the stakes for OpenAI and Google. Being outranked by an open model on a coding-specific, human-preference benchmark is the kind of result that forces internal post-mortems.
What’s Next
With GLM-5.1, Z.ai has put itself firmly on the map for developers who care about coding capability. The question is whether this ranking holds, or whether it’s a waypoint in a model family that keeps climbing.
Given the trajectory of Chinese labs over the past year, betting against further progress looks like the wrong call.