China’s Z.AI Releases GLM-5.1, Beats All US Models On SWE-Bench Pro

A Chinese model is now best in the world at a crucial coding benchmark.

Z.AI, the Beijing-based lab formerly known as Zhipu AI, has released GLM-5.1, and its headline number is hard to ignore. The model scored 58.4 on SWE-Bench Pro, the industry's toughest software engineering evaluation, clearing GPT-5.4 (57.7), Claude Opus 4.6 (57.3), and Gemini 3.1 Pro (54.2). It is the first Chinese model to top the SWE-Bench Pro leaderboard, and it was trained without any Nvidia hardware.


The SWE-Bench Pro Result

SWE-Bench Pro tests models on complex, real-world GitHub issues: the kind of multi-file bugs and system-level refactors that separate capable coding agents from autocomplete engines. A 58.4 puts GLM-5.1 0.7 points ahead of GPT-5.4 and 1.1 points ahead of Claude Opus 4.6. In a field where frontier models are separated by fractions of a point, that margin is meaningful.

[Chart: GLM-5.1 performance on SWE-Bench Pro]

Z.AI describes GLM-5.1 as a post-training upgrade to GLM-5: the same 744B-parameter Mixture-of-Experts architecture (40B active per token) and the same 200K context window, but with the reinforcement learning pipeline retargeted specifically at coding distributions. The base GLM-5 had already established itself as the first open model to score 50+ on the Artificial Analysis Intelligence Index, beating Gemini 3 Pro. GLM-5.1 pushes that further.
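For readers unfamiliar with the Mixture-of-Experts design, the sketch below shows the general pattern: a router scores a pool of expert networks and activates only the top-k per token, which is how a 744B-parameter model can run with just 40B parameters active per token. This is a generic illustration in PyTorch, not Z.AI's implementation; every dimension, name, and expert count is a toy placeholder.

```python
# Generic top-k Mixture-of-Experts routing, for illustration only.
# Dimensions and expert counts are toy placeholders, not GLM-5.1's.
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # top-k experts per token
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Production MoE systems vectorize the routing and load-balance experts across devices; the double loop here trades efficiency for readability.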

According to Z.AI, the model can run autonomously for up to eight hours, refining strategies across thousands of iterations — a capability the company calls “long-horizon agentic engineering.”
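In loose terms, that is a control loop bounded by wall-clock time and an iteration cap, keeping the best strategy found so far. The toy sketch below illustrates only the shape of such a loop; it is a guess at the pattern, not Z.AI's harness, and the objective function is a stand-in for whatever signal (passing tests, successful builds) a real agent would optimize.

```python
import random
import time

# Toy sketch of a time-budgeted "long-horizon" loop: run until a wall-clock
# deadline or an iteration cap, keeping the best candidate found so far.
# Nothing here is Z.AI's harness; the objective is a placeholder.
def evaluate(candidate):
    return -abs(candidate - 42)  # placeholder objective: closer to 42 is better

def long_horizon_run(budget_seconds=2.0, max_iterations=10_000):
    deadline = time.monotonic() + budget_seconds  # 8 hours in the real setting
    best, best_score = None, float("-inf")
    for _ in range(max_iterations):
        if time.monotonic() >= deadline:
            break  # respect the wall-clock budget
        candidate = random.uniform(0, 100)  # a real agent proposes a next step here
        score = evaluate(candidate)
        if score > best_score:  # refine: keep the best strategy seen so far
            best, best_score = candidate, score
    return best, best_score

print(long_horizon_run())
```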

Coding Dominance Across the Board

The SWE-Bench Pro result is not a one-off. GLM-5.1's coding strength extends across multiple benchmarks.

[Chart: Z.AI GLM-5.1 benchmark comparison]
  • NL2Repo (42.7): Top score among the models listed, ahead of runner-up GPT-5.4 (41.3). This benchmark tests a model's ability to generate entire repository structures from natural language descriptions.
  • Terminal-Bench 2.0 (63.5 on Terminus-2 / 66.5 with Claude Code harness): Top-3 globally. Terminal-Bench evaluates agents completing long, multi-step shell tasks with real execution environments.
  • CyberGym (68.7): The highest score among listed models, well ahead of Claude Opus 4.6 (66.6) and DeepSeek-V3.2 (17.3). CyberGym tests cybersecurity reasoning under adversarial conditions.

The NL2Repo and CyberGym results are particularly notable — they test very different ends of the software engineering spectrum, and GLM-5.1 leads on both.


Agentic Performance

Beyond raw coding, GLM-5.1 performs strongly across agentic benchmarks — tasks that require sustained multi-step reasoning, tool use, and goal tracking:

  • BrowseComp (68.0 / 79.3 with context management): Top open-model score, trailing only proprietary systems on the context-managed variant.
  • MCP-Atlas (71.8): Close behind Qwen3.6-Plus (74.1) and Claude Opus 4.6 (73.8). This benchmark tests multi-step tool invocation across real APIs.
  • τ³-Bench (70.6): Competitive with GPT-5.4 (72.9) and Claude Opus 4.6 (72.4).
  • Vending Bench 2 ($5,634): GLM-5.1 runs a simulated vending business across a full year and finishes with the second-highest balance, behind Claude Opus 4.6's $8,017. Vending Bench 2 is one of the few benchmarks that directly proxies economic decision-making under uncertainty.

Chinese open models have been increasingly dominant on agentic tasks, and GLM-5.1 continues that pattern.


Where US Models Still Lead

The picture is not uniformly in GLM-5.1’s favor. On reasoning benchmarks, US and other frontier models hold an edge:

  • HLE (31.0): On Humanity's Last Exam, GLM-5.1 trails Claude Opus 4.6 (36.7), GPT-5.4 (39.8), and Gemini 3.1 Pro (45.0).
  • GPQA-Diamond (86.2): Behind Gemini 3.1 Pro (94.3), GPT-5.4 (92.0), and Claude Opus 4.6 (91.3).
  • AIME 2026 (95.3): Trailing GPT-5.4 (98.7) and Gemini 3.1 Pro (98.2).

These gaps suggest GLM-5.1’s engineering is deliberately targeted — the RL pipeline has been optimized for practical coding and agentic execution, not pure mathematical reasoning. It’s a tradeoff that reflects Z.AI’s explicit positioning around “Agentic Engineering” rather than general-purpose intelligence.

Anthropic CEO Dario Amodei has previously argued that Chinese models tend to be benchmark-optimized and distilled from US labs. Whether that critique applies to GLM-5.1’s SWE-Bench Pro result specifically will depend on independent verification — Z.AI’s benchmarks are self-reported, though its prior SWE-Bench Verified scores for GLM-5 held up well under third-party testing.


The Hardware Story

There is a dimension here that goes beyond AI performance. GLM-5.1 was trained entirely on Huawei Ascend 910B chips using Huawei’s MindSpore framework — no Nvidia, no AMD, no American silicon. Zhipu AI has been on the US Entity List since January 2025, effectively barred from acquiring US-manufactured accelerators. The result is a model that tops a key global benchmark despite operating entirely outside the Western AI hardware stack.

Z.AI completed a Hong Kong IPO in January 2026, raising approximately $558 million USD, and the capital is visibly accelerating its release cadence: GLM-5 launched February 11, GLM-5-Turbo on March 15, and GLM-5.1 on March 27. That is three significant model releases in roughly six weeks.


Pricing and Access

GLM-5.1 is available to all GLM Coding Plan subscribers, with plans starting at $3/month (promotional) and a standard rate from $10/month. API access is priced at $1.00/M input tokens and $3.20/M output tokens. For context, Claude Max runs $100–200/month.
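At those rates the arithmetic is easy to run. The snippet below estimates the cost of a single large request; only the $1.00/M and $3.20/M rates come from Z.AI's pricing, and the token counts are hypothetical.

```python
# Per-request cost at GLM-5.1's published API rates. Only the two rates come
# from the announcement; the token counts below are hypothetical.
INPUT_RATE = 1.00 / 1_000_000   # dollars per input token
OUTPUT_RATE = 3.20 / 1_000_000  # dollars per output token

def request_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: one agentic call that fills the full 200K context and emits a
# 20K-token patch.
print(f"${request_cost(200_000, 20_000):.3f}")  # $0.264
```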

Z.AI has confirmed GLM-5.1 will be open-sourced, though no timeline has been set. The GLM-5 base model is already available on HuggingFace under an MIT license.
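Loading the already-open GLM-5 base weights follows the standard transformers pattern sketched below. The repository id zai-org/GLM-5 is an assumption based on Z.AI's HuggingFace organization, and GLM checkpoints have historically required trust_remote_code; verify both on the actual model card before running this.

```python
# Sketch of loading the open GLM-5 base weights with transformers.
# The repo id "zai-org/GLM-5" is an assumption based on Z.AI's HF org;
# verify it, and whether trust_remote_code is required, on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-5"  # assumed repo id, not confirmed
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # shard across available GPUs
    trust_remote_code=True,
)

inputs = tokenizer("Write a binary search in Python.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```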


The Bigger Picture

With 80% of startups now gravitating toward Chinese open models according to Andreessen Horowitz data, GLM-5.1’s SWE-Bench Pro result arrives at a moment when the competitive stakes couldn’t be higher. Software engineering is the benchmark that matters most to enterprise AI buyers — it’s where models either earn their place in production pipelines or don’t.

GLM-5.1 just made a compelling case.
