GLM 5.2 Scores 77% On ARC-AGI-1 & 22.8% On ARC-AGI-2, Highest For An Open Model

GLM 5.2 has done well on many real-world benchmarks, and it also appears to be ahead of other open models on ARC-AGI.

Z.ai’s GLM 5.2 has posted an ARC-AGI-1 score of 77.0% at $0.19 per task and an ARC-AGI-2 score of 22.8% at $0.25 per task, according to ARC Prize’s verified leaderboard. Both are the highest scores recorded for any open-weight model on either benchmark. The model uses chain-of-thought (CoT) reasoning and performs comparably to GPT-5.4 and GPT-5.5 at their low reasoning-effort settings, a useful reference point since both are proprietary models from OpenAI.

ARC-AGI is designed to test something closer to fluid intelligence than the standard capabilities benchmarks. Its grid-based puzzle tasks are structured so that memorization and pattern-matching don’t help — a model has to infer a novel rule from just a few examples and apply it correctly to an unseen case. Scores on ARC-AGI-1 have climbed steadily since 2024; ARC-AGI-2, introduced in March 2025, pushed the difficulty further with multi-step reasoning, sequential rule application, and symbolic interpretation. When it launched, the best reasoning models barely cleared 1%.

GLM 5.2’s 22.8% on ARC-AGI-2, while the highest among open models, still sits far behind the frontier. Gemini 3.1 Pro, a model released several months earlier, scores 77.1% on ARC-AGI-2, and Gemini 3 Deep Think has reached 84.6%. The gap between 22.8% and 77.1% is 54.3 percentage points, which means Gemini 3.1 Pro solves more than three times as many tasks. These aren’t just different points on the same curve; they reflect a qualitative difference in how well a model handles compositional generalization, which is the kind of reasoning ARC-AGI-2 is specifically built to probe.
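
For readers who want to check the arithmetic, here is a minimal Python sketch that computes the ARC-AGI-2 gaps from the scores quoted in this article. The dictionary and the helper names (`gap_points`, `score_ratio`) are illustrative and are not part of any ARC Prize tooling.

```python
# ARC-AGI-2 scores (percent) exactly as quoted in this article.
ARC_AGI_2_SCORES = {
    "GLM 5.2": 22.8,
    "Gemini 3.1 Pro": 77.1,
    "Gemini 3 Deep Think": 84.6,
}


def gap_points(leader: str, follower: str) -> float:
    """Absolute gap in percentage points, rounded to one decimal place."""
    return round(ARC_AGI_2_SCORES[leader] - ARC_AGI_2_SCORES[follower], 1)


def score_ratio(leader: str, follower: str) -> float:
    """How many times higher the leader's score is, rounded to two decimals."""
    return round(ARC_AGI_2_SCORES[leader] / ARC_AGI_2_SCORES[follower], 2)


if __name__ == "__main__":
    for leader in ("Gemini 3.1 Pro", "Gemini 3 Deep Think"):
        points = gap_points(leader, "GLM 5.2")
        ratio = score_ratio(leader, "GLM 5.2")
        print(f"{leader} vs GLM 5.2: {points} points, {ratio}x")
```

Percentage points and ratios tell slightly different stories. The point gap shows how many more tasks out of 100 the leader solves, while the ratio shows how far the open model would have to multiply its current score to catch up.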

That gap says something broader about where open models are in relation to AGI-like tasks. ARC-AGI-2 was designed to resist saturation, and for much of last year it succeeded: even frontier closed models struggled to get past 50%. The leap to the high seventies came from Gemini 3.1 Pro in February 2026, and it required both architectural scale and training investment that most labs, open or closed, haven’t matched. The fact that GLM 5.2 is leading the open-weight field at 22.8% is a real milestone; the fact that the open-weight field is still more than 60 percentage points behind the best closed score of 84.6% on this particular benchmark suggests that general reasoning at scale remains firmly in the territory of well-resourced frontier labs.

GLM 5.2 has otherwise performed impressively across the benchmarks that matter for real-world deployment. On coding and engineering tasks, it tops the Artificial Analysis Intelligence Index with a score of 51, clearing DeepSeek V4 Pro and MiniMax-M3, and becoming the first open-source Chinese model to rank ahead of all Google models on that index. It scores 62.1 on SWE-bench Pro, ahead of GPT-5.5’s 58.6. The model ships under an MIT license, has 744 billion total parameters with 40 billion active per token, and supports a one-million-token context window, a fivefold increase over its predecessor.
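
To put the parameter counts and the SWE-bench Pro margin in perspective, the short sketch below works out the share of weights that is active at any one time and the score difference. It uses only the figures quoted in this article; the constant names are illustrative.

```python
# Figures as quoted in this article (parameters in billions).
TOTAL_PARAMS_B = 744
ACTIVE_PARAMS_B = 40

SWE_BENCH_PRO = {"GLM 5.2": 62.1, "GPT-5.5": 58.6}

# Share of the model's weights that is active at once.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Lead on SWE-bench Pro, in points, rounded to one decimal place.
swe_margin = round(SWE_BENCH_PRO["GLM 5.2"] - SWE_BENCH_PRO["GPT-5.5"], 1)

if __name__ == "__main__":
    print(f"Active share of parameters: {active_fraction:.1%}")
    print(f"SWE-bench Pro margin over GPT-5.5: {swe_margin} points")
```

Only about one weight in nineteen is active at a time, which is why a model of this total size can be served at a per-task cost in the range reported above.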

The architecture includes an optimization called IndexShare, which shares a single attention index across multiple sparse layers rather than recomputing it for each one, cutting per-token compute roughly threefold at long context lengths. For developers working with large codebases or agentic workflows, that efficiency matters. Shares of Knowledge Atlas, the publicly listed entity connected to Z.ai, have roughly doubled since the model’s release, a reflection of how seriously the market is taking the GLM 5.2 numbers.
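
The general idea of reusing one attention index across several sparse layers can be illustrated with a toy Python sketch. This is not Z.ai's implementation, whose internals are not described here. It assumes the index is a top-k selection of key positions, reuses the same keys and values in every layer for simplicity, and uses hypothetical names (`TopKIndexer`, `run_layers`). It shows only that sharing reduces the number of index computations and does not reproduce the threefold figure.

```python
import math
from typing import List, Optional

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class TopKIndexer:
    """Hypothetical indexer: scores every key and keeps the top-k positions."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.calls = 0  # number of full passes over the keys

    def select(self, query: Vector, keys: List[Vector]) -> List[int]:
        self.calls += 1
        ranked = sorted(range(len(keys)),
                        key=lambda i: dot(query, keys[i]), reverse=True)
        return sorted(ranked[: self.k])


def sparse_attention(query: Vector, keys: List[Vector],
                     values: List[Vector], selected: List[int]) -> Vector:
    """Softmax attention restricted to the selected key positions."""
    scores = [dot(query, keys[i]) for i in selected]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
            for d in range(len(values[0]))]


def run_layers(query: Vector, keys: List[Vector], values: List[Vector],
               indexer: TopKIndexer, num_layers: int,
               share_index: bool) -> Vector:
    """Apply several sparse layers, optionally reusing the first index."""
    selected: Optional[List[int]] = None
    hidden = query
    for _ in range(num_layers):
        if selected is None or not share_index:
            selected = indexer.select(hidden, keys)
        hidden = sparse_attention(hidden, keys, values, selected)
    return hidden


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
values = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7], [0.0, 0.3]]
query = [1.0, 0.2]

shared = TopKIndexer(k=2)
run_layers(query, keys, values, shared, num_layers=3, share_index=True)

per_layer = TopKIndexer(k=2)
run_layers(query, keys, values, per_layer, num_layers=3, share_index=False)

print(shared.calls, per_layer.calls)
```

In this toy setup the shared variant scans the full key list once for three layers, while the per-layer variant scans it three times. The saving matters most at long context lengths, where the scan over all keys is the expensive part of each sparse layer.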

The ARC-AGI-2 result is a genuine achievement in context. No open-weight model has posted a higher verified score. At the same time, 22.8% on a benchmark where the ceiling is approaching 85% makes clear that whatever is driving the top closed models — data, compute, reinforcement learning approaches, or some combination — hasn’t been replicated in the open ecosystem yet. Whether it eventually will be is an open question, but on the basis of current scores, the generalization gap between the best open models and the frontier on tasks like ARC-AGI-2 looks more like a structural divide than a lag that closes in a few release cycles.