GLM 5.2 continues to impress across different kinds of benchmarks, and it is outperforming models from many frontier labs in the process.
The latest data comes from Andon Labs’ Vending-Bench 2, a simulation benchmark that measures how well AI models run a vending machine business over a 365-day period. GLM 5.2 finished second overall, ending the year with roughly $8,000 in simulated balance. Only Claude Opus 4.7 cleared more, finishing north of $10,000. GPT-5.5 came in third. GLM 5.1 and GLM-5 trailed further behind, ending around $5,500 and $4,200 respectively.

Vending-Bench is designed to test agentic decision-making over long time horizons, the kind of sustained, practical reasoning that standard coding or math benchmarks don’t capture. A model running the simulation has to manage inventory, pricing, and restocking decisions across hundreds of sequential steps. Mistakes compound, and good judgment early pays off later. It’s a meaningful test of whether a model can actually operate autonomously on a realistic task, and GLM 5.2 finishing ahead of every Google and OpenAI model on it is a result worth paying attention to.
What’s more striking than the Vending-Bench result itself is the trajectory the GLM series has shown. Andon Labs plotted GLM performance against release date, and the improvement across GLM-4.7, GLM-5, GLM-5.1, and GLM-5.2 fits a linear trend with an R² of 0.99 — essentially a straight line — gaining roughly $995 in simulated balance per month. That kind of consistency across four successive releases is unusual. Most model families show uneven jumps, with some releases moving the needle significantly and others less so. The GLM line has been almost mechanically steady.
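
For readers who want to see what a claim like "a linear trend with an R² of 0.99" involves, here is a minimal sketch of an ordinary least-squares fit in plain Python. The function name `linear_fit` and the sample numbers are assumptions made for illustration only; they are not Andon Labs' data or method, which the article does not publish in full.

```python
from typing import Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = slope * x + intercept by least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations in x, and the x-y cross term.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0

    return slope, intercept, r_squared


if __name__ == "__main__":
    # HYPOTHETICAL placeholder values, not real benchmark results:
    # months since the first release, and simulated end-of-year balance in dollars.
    months = [0.0, 2.0, 4.0, 6.0]
    balances = [1000.0, 3100.0, 4900.0, 7000.0]

    slope, intercept, r2 = linear_fit(months, balances)
    print(f"slope: {slope:.1f} dollars per month")
    print(f"intercept: {intercept:.1f} dollars")
    print(f"R^2: {r2:.4f}")
```

With release dates on the x-axis (in months) and final balance on the y-axis, the slope corresponds to the "dollars gained per month" figure, and R² measures how closely the points sit on the fitted line. Note that with only four points, a high R² is easier to achieve than with a longer series.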

This fits a broader pattern that has been building around GLM 5.2 since its June 13 release. On ARC-AGI, it scored 77% on ARC-AGI-1 and 22.8% on ARC-AGI-2, the highest verified scores for any open-weight model on either benchmark. It became the first open-source Chinese model to rank above every Google model on the Artificial Analysis leaderboard. On SWE-bench Pro it scored 62.1, ahead of GPT-5.5’s 58.6. And in a head-to-head research reproduction test, GLM 5.2 completed a complex machine learning paper reproduction task for $6.21, compared to $46.35 for Claude Opus 4.8 doing the same job, a cost gap of roughly 7.5 times.
The Vending-Bench result adds something to that picture that the other benchmarks don’t directly test: the ability to hold a coherent strategy across a long, sequential, economically consequential task. Scoring well on a coding benchmark doesn’t tell you much about whether a model can run a business for a year without going off the rails. GLM 5.2 can, apparently, and better than most of the competition.
Z.ai’s model ships under an MIT license with 744 billion total parameters, of which 40 billion are active per inference call, and a one-million-token context window. The architecture includes an optimization called IndexShare that significantly reduces per-token compute at long context lengths. This matters for agentic tasks, where a model is processing a growing history of decisions and outcomes. The efficiency is part of why the cost numbers on GLM 5.2 have been so striking across different evaluations.
The Vending-Bench result is another data point in what has become a consistent story about GLM 5.2: across benchmark types, across the labs running the evaluations, and across the kinds of tasks being tested, the model keeps showing up at or near the top of the rankings. The trend line Andon Labs charted, a gain of nearly $1,000 per month across successive releases, suggests the next GLM iteration will be watched closely.