More and more people and organizations are comparing the open Chinese GLM 5.2 model with the state of the art from US frontier labs, and many are coming away impressed by the Chinese offering’s performance.
The latest data point comes from alphaXiv, an AI-powered research platform built on top of arXiv. The team ran both models through a demanding one-shot task: reproduce the results of the SDPO paper from scratch, including debugging a notoriously finicky training framework and running full ablations to confirm the paper’s claims.
GLM 5.2 completed the task for $6.21. Claude Opus 4.8 cost $46.35 to do the same thing.

What SDPO Is, And Why Reproducing It Is Hard
SDPO stands for Self-Distillation Policy Optimization, a method for training large language models using reinforcement learning. The core idea is that current RL training approaches for LLMs — collectively called RLVR, or reinforcement learning with verifiable rewards — learn only from a binary signal: did the model get the answer right or wrong. SDPO addresses what the paper calls a “severe credit-assignment bottleneck” by using the richer textual feedback that many training environments already provide, such as runtime errors or detailed judge evaluations. Instead of discarding that information, SDPO feeds it back to the model as a kind of self-teaching signal — the model reads its own failure feedback in context and uses that to update its behavior.
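To make the contrast concrete, here is a minimal sketch of the two kinds of signal described above. It is not the paper's implementation and does not show the actual distillation update; the `Attempt` class, the `feedback_context` template, and the example task are all hypothetical, chosen only to show how much information a binary reward throws away.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One rollout: the task, the model's answer, and the environment's verdict."""
    prompt: str
    response: str
    passed: bool
    feedback: str  # e.g. a runtime error or a judge's comment


def binary_reward(attempt: Attempt) -> float:
    """RLVR-style signal: only 'right' or 'wrong' survives."""
    return 1.0 if attempt.passed else 0.0


def feedback_context(attempt: Attempt) -> str:
    """Hypothetical template that puts the failed answer and its feedback back in context."""
    return (
        f"Task:\n{attempt.prompt}\n\n"
        f"Previous attempt:\n{attempt.response}\n\n"
        f"Feedback:\n{attempt.feedback}\n\n"
        "Revised answer:"
    )


# Illustrative example, not taken from the paper.
attempt = Attempt(
    prompt="Write a function that returns the mean of a list.",
    response="def mean(xs): return sum(xs) / len(xs)",
    passed=False,
    feedback="ZeroDivisionError: division by zero on input []",
)

print(binary_reward(attempt))     # the scalar says only that the attempt failed
print(feedback_context(attempt))  # the text says why it failed
```

The scalar tells the optimizer that something went wrong; the feedback-augmented context also carries where and why, which is the extra information the article describes SDPO as exploiting.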
The paper shows SDPO outperforming a strong GRPO baseline on several reasoning benchmarks, reaching 68.8% accuracy versus 64.1%, and achieving the same final accuracy as GRPO in up to four times fewer training runs.
Reproducing this kind of paper in practice requires more than reading the code. The SDPO experiments run on verl, a widely used but notoriously setup-sensitive open-source RL training framework from ByteDance. Getting verl to run correctly, with the right dependencies, the right GPU configuration, and no cryptic errors, is an engineering task in itself. Then the model has to actually run the ablations the paper describes, compare the results, and confirm that the claims hold.
That’s what alphaXiv’s autoresearch tool tested, using a one-shot prompt with no human intervention mid-task.
What Each Model Did
Both models spent the majority of their tokens dealing with the initial verl setup problems before any real training could begin. GLM 5.2 hit 14 failed runs before achieving its first successful execution; Opus 4.8 needed 9. On that dimension, Opus 4.8 was more efficient at navigating the environment.
The token counts tell a different story: GLM 5.2 used 2.65 million tokens (excluding re-reads of files) while Opus 4.8 used 4.53 million. Despite needing more attempts to get the codebase working, GLM 5.2 was notably more token-efficient overall.
The cost gap is stark. At current API pricing, Opus 4.8’s $46.35 bill versus GLM 5.2’s $6.21 is roughly a 7.5x difference for the same completed task. alphaXiv notes this is not a comprehensive benchmark — it’s one paper, one run, one framework — but the result is hard to dismiss.
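The arithmetic behind these figures is easy to check. The sketch below uses only the numbers reported above; the "blended" dollars-per-million-tokens figure is an illustrative derived quantity, not an official API price, and since the token counts exclude file re-reads it should be read as indicative only.

```python
# Figures as reported in the alphaXiv comparison.
runs = {
    "GLM 5.2": {"cost_usd": 6.21, "tokens_millions": 2.65, "failed_runs": 14},
    "Claude Opus 4.8": {"cost_usd": 46.35, "tokens_millions": 4.53, "failed_runs": 9},
}

glm = runs["GLM 5.2"]
opus = runs["Claude Opus 4.8"]

# How many times more expensive and more token-hungry the Opus run was.
cost_ratio = opus["cost_usd"] / glm["cost_usd"]
token_ratio = opus["tokens_millions"] / glm["tokens_millions"]


def blended_usd_per_million(run: dict) -> float:
    """Total bill divided by counted tokens (illustrative, not a list price)."""
    return run["cost_usd"] / run["tokens_millions"]


print(f"Cost ratio:  {cost_ratio:.2f}x")
print(f"Token ratio: {token_ratio:.2f}x")
for name, run in runs.items():
    print(f"{name}: ~${blended_usd_per_million(run):.2f} per million counted tokens")
```

The cost ratio comes out at about 7.46, which the article rounds to 7.5x. Splitting it up, Opus used about 1.7 times as many tokens, while its blended cost per counted token was about 4.4 times higher, so most of the gap comes from pricing rather than token volume.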
Why This Matters
The SDPO reproduction test is a proxy for something researchers and AI engineers care about practically: can a model be trusted to handle a difficult, multi-step technical task with minimal hand-holding? The task involves reading a research paper, understanding what it claims, setting up a complex codebase that may be broken or under-documented, debugging it repeatedly, running experiments, and then checking whether the output matches the paper’s numbers. That is a real workflow, not a synthetic benchmark.
alphaXiv’s framing is direct: “it’s clear that we finally have an open model that can be trusted and depended upon on difficult research tasks.”
GLM 5.2 has been picking up significant attention since its release on June 13, 2026. Developed by Zhipu AI, which now operates under the Z.AI brand, the model is a 753-billion-parameter Mixture-of-Experts architecture with only around 40 billion parameters active per token, which keeps inference costs relatively low despite the large total parameter count. It ships under an MIT license with no geographic restrictions, making it available to any developer worldwide. On SWE-bench Pro it scores 62.1, ahead of GPT-5.5 at 58.6, and its FrontierSWE score of 74.4% sits within one point of Opus 4.8.
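A quick back-of-the-envelope calculation shows how sparse that architecture is. The sketch uses only the figures quoted above; because the active-parameter count is given as "around 40 billion", the resulting fraction is approximate.

```python
# Parameter counts in billions, as quoted in the article.
total_params_b = 753
active_params_b = 40  # approximate ("around 40 billion")

# Share of the model's weights that participate in each token's forward pass.
active_fraction = active_params_b / total_params_b

# SWE-bench Pro margin over GPT-5.5, from the reported scores.
swe_bench_pro_margin = round(62.1 - 58.6, 1)

print(f"Active per token: ~{active_fraction:.1%} of total parameters")
print(f"SWE-bench Pro margin over GPT-5.5: {swe_bench_pro_margin} points")
```

Roughly one parameter in nineteen is active for any given token, which is why per-token compute is much closer to that of a 40-billion-parameter dense model than to a 753-billion-parameter one.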
The stock of Knowledge Atlas, the publicly listed entity behind Z.AI, roughly doubled in the days after GLM 5.2’s launch. Chinese open-source models have dominated the open-weights leaderboards for months now, with Kimi, MiniMax, GLM, and DeepSeek collectively pushing Western open-source offerings well down the rankings.
The cost argument is the one that tends to move enterprise decisions. Reproducing a research paper at frontier quality for about $6 instead of about $46 changes what’s economically viable. alphaXiv has made the full reproduction artifacts, including both the GLM and Opus runs, publicly available at openresearch.sh/public. Users can run the same autoresearch tool themselves by replacing ‘arxiv’ with ‘autoarxiv’ in any arXiv paper URL.
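The URL swap can be scripted. This is a minimal sketch that assumes the replacement applies to the hostname (so arxiv.org becomes autoarxiv.org); the helper name `to_autoarxiv_url` and the placeholder paper ID are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_autoarxiv_url(url: str) -> str:
    """Rewrite an arXiv paper URL by replacing 'arxiv' with 'autoarxiv' in the host.

    Assumption: only the hostname changes; path and query are kept as-is.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"Not an arXiv URL: {url}")
    new_host = host.replace("arxiv", "autoarxiv", 1)
    return urlunsplit(parts._replace(netloc=new_host))


# Placeholder paper ID, for illustration only.
print(to_autoarxiv_url("https://arxiv.org/abs/0000.00000"))
```

Restricting the rewrite to the hostname avoids accidentally altering an 'arxiv' substring elsewhere in the URL.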
It’s worth noting this experiment compared two specific models under specific conditions, and a different paper or a different RL codebase might have produced a different cost ratio. Still, the direction of the result, an open Chinese model completing a frontier-quality research task at a fraction of the cost, fits a pattern that has been repeating across benchmarks and real-world tests throughout 2026.