Z.ai’s GLM-5.2 has gone viral in recent days, and a top CEO has now shared how the model compares to another on the frontier.
Snowflake CEO Sridhar Ramaswamy has posted a detailed breakdown comparing Z.ai’s GLM-5.2 with Anthropic’s Claude Opus 4.7 on dbt-bench, a benchmark designed to evaluate AI models on data transformation and analytics engineering tasks. The findings suggest that while the two models end up with nearly identical overall success rates, they get there in very different ways.

The analysis came from Snowflake’s Coco team, which ran 103 dbt tasks with three trials each on both models. The headline numbers show an almost dead heat. GLM-5.2 achieved a Pass@3 score of 66 percent, while Opus 4.7 came in at 67 percent. At the first-attempt level, however, Opus held a clearer lead, scoring 53.7 percent on Pass@1 compared to GLM’s 47.6 percent.
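For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from per-trial outcomes. It assumes Pass@1 is the mean success rate across all individual trials and Pass@3 is the share of tasks solved in at least one of three trials; the article does not state the exact definitions Snowflake used, and the task names and outcomes here are made up for illustration.

```python
from typing import Dict, List

# Hypothetical outcomes: task name -> pass/fail for each of three trials.
# These are illustrative values, not data from the Snowflake run.
results: Dict[str, List[bool]] = {
    "task_a": [True, False, True],
    "task_b": [False, False, False],
    "task_c": [False, True, False],
    "task_d": [True, True, True],
}


def pass_at_1(outcomes: Dict[str, List[bool]]) -> float:
    """Assumed definition: fraction of all individual trials that passed."""
    trials = [ok for task_trials in outcomes.values() for ok in task_trials]
    return sum(trials) / len(trials)


def pass_at_n(outcomes: Dict[str, List[bool]]) -> float:
    """Assumed definition: fraction of tasks with at least one passing trial."""
    return sum(any(task_trials) for task_trials in outcomes.values()) / len(outcomes)


print(f"Pass@1: {pass_at_1(results):.1%}")
print(f"Pass@3: {pass_at_n(results):.1%}")
```

Under these definitions a model can trail on Pass@1 yet nearly match on Pass@3, which is the pattern reported here: it succeeds less often on any single attempt but still lands at least one success on about as many tasks.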
The results are noteworthy because GLM-5.2 has generated significant interest in recent weeks for delivering strong performance as an open model. Earlier this year, China’s GLM family had already begun climbing coding leaderboards, with GLM-5.1 becoming one of the highest-ranked open models on Code Arena.
According to Ramaswamy, one of the biggest differences between the models lies in how they approach tasks. GLM takes considerably more turns to complete work, averaging 99 turns compared to 80 for Opus. It also makes more execution-related tool calls, averaging 40 per trial against Opus’s 29.
That difference translates into token consumption. Across the benchmark run, GLM used 860 million billing tokens compared to Opus’s 439 million. Snowflake’s team attributed this to a combination of more conversational turns, more atomic API calls, and lower prompt-cache reuse rates.
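A quick back-of-the-envelope check, using only the figures reported above, shows why turn count alone cannot account for the token gap. The variable names are illustrative, and the comparison is a rough sketch rather than a reconstruction of Snowflake's accounting.

```python
# Figures as reported in the article (GLM-5.2 vs. Opus 4.7).
glm_turns, opus_turns = 99, 80          # average turns per trial
glm_calls, opus_calls = 40, 29          # average execution tool calls per trial
glm_tokens, opus_tokens = 860e6, 439e6  # billing tokens across the run

turn_ratio = glm_turns / opus_turns
call_ratio = glm_calls / opus_calls
token_ratio = glm_tokens / opus_tokens

print(f"Turns:  {turn_ratio:.2f}x")
print(f"Calls:  {call_ratio:.2f}x")
print(f"Tokens: {token_ratio:.2f}x")

# The token gap is larger than either the turn gap or the call gap, which is
# consistent with an additional factor such as lower prompt-cache reuse.
token_gap_exceeds_turn_gap = token_ratio > turn_ratio
token_gap_exceeds_call_gap = token_ratio > call_ratio
```

GLM takes roughly a quarter more turns and makes a little over a third more execution calls, yet consumes nearly twice the tokens, so the remaining difference has to come from elsewhere, in line with the lower cache reuse the team cited.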
The popular perception that GLM verifies its work more thoroughly was only partially supported by the data. The study found that GLM performs validation differently rather than necessarily performing more meaningful validation. It often executes individual SQL checks one at a time, while Opus bundles similar checks together. Both models end up covering similar ground, but their workflows look very different under the hood.
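The difference between one-at-a-time and bundled checks can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for DuckDB or Snowflake, and the table, columns, and checks are hypothetical; it is not taken from either model's actual output.

```python
import sqlite3

# In-memory database with a tiny hypothetical table to validate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 5.5)],
)

# Style 1: one check per query, so three separate executions (and, for an
# agent, three separate tool calls).
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
null_amounts = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount IS NULL"
).fetchone()[0]
distinct_ids = conn.execute("SELECT COUNT(DISTINCT id) FROM orders").fetchone()[0]
atomic = (row_count, null_amounts, distinct_ids)

# Style 2: the same three checks bundled into a single query.
bundled = conn.execute(
    """
    SELECT
        COUNT(*)                                         AS row_count,
        SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END)  AS null_amounts,
        COUNT(DISTINCT id)                               AS distinct_ids
    FROM orders
    """
).fetchone()

print("atomic: ", atomic)
print("bundled:", bundled)
conn.close()
```

Both styles return the same facts about the table. The bundled version simply gets there in one execution instead of three, which is why a higher call count does not by itself indicate more thorough validation.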
The findings also challenge another common assumption: that heavier verification automatically leads to better outputs. Despite GLM’s tendency to perform more checks, Opus still held a six-percentage-point advantage on Pass@1. As Ramaswamy put it, “more verification ≠ more correct.”
The area where GLM appeared to have a distinct advantage was cross-platform validation. The benchmark required solutions to work on both DuckDB and Snowflake. Snowflake’s team found that GLM was more consistent in validating against both targets, which explained several tasks that GLM solved successfully while Opus did not.
The post also highlighted two recurring failure modes. In some cases, GLM gave up too early when it couldn’t infer a solution path from available information. In one task cited by the team, the model performed five file reads across 22 turns but never attempted a write operation before stopping.
The opposite problem appeared in other tasks. One example saw GLM make 411 tool calls over 24 minutes while exhaustively checking row counts, distributions, null values, column types and platform parity. The task still failed in all three attempts. Opus completed the same task with 49 calls in nine minutes.
Interestingly, the “GLM uses twice as many calls” narrative turned out to be somewhat misleading. On tasks that both models solved successfully, GLM used only around 17 percent more calls. The large gap emerged primarily from difficult edge cases where the model entered lengthy verification loops.
The conclusion from Snowflake’s analysis was nuanced. Verification volume by itself was not a reliable predictor of success. Several of GLM’s worst failures came from spending enormous effort validating the wrong aspects of a task, while another category of failures stemmed from abandoning tasks prematurely.
Even so, Ramaswamy sounded optimistic about the model’s future. He said Snowflake was “super excited” about what GLM-5.2 represents and was looking forward to tuning Coco’s evaluation harness further and making the model available to customers.
The post offers a rare look at how frontier models behave beyond benchmark leaderboards. While aggregate scores often dominate discussion, Snowflake’s analysis shows that the path a model takes to reach those scores can reveal just as much about its strengths and weaknesses.