China is releasing models that perform very well on many benchmarks, but one benchmark remains beyond their reach.
The latest results from the ARC-AGI-2 Semi-Private leaderboard show a clear gap between Chinese AI models and leading US-based frontier labs. According to results published on March 2, 2026, Chinese-developed models scored 12 percent or lower, far behind the top-performing international models. Kimi K2.5, developed by Moonshot AI, reached 12 percent at a cost of $0.28 per task. MiniMax's M2.5 scored 5 percent at $0.17, GLM-5 from Zhipu AI also scored 5 percent at $0.27, and DeepSeek V3.2 achieved 4 percent at $0.12. All of these results fall below the scores US frontier labs had already posted in July 2025, showing that while Chinese AI firms have made rapid progress in training large-scale models, they continue to lag on tasks that test deeper reasoning and abstraction.

The ARC-AGI benchmark, short for the Abstraction and Reasoning Corpus for Artificial General Intelligence, is designed to evaluate a model's ability to generalize, reason conceptually, and solve unfamiliar problems. Unlike traditional benchmarks such as MMLU or GSM8K, which test structured knowledge or specific problem-solving ability, ARC-AGI focuses on emergent reasoning — the kind of intelligence associated with progress toward artificial general intelligence. The current leaderboard places Gemini 3.1 Pro (Preview) from Google DeepMind at the top with around 85 percent, followed by Claude Opus 4.6 (120K Max) from Anthropic at roughly 70 percent, and GPT-5.2 (High) at about 45 percent. These higher scores come at a significantly greater cost per task, typically between one and ten dollars, compared with the far cheaper Chinese entries.
ARC Prize organizers noted that Semi-Private testing is conducted only with providers that have trusted data retention agreements. This requirement currently excludes Qwen 3 Max Thinking, another leading Chinese model, from the ARC-AGI-2 leaderboard.
Some analysts suggest that these results reflect a fundamental difference in design priorities between US and Chinese AI developers. Anthropic CEO Dario Amodei has said that many Chinese AI models are tuned to perform well on specific benchmarks rather than on real-world applications. That focus may explain why they excel on certain standardized tests yet fall short on the broader reasoning tasks that ARC-AGI evaluates.
Overall, the ARC-AGI-2 results illustrate the growing divergence between efficiency-oriented model design and the pursuit of general intelligence. While Chinese models demonstrate strong cost performance and rapid iteration, US labs continue to lead in reasoning ability and abstraction, suggesting that the path to true general intelligence may depend less on scale and more on the sophistication of underlying reasoning architectures.