Alibaba’s Qwen 3.7 Max Becomes Highest-Placed Chinese Model On Artificial Analysis Index, Is Ahead Of Gemini 3.5 Flash

Chinese labs are competing among themselves over who can get closest to the US-led frontier.

Alibaba’s latest proprietary flagship, Qwen3.7 Max, has scored 56.6 on the Artificial Analysis Intelligence Index, placing it fifth overall and making it the highest-ranked Chinese model on the leaderboard. The score represents a 4.8-point gain over its predecessor, Qwen3.6 Max Preview (51.8), and puts it ahead of Google’s Gemini 3.5 Flash (55.3). GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2) still lead the pack.

The Intelligence Index v4.0 aggregates ten evaluations — including GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, Humanity’s Last Exam, and GPQA Diamond — making it one of the more comprehensive third-party benchmarks available.

Where The Gains Come From

The improvement is not uniform. Most of the Index gains are concentrated in scientific reasoning, agentic capability, and coding. CritPt rose 9.7 percentage points (3.7% to 13.4%), Humanity’s Last Exam jumped 9.2 points (28.9% to 38.1%), and Terminal-Bench Hard climbed 6.9 points (43.9% to 50.8%). GDPval-AA added 42 Elo points (1504 to 1546). Scores on other benchmarks in the index are largely flat compared to Qwen3.6 Max Preview.

One significant contributor to the overall gain is less flattering: higher abstention on AA-Omniscience. Qwen3.7 Max’s raw accuracy on this benchmark actually dropped 7.6 percentage points (37.7% to 30.1%), while its hallucination rate fell 21.3 points (44.2% to 22.9%). The model is choosing to say “I don’t know” more often rather than recalling more facts — a strategy that lifts the Index score without improving genuine knowledge. Its attempt rate fell from 67.3% to 48.0%, the lowest among frontier models in the comparison.

This is a meaningful distinction. The AA-Omniscience Index rewards correct answers and penalizes hallucinations, but has no penalty for refusing to answer. A model that abstains heavily can post strong hallucination numbers without getting smarter. Qwen3.7 Max currently holds the lowest hallucination rate among frontier models — but partly because it’s answering fewer questions.
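
The effect of abstention can be illustrated with a minimal sketch. It assumes, for illustration only, that the score is the share of correct answers minus the share of incorrect ones, with abstentions counting as zero, and that every attempted question is either right or wrong (partially correct answers are ignored). Neither assumption is confirmed here as Artificial Analysis's exact formula, and the function names are hypothetical. The inputs are the accuracy and attempt rates quoted above.

```python
def net_score(accuracy_pct, attempt_pct):
    """Assumed rule: +1 per correct answer, -1 per wrong answer, 0 per abstention."""
    # Assumption: every attempted question is either correct or incorrect.
    incorrect_pct = attempt_pct - accuracy_pct
    return accuracy_pct - incorrect_pct


def precision_when_answering(accuracy_pct, attempt_pct):
    """Share of attempted questions that were answered correctly, in percent."""
    return 100.0 * accuracy_pct / attempt_pct


# (accuracy %, attempt rate %) as reported in the article.
models = {
    "Qwen3.6 Max Preview": (37.7, 67.3),
    "Qwen3.7 Max": (30.1, 48.0),
}

for name, (accuracy, attempts) in models.items():
    score = net_score(accuracy, attempts)
    precision = precision_when_answering(accuracy, attempts)
    print(f"{name}: net score {score:.1f}, precision when answering {precision:.1f}%")
```

Under these assumptions the newer model comes out ahead even though it answers fewer questions correctly, because it gives far fewer wrong answers.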

The Closed-Weights Pattern Continues

Since Qwen2.5 Max in January 2025, Alibaba has followed a consistent strategy: release Max and Plus variants as closed-weights proprietary models while keeping the rest of the Qwen line open. The leading open-weights Qwen on the Intelligence Index is Qwen3.6 27B (Reasoning, 45.8), released in April 2026. The leading open-weights MoE Qwen is Qwen3.5 397B A17B (Reasoning, 45.0), from February 2026.

Qwen3.7 Max features a 1M-token context window, up from 256K on Qwen3.6 Max Preview, and supports text input and output only. Pricing has not yet been announced. Qwen3.6 Max Preview was priced at $1.30/$7.80 per million input/output tokens on Alibaba Cloud.

On token usage, Qwen3.7 Max consumed 96.7M output tokens running the Intelligence Index, about 31% more than Qwen3.6 Max Preview (73.9M). That puts it mid-pack among frontier models on token efficiency: it used more tokens than GPT-5.5 (44.5M) and Gemini 3.1 Pro Preview (57.3M), but fewer than Claude Opus 4.7 (112M), Kimi K2.6 (166M), and DeepSeek V4 Pro (187M).
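
The comparison can be reproduced with a short sketch that uses only the output-token figures quoted above; the variable and function names are illustrative.

```python
# Output tokens (in millions) used to run the Intelligence Index, as quoted in the article.
output_tokens_m = {
    "GPT-5.5": 44.5,
    "Gemini 3.1 Pro Preview": 57.3,
    "Qwen3.6 Max Preview": 73.9,
    "Qwen3.7 Max": 96.7,
    "Claude Opus 4.7": 112.0,
    "Kimi K2.6": 166.0,
    "DeepSeek V4 Pro": 187.0,
}


def relative_usage(model, baseline):
    """Ratio of a model's output tokens to a baseline model's output tokens."""
    return output_tokens_m[model] / output_tokens_m[baseline]


# Rank models from fewest to most output tokens.
ranked = sorted(output_tokens_m, key=output_tokens_m.get)
for name in ranked:
    ratio = relative_usage(name, "Qwen3.7 Max")
    print(f"{name}: {output_tokens_m[name]:.1f}M tokens ({ratio:.2f}x Qwen3.7 Max)")

increase_pct = (relative_usage("Qwen3.7 Max", "Qwen3.6 Max Preview") - 1) * 100
print(f"Increase over Qwen3.6 Max Preview: {increase_pct:.0f}%")
```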

Where Alibaba Stands In The China AI Race

Chinese labs have been closing the gap on US frontier models with growing speed, and Qwen3.7 Max is Alibaba’s strongest showing yet on a third-party benchmark aggregator. Still, it trails the lead models from all three US labs on the Index: OpenAI’s GPT-5.5 is 3.6 points ahead, while Anthropic’s Claude Opus 4.7 and Google’s Gemini 3.1 Pro Preview lead by 0.7 and 0.6 points respectively. DeepSeek, which first brought China’s AI capabilities to global attention, has itself acknowledged being 3-6 months behind the US frontier. After a period in which Qwen and Kimi pulled ahead of DeepSeek in certain rankings, DeepSeek V4 Pro returned to second place among open models, and the intra-China competition shows no sign of slowing.

For enterprise buyers, Qwen3.7 Max’s hallucination improvements — even if driven partly by abstention — are practically meaningful. A model that admits uncertainty is often more useful in production than one that confabulates confidently. The upcoming pricing announcement will determine whether those reliability gains translate into commercial traction.
