Alibaba Releases Qwen 3.5 Small Model Series, Achieves GPT-OSS-Level Performance With A Fraction Of The Parameters

China is continuing to push the frontier on what open-source models are capable of.

Alibaba’s Qwen team has unveiled the Qwen 3.5 Small Model Series, a new family of compact AI models spanning four sizes: 0.8B, 2B, 4B, and 9B parameters. The release also includes a base model at each size, a move the team says is intended to better support research, experimentation, and real-world industrial applications. The announcement underscores a growing trend in AI development: delivering competitive intelligence at dramatically lower compute cost.

Punching Well Above Their Weight

The headline story is efficiency. Qwen3.5-9B, the largest model in the small series, closes much of the performance gap with models an order of magnitude larger. Benchmark data released alongside the models shows Qwen3.5-9B matching or surpassing GPT-OSS-120B across multiple evaluations, including GPQA Diamond (81.7 vs. 71.5), HMMT Feb 2025 (83.2 vs. 76.7), MMMU-Pro (70.1 vs. 59.7), and ERQA (55.5 vs. 44.3), a remarkable result for a model a fraction of the size.

On the multilingual MMMLU benchmark, Qwen3.5-9B scores 81.2, edging out both GPT-OSS variants and coming within a tenth of a point of Qwen3-Next-80B-A3B-Thinking's 81.3. On document recognition and understanding (OmniDocBench v1.5), it leads the pack at 87.7.

A Tiered Architecture for Every Use Case

The Qwen 3.5 Small Series is designed with a clear tiered strategy. The 0.8B and 2B models are optimized for speed and minimal resource consumption, making them suitable for edge devices, on-device inference, and latency-sensitive applications. The 4B is positioned as a capable multimodal base model for lightweight AI agents — offering a balance between capability and footprint that few models at this size have previously achieved. The 9B is the series’ flagship compact model, delivering performance that would have been unthinkable at this parameter count even a year ago.

All models are built on the Qwen3.5 foundation architecture, which features native multimodal support, architectural refinements over Qwen3, and scaling through reinforcement learning, the same training approach credited with major capability gains in frontier-scale models.

Elon Musk Takes Notice

The release drew attention from an unexpected corner of the tech world. Elon Musk — who has been notably critical of OpenAI and Anthropic in recent months, often questioning the capabilities or direction of their models — offered a succinct endorsement of the Qwen 3.5 Small Series on X, posting simply: “Impressive intelligence density.”

The comment, brief as it was, carried weight. “Intelligence density” — the ratio of capability to model size — is precisely what Alibaba’s Qwen team set out to demonstrate with this release. Coming from someone who rarely misses an opportunity to question competitors’ AI progress, the praise is a notable signal of how the broader AI community is receiving these results.
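Taking the phrase literally, one can sketch a back-of-envelope density figure from the GPQA Diamond scores reported above and each model's nominal parameter count. This is a rough proxy only: it takes the "120B" in GPT-OSS-120B at face value and ignores architectural differences such as mixture-of-experts routing.

```python
# "Intelligence density" read literally as benchmark score per billion
# parameters, using the GPQA Diamond numbers from the announcement.
def density(score: float, params_b: float) -> float:
    """Benchmark points per billion parameters."""
    return score / params_b

qwen = density(81.7, 9)       # Qwen3.5-9B
gpt_oss = density(71.5, 120)  # GPT-OSS-120B

print(f"Qwen3.5-9B:   {qwen:.2f} points per B params")
print(f"GPT-OSS-120B: {gpt_oss:.2f} points per B params")
print(f"Ratio: {qwen / gpt_oss:.1f}x")
```

By this crude measure the 9B model delivers roughly fifteen times the score per parameter, which is the kind of ratio the phrase "intelligence density" is gesturing at.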

Not everyone is convinced by such claims, however. Anthropic's Dario Amodei recently argued that Chinese models are tailored to perform well on benchmarks but are not quite as impressive in real-world use.

The Open-Source Calculus

The release of base models alongside the instruction-tuned variants is a deliberate move by Alibaba to deepen the open-source ecosystem around Qwen. Base models give researchers and developers the raw foundation to fine-tune for specialized tasks, build proprietary applications, or conduct academic work — without the constraints of instruction-tuned behavior.

The broader implication is clear: the gap between open-source and closed proprietary models continues to narrow, and China’s AI labs are playing a central role in driving that convergence. For enterprises evaluating AI deployment — particularly those with cost, latency, or data sovereignty constraints — the Qwen 3.5 Small Series represents a compelling new option. Frontier-level reasoning at a fraction of the compute bill is no longer a theoretical promise. It’s a benchmark result.
