Google Announces TurboQuant, a New Compression Algorithm That Reduces LLM Memory Requirements by 6x and Increases Speed by 8x

Even as the pace of AI progress surprises one and all, companies keep delivering improvements that could accelerate things even further.

Google has announced TurboQuant, a new compression algorithm that dramatically reduces the memory footprint of large language models (LLMs) while simultaneously increasing their speed. The company claims TurboQuant can shrink key-value (KV) cache memory requirements by up to 6x and deliver performance gains of up to 8x — a combination that could materially change the economics of running AI systems at scale.

At its core, TurboQuant tackles one of the most persistent bottlenecks in modern AI: the explosion of memory usage driven by high-dimensional vectors. These vectors power everything from semantic search to attention mechanisms in LLMs, but they come at a steep cost in both storage and compute. Traditional compression techniques, while helpful, introduce their own inefficiencies — particularly in the form of metadata overhead.

TurboQuant sidesteps these trade-offs through a combination of two new techniques: PolarQuant and Quantized Johnson-Lindenstrauss (QJL). PolarQuant restructures vector data into a polar coordinate system, allowing more efficient representation without the normalization overhead typical in standard approaches. QJL, meanwhile, reduces residual errors using a one-bit representation, effectively eliminating additional memory costs while preserving accuracy.
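
To make the idea concrete, here is a minimal toy sketch in NumPy of both steps: grouping coordinates into quantized polar pairs, then correcting the reconstruction with a one-bit (sign) residual plus a single shared scale. The bit widths, the pairing scheme, and the residual rule here are illustrative assumptions, not Google's published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
ANGLE_BITS, RADIUS_BITS = 4, 4  # illustrative choices, not TurboQuant's

def polar_encode(v, r_max):
    """Quantize consecutive (x, y) pairs of v as a radius and an angle."""
    pairs = v.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])            # in (-pi, pi]
    r_q = np.round(r / r_max * (2**RADIUS_BITS - 1))
    t_q = np.round((theta + np.pi) / (2 * np.pi) * (2**ANGLE_BITS - 1))
    return r_q, t_q

def polar_decode(r_q, t_q, r_max):
    r = r_q / (2**RADIUS_BITS - 1) * r_max
    theta = t_q / (2**ANGLE_BITS - 1) * 2 * np.pi - np.pi
    xy = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return xy.reshape(-1)

v = rng.standard_normal(64)
r_max = np.linalg.norm(v.reshape(-1, 2), axis=1).max()

v_hat = polar_decode(*polar_encode(v, r_max), r_max)
base_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)

# One-bit residual correction: keep only the sign of each residual
# coordinate plus one shared scale (the mean absolute residual).
residual = v - v_hat
scale = np.abs(residual).mean()
v_refined = v_hat + scale * np.sign(residual)
refined_err = np.linalg.norm(v - v_refined) / np.linalg.norm(v)
```

Because the correction stores only sign bits and a single scalar, it costs one extra bit per coordinate with essentially no metadata, which is the spirit of how such schemes preserve accuracy cheaply.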

The result is compression down to as little as 3 bits per value — without retraining models or sacrificing output quality. In benchmarks across long-context tasks, including question answering and code generation, TurboQuant maintained accuracy while significantly reducing memory usage. It also accelerated attention computations, a key component of LLM inference.
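
The memory arithmetic is easy to check. Using a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128; these shapes are an illustrative assumption, not from Google's announcement), the KV cache at a 32k-token context compares as follows:

```python
# KV-cache sizing: 2 tensors (K and V) per layer, one head_dim vector
# per token per head. Model shape is a Llama-2-7B-like assumption.
layers, kv_heads, head_dim = 32, 32, 128
seq_len = 32_768

def kv_cache_gib(bits_per_value):
    values = 2 * layers * kv_heads * head_dim * seq_len
    return values * bits_per_value / 8 / 2**30

fp16_gib = kv_cache_gib(16)   # 16.0 GiB
q3_gib = kv_cache_gib(3)      # 3.0 GiB
print(f"fp16: {fp16_gib:.1f} GiB  3-bit: {q3_gib:.1f} GiB  "
      f"ratio: {fp16_gib / q3_gib:.1f}x")
```

Going from 16-bit floats to 3 bits per value is a roughly 5.3x reduction, in the same ballpark as the claimed up-to-6x figure.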

This has immediate implications for both AI infrastructure and business models. Memory — especially high-bandwidth memory used in AI accelerators — has been one of the biggest cost drivers in deploying large-scale models. By slashing memory requirements, TurboQuant could reduce the need for expensive hardware, improve latency, and enable more efficient scaling.

Markets reacted swiftly. Shares of memory manufacturers, including Micron Technology, fell following the announcement, reflecting concerns that breakthroughs in algorithmic efficiency could dampen long-term demand for high-end memory. If models can do more with less, the pricing power of memory suppliers could come under pressure.

Beyond LLMs, Google is positioning TurboQuant as foundational infrastructure for the next generation of search. As search shifts from keyword-based queries to semantic understanding powered by vector embeddings, the ability to store and query massive vector databases efficiently becomes critical. TurboQuant enables faster indexing and retrieval with minimal preprocessing, potentially reshaping how large-scale search systems are built.
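
As a toy illustration of why compact codes matter for search (a generic sign-bit scheme in the JL family, not TurboQuant itself): once vectors are binarized, candidates can be compared with cheap Hamming distances instead of full floating-point dot products.

```python
import numpy as np

rng = np.random.default_rng(1)
D, BITS, N = 64, 256, 10_000

S = rng.standard_normal((BITS, D))           # random JL-style projection

def sign_code(x):
    """1 bit per projection: 256 bits = 32 bytes per vector."""
    return (x @ S.T) >= 0

corpus = rng.standard_normal((N, D))
codes = sign_code(corpus)                    # N x BITS boolean matrix

query = corpus[7]                            # query identical to item 7
q_code = sign_code(query)
hamming = (codes != q_code).sum(axis=1)      # distance to every item
best = int(np.argmin(hamming))               # -> 7
```

At these illustrative sizes, each index entry shrinks from 256 bytes of fp32 to 32 bytes of code, and the bit budget is a tunable trade-off between index size and retrieval accuracy.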

The broader takeaway is clear: AI progress is no longer just about bigger models. Increasingly, it’s about smarter math. And if TurboQuant delivers in production what it promises in benchmarks, it could quietly become one of the most consequential advances in AI infrastructure this year.

Posted in AI