AI models aren’t just getting cheaper and more capable; algorithmic advances are also making them faster.
Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 model family, delivering up to a 3x speedup in token generation without any degradation in output quality. The release addresses one of the most persistent frustrations in deploying large language models: latency.

The Bottleneck Problem
Standard LLM inference is memory-bandwidth bound. For every token the model generates, it must move billions of parameters from VRAM to the compute units, a process that leaves the processor significantly underutilized and adds latency at every step. The problem is particularly acute on consumer hardware, where memory bandwidth is limited.
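A back-of-envelope estimate shows why this bound bites: if every generated token has to stream the full weight set from memory once, decode speed can't exceed memory bandwidth divided by model size. A minimal sketch, using illustrative numbers rather than measured Gemma figures:

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound model.
# Assumption: each generated token streams the full weight set once,
# so tokens/sec is capped at bandwidth / model_bytes.

def decode_ceiling(params_billion: float, bytes_per_param: float,
                   bandwidth_gb_s: float) -> float:
    """Upper bound on tokens per second for autoregressive decoding."""
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

# Hypothetical numbers: a 4B-parameter model at 8-bit (1 byte/param).
print(decode_ceiling(4, 1.0, 50))    # phone-class chip, ~50 GB/s  -> 12.5 tok/s
print(decode_ceiling(4, 1.0, 2000))  # datacenter GPU, ~2 TB/s     -> 500 tok/s
```

At these rates the compute units sit mostly idle, which is exactly the headroom speculative decoding exploits: verifying several tokens in one pass extracts more work from each weight load.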
Google’s MTP drafters address this through speculative decoding, a technique that decouples token generation from verification. A lightweight drafter model predicts several tokens ahead, and the heavier target model (say, Gemma 4 31B) verifies all of them in parallel in a single forward pass. Wherever the target model agrees with the draft, it accepts those tokens, and it gets one more token of its own from the same pass: a correction at the first disagreement, or a bonus token after a full accept. The net effect: output that would normally require multiple sequential passes arrives in roughly the time of one.
The draft models share the target model’s KV cache and leverage its existing activations, meaning they don’t redundantly recompute context. This tight architectural integration is what separates Gemma 4’s approach from looser speculative decoding implementations.
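In pseudocode, the draft-and-verify loop looks roughly like the sketch below. This is a minimal illustration of speculative decoding in general, not Google's implementation: the helper names `draft_next_tokens` and `target_forward` are hypothetical stand-ins, the acceptance rule shown is the simple greedy-match variant, and the KV-cache sharing described above is omitted for clarity.

```python
# Minimal sketch of a speculative decoding loop with greedy acceptance.
# draft_next_tokens and target_forward are hypothetical stand-ins for
# the drafter and target model calls.

def speculative_decode(prompt_ids, k=4, max_new_tokens=128):
    ids = list(prompt_ids)
    generated = 0
    while generated < max_new_tokens:
        # 1. The cheap drafter proposes k tokens.
        draft = draft_next_tokens(ids, k)
        # 2. The target predicts the next token at every position of
        #    ids + draft in a single forward pass.
        preds = target_forward(ids + draft)  # preds[i]: token after position i
        # 3. Accept the longest draft prefix the target agrees with.
        n_accept = 0
        while n_accept < k and draft[n_accept] == preds[len(ids) + n_accept - 1]:
            n_accept += 1
        ids += draft[:n_accept]
        # 4. The target's own next token (a correction at the first
        #    disagreement, or a bonus token after a full accept) comes
        #    free from the same pass.
        ids.append(preds[len(ids) - 1])
        generated += n_accept + 1
    return ids[:len(prompt_ids) + max_new_tokens]
```

Because the target checks every draft position at once, a run of four accepted tokens costs roughly one target forward pass instead of four, and the greedy acceptance rule guarantees the output matches what the target would have produced on its own.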
Why Speed Matters More at the Edge
The performance gains are compelling across the board, but they’re especially consequential for on-device use cases — which is precisely where Google has been placing its bets with Gemma 4.
The Google AI Edge Gallery app recently climbed to #8 among productivity apps on Apple’s App Store, signaling genuine mainstream interest in running AI locally. Gemma 4’s edge-optimized variants, the Effective 2B (E2B) and Effective 4B (E4B), are designed specifically for smartphones, running in under 1.5GB of memory in some configurations. They support a 128K context window, multimodal input, and over 140 languages, all entirely offline.
For these models, speed is a functional constraint. On-device AI competes directly with cloud-based alternatives on two dimensions: capability and responsiveness. Cloud models can always throw more compute at a problem; on-device models have to be fast with what they have. MTP drafters help close that gap meaningfully.
Google notes that for the E2B and E4B models, the final logit calculation represents a disproportionate share of compute time. To address this specifically, the company implemented an efficient clustering technique in the embedder — an optimization targeted directly at the edge inference bottleneck.
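Google hasn't published the details of that clustering, but the general shape of such an optimization can be sketched: instead of computing a logit against every row of a large output-embedding matrix, group the embeddings into clusters offline, score only the cluster centroids at decode time, and compute exact logits just for tokens in the best-scoring clusters. The code below is an illustrative assumption along those lines, not Google's actual method.

```python
import numpy as np

# Illustrative two-stage logit computation via embedding clusters.
# An assumption for clarity, not Google's published method: output
# embeddings are grouped into C clusters offline (e.g. with k-means);
# at decode time we score the C centroids first, then compute exact
# logits only for tokens in the best-scoring clusters.

def clustered_logits(hidden, emb, centroids, cluster_ids, top_c=8):
    """hidden: (d,), emb: (V, d), centroids: (C, d), cluster_ids: (V,)."""
    cluster_scores = centroids @ hidden            # (C,): cheap first stage
    keep = np.argsort(cluster_scores)[-top_c:]     # indices of best clusters
    mask = np.isin(cluster_ids, keep)              # candidate tokens only
    logits = np.full(emb.shape[0], -np.inf)        # everything else pruned
    logits[mask] = emb[mask] @ hidden              # exact logits on the subset
    return logits
```

With a vocabulary in the hundreds of thousands of tokens and, say, 1,024 clusters, the first stage is orders of magnitude cheaper than a full logit pass, and the second touches only a small fraction of the embedding matrix.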
Battery life is the other variable. Faster generation can mean fewer active compute cycles per task, which translates directly to lower power consumption. For mobile AI, where a model that drains a phone battery will simply stop being used, this could be a critical practical advantage.
Agentic Workflows Benefit Too
Beyond passive chat, the speed gains are significant for agentic use cases. The AI Edge Gallery already supports multi-step agentic workflows running entirely on-device — tasks where the model uses tools like Wikipedia and maps autonomously, without a network connection. In these settings, latency compounds: a 200ms delay per step becomes seconds of waiting across a multi-turn agent loop. Faster inference makes these workflows feel less like waiting for a batch job and more like a responsive tool.
The same logic applies to coding assistants and voice applications, where response cadence shapes the entire user experience.
Hardware-Specific Gains
Google has been careful to calibrate expectations by hardware. On Apple Silicon, the 26B Mixture-of-Experts model presents routing challenges at batch size 1, but processing multiple requests simultaneously, at batch sizes of 4 to 8, unlocks roughly a 2.2x speedup. Similar gains appear on NVIDIA A100s at higher batch sizes. On an NVIDIA RTX PRO 6000, MTP drafters cut wait times roughly in half.
Availability
The MTP drafters are available now under the same Apache 2.0 license as the base Gemma 4 models — the commercially permissive licensing that Hugging Face co-founder Clément Delangue called a significant milestone when it was first announced. Weights are available on Hugging Face and Kaggle, with framework support across Transformers, MLX, vLLM, SGLang, and Ollama.
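For a sense of what adoption can look like in practice, here is a hedged sketch using Transformers' assisted-generation path, where passing an `assistant_model` to `generate()` enables speculative decoding. The checkpoint IDs below are placeholders, not confirmed repo names; check the actual names on the release pages.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint IDs; substitute the actual repo names
# from the Hugging Face release.
TARGET_ID = "google/gemma-4-31b-it"        # hypothetical
DRAFTER_ID = "google/gemma-4-31b-drafter"  # hypothetical

tok = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, torch_dtype=torch.bfloat16, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFTER_ID, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok("Explain speculative decoding in one paragraph.",
             return_tensors="pt").to(target.device)

# Passing assistant_model switches generate() to assisted (speculative)
# decoding; the drafter must share the target's tokenizer.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
```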
For developers who have been waiting for on-device AI to be practical rather than merely impressive, faster Gemma 4 inference could be a meaningful step forward.