Even as frontier models keep getting larger, some companies are moving in the opposite direction, with impressive results.
Caltech-originated AI lab PrismML emerged from stealth this week, open-sourcing a family of 1-bit language models under the Apache 2.0 license. The flagship, 1-bit Bonsai 8B, packs 8.2 billion parameters into just 1.15 GB of memory, compared to the roughly 16 GB a standard FP16 model of the same parameter count requires. PrismML says it runs 8x faster than a full-precision equivalent, uses 4–5x less energy on edge hardware, and benchmarks competitively against full-size 8B models including Llama 3.1 8B, LFM2 8B, and Hermes 3 8B.
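The headline figures follow directly from the arithmetic of bits per weight. A back-of-the-envelope check (the small gap between the raw 1-bit figure and the reported 1.15 GB presumably reflects packing overhead and metadata):

```python
# Back-of-the-envelope memory math for an 8.2B-parameter model.

params = 8.2e9

# FP16 stores each weight in 2 bytes.
fp16_gb = params * 2 / 1e9     # ~16.4 GB

# A 1-bit format stores 8 weights per byte.
onebit_gb = params / 8 / 1e9   # ~1.03 GB

print(f"FP16:      {fp16_gb:.1f} GB")
print(f"1-bit:     {onebit_gb:.2f} GB")
print(f"Reduction: {fp16_gb / onebit_gb:.0f}x")
```

The raw ratio is 16x; the ~14x figure cited later in this article corresponds to the actual on-disk sizes (16.38 GB vs. 1.15 GB), which include overhead beyond the weights themselves.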
The company was co-founded by Babak Hassibi, Sahin Lale, Omead Pooladzandi, and Reza Sadri — researchers with roots in Caltech’s mathematics and computer science departments.
What “1-bit” Actually Means
Standard LLMs store each weight — the numerical values that encode the model’s learned knowledge — in 16-bit or 32-bit floating point format. That precision is expensive: more bits per weight means more memory, more bandwidth, and more power at inference time.
A 1-bit model reduces each weight to one of three values, {-1, 0, +1} (strictly speaking that is about 1.58 bits per weight, since log2(3) ≈ 1.58, though such models are conventionally called 1-bit). The concept isn't new, but applying it across an entire network, from the embeddings through the attention and MLP layers to the LM head, without higher-precision escape hatches has historically meant severe performance degradation. PrismML claims to have solved that with a proprietary training and quantization approach that preserves capability through extreme compression.
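PrismML has not published its method, but the basic ternarization step can be sketched with the "absmean" scheme popularized by prior 1-bit work such as BitNet b1.58. To be clear, this is an illustrative assumption, not PrismML's recipe:

```python
import numpy as np

def ternarize(w: np.ndarray):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor scale.

    Absmean scheme: divide by the mean absolute weight, round to the
    nearest integer, and clip to [-1, 1]. Dequantize as q * scale.
    """
    scale = np.mean(np.abs(w)) + 1e-8        # per-tensor scaling factor
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes
    return q.astype(np.int8), scale

# Example: ternarize a small random weight matrix.
w = np.random.randn(4, 4).astype(np.float32)
q, s = ternarize(w)
w_hat = q * s  # the ternary reconstruction used at inference time
```

In practice the trick is not this rounding step but keeping the model accurate afterward, which is why naive post-training ternarization degrades quality and why PrismML emphasizes its training procedure.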
The result is a model identical in architecture and parameter count to its full-precision peers but occupying a fraction of the storage and compute footprint. Google recently announced a similar efficiency push with TurboQuant, a compression algorithm that cuts LLM memory by 6x, though even that falls well short of the 14x reduction PrismML is claiming.
The Intelligence Density Argument
PrismML frames its competitive advantage around a metric it calls intelligence density: the negative log of the model’s error rate divided by model size in GB. By this measure, 1-bit Bonsai 8B scores 1.06 per GB, versus ~0.096 for the next closest model, Qwen 3 8B — over 10x higher.

On raw benchmark performance, Bonsai 8B scores an average of 70.5 across IFEval, GSM8K, HumanEval+, BFCL, MuSR, and MMLU-Redux. That puts it above Llama 3.1 8B (67.1) and LFM2 8B (69.6), and close to Olmo 3 7B (70.9) and Ministral3 8B (71.0) — all of which are 14x larger in memory footprint. The top-ranked model in the benchmark comparison, Qwen 3 8B, scores 79.3 but requires 16.38 GB.
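The published density figures can be reproduced from the benchmark averages if "error rate" is read as one minus the average score and the log is natural. PrismML does not spell out either convention, so this interpretation is an assumption, but the numbers check out:

```python
import math

def intelligence_density(avg_score_pct: float, size_gb: float) -> float:
    """-log(error rate) per GB, with error = 1 - average benchmark accuracy."""
    error = 1 - avg_score_pct / 100
    return -math.log(error) / size_gb

print(intelligence_density(70.5, 1.15))   # Bonsai 8B: ~1.06
print(intelligence_density(79.3, 16.38))  # Qwen 3 8B: ~0.096
```

Note the structure of the metric: the log compresses differences in accuracy, while size enters linearly, so a 14x size advantage dominates an ~9-point benchmark deficit.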

PrismML also released two smaller variants: Bonsai 4B (0.57 GB, ~130 tokens/sec on an M4 Pro) and Bonsai 1.7B (0.24 GB, ~130 tokens/sec on an iPhone). PrismML's scatter plot of performance vs. model size shows the Bonsai family defining an entirely new Pareto frontier, achieving benchmark scores comparable to models 10–15x their size.
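A quick sanity check on those footprints: taking the parameter counts from the model names (so "4B" is assumed to mean 4.0 billion exactly), all three variants land at roughly the same effective bits per parameter, slightly above 1, consistent with a uniform low-bit weight format plus modest overhead:

```python
# Effective bits per parameter: memory (bytes) * 8 / parameter count.
variants = {
    "Bonsai 8B":   (1.15, 8.2e9),  # (size in GB, parameters)
    "Bonsai 4B":   (0.57, 4.0e9),
    "Bonsai 1.7B": (0.24, 1.7e9),
}
bits_per_param = {
    name: gb * 1e9 * 8 / params for name, (gb, params) in variants.items()
}
for name, bits in bits_per_param.items():
    print(f"{name}: {bits:.2f} bits/parameter")
```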
Why This Matters
The practical implication is straightforward: if models this capable can run in under 1.5 GB, they can run on a phone, a laptop, an embedded device — without a cloud call. That changes the economics and architecture of AI deployment. AI agents that run locally no longer require always-on connectivity or expensive inference infrastructure.
On an iPhone 17 Pro Max, Bonsai 8B reportedly runs at ~44 tokens per second — fast enough for real-time interaction. For robotics, wearables, or offline applications, those numbers matter considerably more than cloud benchmark rankings.
PrismML is headquartered in Pasadena, with job openings in both Pasadena and San Francisco. Models are available on Hugging Face under the prism-ml namespace.