PrismML Releases Ternary Bonsai That Uses Just 1.58 Bits To Store Weights, Achieves 9x Memory Reduction With Comparable Performance

Even as conventional LLMs become ever more powerful, some interesting new approaches are also producing impressive results.

PrismML, a startup spun out of Caltech and backed by Khosla Ventures and Google, quietly released something worth paying attention to this week: Ternary Bonsai, a family of language models that achieve near-frontier performance at a fraction of the memory cost. In a field where the default assumption is that bigger models mean better results, Ternary Bonsai makes a compelling case for the opposite.

What Makes It Different

Most production language models store each weight as a 16-bit floating-point number. Ternary Bonsai strips that down to just 1.58 bits per weight (log₂ 3 ≈ 1.58, the information content of a three-valued symbol) — meaning every parameter in the network can only be one of three values: -1, 0, or +1. That’s it. No mid-range floats, no escape hatches to higher precision for “important” layers. The entire network — embeddings, attention, MLPs, the language model head — uses this ternary representation throughout.
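
How do you actually store three-valued weights at roughly 1.58 bits each? PrismML hasn't published its storage format, but a common scheme packs five ternary digits into one byte (3⁵ = 243 ≤ 256), which works out to 1.6 bits per weight in practice. A sketch of that base-3 packing, offered as an assumption rather than PrismML's confirmed layout:

```python
import numpy as np

def pack_ternary(trits: np.ndarray) -> np.ndarray:
    """Pack ternary digits {-1, 0, +1} five-per-byte (3**5 = 243 <= 256).

    This base-3 packing is a common scheme for ternary weights,
    not necessarily the one Ternary Bonsai uses.
    """
    t = (trits + 1).astype(np.uint8)          # map {-1, 0, +1} -> {0, 1, 2}
    assert t.size % 5 == 0, "pad to a multiple of 5 trits first"
    t = t.reshape(-1, 5)
    powers = 3 ** np.arange(5)                 # [1, 3, 9, 27, 81]
    return (t * powers).sum(axis=1).astype(np.uint8)  # max value 242

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Recover the original ternary digits from packed bytes."""
    digits = packed[:, None] // (3 ** np.arange(5)) % 3
    return digits.astype(np.int8).reshape(-1) - 1

trits = np.array([-1, 0, 1, 1, -1, 0, 0, 1, -1, 0], dtype=np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(trits)), trits)
```

Ten trits compress to two bytes here; at scale, 8 bits shared across 5 weights is where the "nearly as low as binary" storage cost comes from.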

To preserve useful signal despite such extreme compression, the model uses group-wise quantization: for every 128 weights, a shared FP16 scale factor s is stored, so each weight is effectively {-s, 0, +s}. This lets the model adapt its effective weight magnitude to different parts of the network, while keeping the per-weight storage cost nearly as low as binary.
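
PrismML hasn't released its quantization code, so the sketch below only illustrates group-wise ternary quantization as described above, assuming an absmean-style scale (one FP16 scale per 128 weights) and a round-to-nearest threshold — both common choices in ternary-quantization work, not confirmed details of Ternary Bonsai:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, group_size: int = 128):
    """Group-wise ternary quantization sketch.

    Splits a flat weight vector into groups of `group_size`, stores one
    FP16 scale per group, and maps each weight to {-1, 0, +1}. The
    absmean scale and round-to-nearest threshold are assumptions;
    PrismML has not published its exact recipe.
    """
    w = w.reshape(-1, group_size)
    # One shared FP16 scale s per group of 128 weights
    scales = np.abs(w).mean(axis=1, keepdims=True).astype(np.float16)
    # Round w/s to the nearest of {-1, 0, +1}
    q = np.clip(np.round(w / scales.astype(np.float32)), -1, 1).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Each weight is effectively {-s, 0, +s}
    return q.astype(np.float32) * scales.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
q, s = ternary_quantize(w)            # q: (32, 128) int8, s: (32, 1) float16
assert set(np.unique(q).tolist()) <= {-1, 0, 1}
```

Dequantizing a group reproduces the {-s, 0, +s} values described above, with each group's scale tuned to that group's typical weight magnitude.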

The result is a 9x reduction in memory footprint compared to standard 16-bit models.
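
That figure follows from simple arithmetic: 1.58 bits per ternary weight, plus one 16-bit scale amortized over each group of 128 weights, comes to about 1.705 bits per weight, roughly a 9x saving versus 16-bit storage:

```python
# Effective storage: 1.58 bits for the ternary value, plus one FP16
# scale (16 bits) shared across each group of 128 weights.
bits_per_weight = 1.58 + 16 / 128        # ≈ 1.705 bits
reduction = 16 / bits_per_weight         # vs. FP16, ≈ 9.4x
gb_8b = 8e9 * bits_per_weight / 8 / 1e9  # 8B parameters in GB
print(f"{reduction:.1f}x reduction, about {gb_8b:.1f} GB for 8B params")
```

The same arithmetic puts an 8B-parameter model at about 1.7 GB of raw weight storage, in line with the 1.75 GB reported for Ternary Bonsai 8B below.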

The Numbers

The 8B version of Ternary Bonsai fits in just 1.75 GB — smaller than many smartphone apps — and scores 75.5 on average across six benchmarks (MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, BFCLv3). For context, that puts it ahead of RNJ 8B (73.1), Ministral3 8B (71.0), Llama 3.1 8B (67.1), and a dozen other models — all of which require 14–18 GB of memory.

Only Qwen3 8B, at 79.3 average and 16.38 GB, beats it in the 8B class.

The intelligence density chart tells the starkest story. While most 8B models cluster between 0.05 and 0.10 per GB, Ternary Bonsai 8B scores 0.803 — roughly 10x better than Qwen3 8B, and second only to the even-more-compressed 1-bit Bonsai 8B at 1.060.

On-Device Performance

The implications for edge AI deployment are significant. On an M4 Pro Mac, Ternary Bonsai 8B runs at 82 tokens per second — around 5x faster than a standard 16-bit 8B model. On an iPhone 17 Pro Max, it hits 27 tokens/sec, with energy consumption of just 0.132 mWh per token. That’s 3–4x more energy efficient than full-precision alternatives.

These are the kinds of numbers that make on-device AI — no cloud call, no network latency, no data leaving the device — genuinely practical. The models run natively on Apple hardware via MLX and are available today under the Apache 2.0 License.

A New Point on the Curve

Ternary Bonsai is not a replacement for the company’s earlier 1-bit Bonsai family. Where absolute minimum footprint matters most, 1-bit still wins. But Ternary Bonsai offers a different tradeoff: a 600 MB increase in size buys a 5-point jump in average benchmark score. Across the 1.7B, 4B, and 8B variants, that tradeoff scales predictably, giving developers a real menu of options rather than a binary choice.

At a time when AI model competition is largely defined by who can spend more on compute and data, PrismML is making a different bet: that extreme compression, done right, can be its own competitive advantage. The early results suggest they might be onto something.

Posted in AI