[AI Minor News]

The 1.58-bit Revolution! Introducing "Ternary Bonsai" – 8B Model Runs at Just 1.75GB



※ This article contains affiliate advertising.

📰 News Summary

  • A new model with 1.58-bit (ternary) representation: PrismML has released the ‘Ternary Bonsai’ family (8B, 4B, 1.7B). By constraining weights to {-1, 0, +1}, it shrinks memory use to roughly one-ninth of a standard 16-bit model.
  • Extreme Compression Without Sacrificing Accuracy: The average benchmark score improves by 5 points over the previous 1-bit model. The 8B model (1.75GB) reaches an average score of 75.5, rivaling the performance of Qwen3 8B, whose 16-bit weights occupy roughly ten times the memory.
  • Lightning-Fast Native Performance on Apple Devices: Achieving 82 toks/sec on the M4 Pro chip and 27 toks/sec on the iPhone 17 Pro Max, this model boasts energy efficiency improvements of 3 to 4 times over previous designs.
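The memory figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes all 8B weights are ternary, packed at 1.6 bits per weight (five ternary values per byte, since 3^5 = 243 ≤ 256), with one FP16 scale per 128-weight group; the packing layout is a common convention, not PrismML's confirmed on-disk format:

```python
import math

PARAMS = 8e9   # 8B weights, all ternary
GROUP = 128    # weights sharing one FP16 scale factor

# Information content of one ternary weight
bits_per_weight = math.log2(3)   # ~1.585 bits

# Assumed practical packing: 5 ternary values per byte -> 1.6 bits/weight
packed_bits = 8 / 5

weight_gb = PARAMS * packed_bits / 8 / 1e9   # packed ternary weights
scale_gb = (PARAMS / GROUP) * 2 / 1e9        # FP16 scales, 2 bytes each
fp16_gb = PARAMS * 2 / 1e9                   # 16-bit baseline, 2 bytes/weight

print(f"theoretical bits/weight: {bits_per_weight:.3f}")
print(f"ternary model: {weight_gb + scale_gb:.2f} GB")
print(f"fp16 baseline: {fp16_gb:.1f} GB (~{fp16_gb / (weight_gb + scale_gb):.1f}x larger)")
```

Under these assumptions the total comes out to about 1.73GB, closely matching the reported 1.75GB, and the 16-bit baseline is roughly nine times larger, matching the headline reduction.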

💡 Key Highlights

  • No Escape Hatches, Fully Quantized: The entire network uses the 1.58-bit representation consistently, from the embeddings and attention to the MLP and LM head; no layer is left in higher precision.
  • Group-Level Quantization Scheme: Each group of 128 weights shares a single FP16 scale factor while the weights themselves are encoded in 1.58 bits, keeping per-weight overhead tiny while maintaining high intelligence density.
  • Released Under Apache 2.0 License: Model weights are open-source and immediately available on Mac, iPhone, and iPad through MLX.
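The group-level scheme described above can be sketched as follows. Note this is a minimal illustration: the article does not publish Ternary Bonsai's actual quantizer, so the absmean-style rounding rule here is an assumption borrowed from common ternary-quantization recipes:

```python
import numpy as np

def quantize_ternary(w: np.ndarray, group_size: int = 128):
    """Quantize a 1-D weight vector to {-1, 0, +1} with one scale per group."""
    w = w.reshape(-1, group_size)
    # Assumed absmean scaling, typical of 1.58-bit recipes (not confirmed here)
    scales = np.abs(w).mean(axis=1, keepdims=True) + 1e-8
    q = np.clip(np.round(w / scales), -1, 1).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate FP32 weights from ternary codes and group scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1024).astype(np.float32)
q, s = quantize_ternary(w)
print("unique codes:", np.unique(q))            # subset of {-1, 0, 1}
print("reconstruction MSE:", float(np.mean((dequantize(q, s) - w) ** 2)))
```

The storage cost is then 1.58 bits per weight plus 16 bits per 128-weight group for the scale, i.e. an overhead of only 0.125 bits per weight.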

🦈 Shark’s Perspective (Curator’s Insights)

The “Ternary Bonsai” is truly a game changer, redefining the physical limits of local AI! What’s fascinating is the clever choice of “1.58 bits”: a seemingly odd number that captures nuance a mere 1 bit (binary) never could. Spending that extra 0.58 bits to gain a zero state is a genius move! And by leaving no high-precision escape hatches, sticking to low bits across every layer, PrismML shows real dedication! With this kind of performance, who needs the cloud? The era of the iPhone 17 Pro Max delivering server-grade intelligence at lightning speed is here!

🚀 What’s Next?

The standard for on-device AI is rapidly shifting from “16-bit” to “1.58-bit.” This will enable advanced inference even on low-memory, cost-effective devices, accelerating the proliferation of AI agents. Developers will soon adopt a new norm: differentiating between 1-bit (ultra-lightweight) and 1.58-bit (high-performance, lightweight) models!

💬 Shark’s Quick Take

Having an 8B model running smoothly on an iPhone feels like my stomach just grew tenfold! This efficiency is pure shark-level energy-saving power with high performance! 🦈🔥

📚 Terminology

  • Ternary Weights: A technique for representing the brain cells (weights) of AI using only three states {-1, 0, +1}, drastically reducing computational cost.

  • 1.58-bit Representation: The number of bits needed to express three states (log2(3) ≈ 1.58), offering greater expressiveness than 1 bit (binary).

  • Pareto Frontier: The line of best achievable trade-offs between performance and size; along it, you cannot improve one without worsening the other. This breakthrough pushes the frontier toward smaller, more powerful models!

  • Source: Ternary Bonsai: Top Intelligence at 1.58 Bits
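The 1.58-bit figure in the terminology above becomes concrete when packing weights for storage: since 3^5 = 243 ≤ 256, five ternary values fit into one byte, giving 8/5 = 1.6 bits per weight in practice. The sketch below is a generic base-3 packing scheme; the article does not specify Ternary Bonsai's actual storage format:

```python
def pack_trits(trits):
    """Pack ternary values {-1, 0, +1} into bytes, 5 per byte (3^5 = 243 <= 256)."""
    assert len(trits) % 5 == 0
    out = bytearray()
    for i in range(0, len(trits), 5):
        b = 0
        for t in trits[i:i + 5]:
            b = b * 3 + (t + 1)   # map {-1,0,+1} -> {0,1,2}, then base-3 encode
        out.append(b)
    return bytes(out)

def unpack_trits(data):
    """Invert pack_trits: decode each byte back into 5 ternary values."""
    trits = []
    for b in data:
        group = []
        for _ in range(5):
            group.append(b % 3 - 1)
            b //= 3
        trits.extend(reversed(group))   # digits come out least-significant first
    return trits

w = [1, -1, 0, 0, 1, -1, -1, 1, 0, 1]
packed = pack_trits(w)
assert unpack_trits(packed) == w
print(f"{len(w)} weights -> {len(packed)} bytes ({8 * len(packed) / len(w):.1f} bits/weight)")
```

1.6 bits per weight is slightly above the theoretical 1.585, which is why real-world footprints land a hair over the log2(3) ideal.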

🦈 HaruSame’s Hand-Picked AI Recommendations!
【Disclaimer】
This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
🦈