LLM’s “Memory” Slimmed Down to One-Fifth! From 300KB to 69KB - The Evolution of KV Cache
📰 News Overview
- Dramatic Weight Reduction of KV Cache: The KV cache that consumed roughly 300 KiB per token in 2019's GPT-2 has been cut to about 68.6 KiB per token in 2024's DeepSeek V3.
- Evolving Architectures: The technology is shifting from the naive approach of caching full Key/Value pairs for every attention head toward more sophisticated methods like GQA (Grouped-Query Attention), which shares a single set of Key/Value heads across groups of query heads, and MLA (Multi-head Latent Attention), which compresses Keys and Values into a low-dimensional latent space.
- From Memory to Filtering: The latest Gemma 3 limits most of its layers to sliding-window attention, so each token attends only to a recent span of context. SSMs (State Space Models) like Mamba go further, replacing the growing KV cache with a fixed-size recurrent state.
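The headline numbers can be reproduced with back-of-the-envelope arithmetic. A minimal sketch, assuming fp16 storage (2 bytes per element), the GPT-2 XL configuration (48 layers, 25 heads of dimension 64), and DeepSeek V3's published MLA dimensions (61 layers, a 512-dim latent plus a 64-dim RoPE component per layer):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV cache size: one Key and one Value vector
    per layer, per KV head (factor of 2), at fp16 = 2 bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# GPT-2 XL: 48 layers, 25 heads of dim 64 (d_model = 1600)
gpt2 = kv_bytes_per_token(n_layers=48, n_kv_heads=25, head_dim=64)
print(gpt2 / 1024)  # 300.0 KiB

# DeepSeek V3 MLA: caches only one shared latent per layer
# (512-dim latent + 64-dim decoupled RoPE key), 61 layers, fp16
mla = 61 * (512 + 64) * 2
print(mla / 1024)  # 68.625 KiB
```

Multiplied by a long context (say, 128K tokens) and a large batch, the gap between 300 KiB and 68.6 KiB per token translates directly into gigabytes of GPU memory.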
💡 Key Points
- Reduction in Physical Costs: KV Cache directly occupies GPU memory, impacting electricity, cooling, and rental costs. This reduction significantly boosts the economic viability of AI operations.
- Balancing Compression and Precision: DeepSeek's MLA compresses Keys and Values into a lower-dimensional latent space, drastically improving memory efficiency with little to no loss in accuracy.
- Approaching Human Thought: Architectures like SSMs, which filter for important information on the fly rather than storing everything like a library, are gaining traction.
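To make the compression-versus-precision trade-off concrete, here is a minimal NumPy sketch of the MLA idea. The dimensions are illustrative, not DeepSeek's actual sizes: only a small latent vector is cached per token, and the full Keys and Values are reconstructed from it at attention time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_heads, head_dim = 1024, 128, 8, 64  # toy sizes

# Learned projections (random here, just to show the shapes)
W_down = rng.normal(size=(d_model, d_latent)) * 0.02   # shared compression
W_up_k = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.02
W_up_v = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.02

x = rng.normal(size=(16, d_model))  # hidden states for 16 tokens

# Cache ONLY the latent: 128 floats per token,
# instead of 2 * 8 * 64 = 1024 floats for separate K + V
latent_cache = x @ W_down

# At attention time, expand the latent back into per-head K and V
K = (latent_cache @ W_up_k).reshape(16, n_heads, head_dim)
V = (latent_cache @ W_up_v).reshape(16, n_heads, head_dim)
print(latent_cache.shape, K.shape, V.shape)  # (16, 128) (16, 8, 64) (16, 8, 64)
```

With these toy numbers the cache shrinks 8x (128 vs. 1024 floats per token); the model learns the projections so that the reconstruction preserves what attention actually needs.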
🦈 Shark Perspective (Curator’s View)
The implementation of MLA (Multi-head Latent Attention) in DeepSeek V3 is super cool! Instead of merely sharing Key/Value data across heads (as GQA does), it compresses them into a "latent space" before caching and reconstructs them during inference—a true embodiment of data "abstraction" that's brilliantly smart! The leap from GPT-2's brute-force "remember everything" style to this level of sophistication in just a few years is a triumph of engineering!
🚀 What’s Next?
The era of models that “remember everything” is over; filtering technologies that determine “what to discard” based on the importance of information are poised to become mainstream. This will enable handling longer contexts with fewer hardware resources.
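The simplest "decide what to discard" policy is a sliding window that evicts the oldest KV pairs, which can be sketched in a few lines. The window size here is purely illustrative; production models use windows of hundreds to thousands of tokens:

```python
from collections import deque

WINDOW = 4  # toy window; real sliding-window layers keep far more

# deque with maxlen evicts the oldest entry automatically on append
kv_cache = deque(maxlen=WINDOW)

for token_id in range(10):
    # stand-ins for the real Key/Value tensors of each token
    kv_cache.append((f"K{token_id}", f"V{token_id}"))

print(list(kv_cache))  # only tokens 6-9 remain; 0-5 were discarded
```

Memory use is now constant in context length, the core reason sliding-window and SSM-style approaches scale to long contexts on modest hardware; smarter variants evict by estimated importance rather than pure recency.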
💬 Shark’s Takeaway
Memory-saving is an eco-friendly evolution that’s gentle on both the planet and your wallet! Just like a wise shark doesn’t remember unnecessary things, this progress is smart! 🦈🔥
📚 Glossary
- KV Cache: Key/Value data accumulated in GPU memory to maintain conversational context in an LLM. Without it, the model would have to recompute attention over the entire context from scratch at every generation step.
- GQA (Grouped-Query Attention): A technique that shares one set of Key/Value heads across a group of query heads, reducing memory consumption.
- MLA (Multi-head Latent Attention): A more aggressive memory-saving method that compresses Keys and Values into a compact latent vector for storage and expands them only when needed.

Source: From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem