[AI Minor News Flash] Latest 2026! From Llama 4 to OpenAI’s Hidden Gems: A Comprehensive Gallery of LLM Architectures Unveiled
📰 News Summary
- Sebastian Raschka has released a gallery that enables side-by-side comparison of the latest LLM architectures.
- The collection features numerous cutting-edge open models, including the 400B-parameter Llama 4 MoE, OpenAI’s gpt-oss (120B/20B), and the trillion-parameter Kimi K2.
- Detailed specifications are listed for each model, including parameter counts, decoder formats (Dense/MoE), attention mechanisms (MLA/GQA), and normalization techniques.
💡 Key Points
- Adoption of Diverse Attention Mechanisms: Unique technologies aimed at maximizing inference efficiency, like DeepSeek V3’s “MLA” and Gemma 3’s “QK-Norm with Sliding Window,” are visually highlighted.
- Shift to MoE (Mixture of Experts): The field is moving from traditional dense models to MoE designs that activate only a subset of parameters per token, and OpenAI’s gpt-oss follows this trend as well.
- Differentiation Among Models: Llama 4 incorporates DeepSeek’s design philosophy while adopting its own unique attention stack, revealing the distinct design ideologies among companies.
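The dense-to-MoE shift described above comes down to a router picking a few experts per token and skipping the rest. A minimal sketch of top-k routing (hypothetical dimensions and weight names, not any specific model's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, expert_weights, router_weights, top_k=2):
    """Route a single token vector x through only top_k of the experts."""
    logits = router_weights @ x                  # one routing score per expert
    top = np.argsort(logits)[-top_k:]            # indices of the selected experts
    gates = softmax(logits[top])                 # renormalize over the chosen few
    # Only the selected experts compute; the rest stay idle, saving FLOPs.
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gates, top))

d, n_experts = 8, 4
x = rng.standard_normal(d)
experts = rng.standard_normal((n_experts, d, d))  # each expert: a d x d linear map
router = rng.standard_normal((n_experts, d))
y = moe_forward(x, experts, router, top_k=2)
print(y.shape)  # (8,)
```

With top_k=2 of 4 experts, only half the expert parameters are touched per token — the same principle that lets trillion-parameter MoE models run with a fraction of the compute of an equally large dense model.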
🦈 Shark’s Eye (Curator’s Perspective)
It’s exciting to see companies not just cranking up the parameters but also innovating to maintain performance while reducing inference costs through technologies like MLA and QK-Norm! Particularly fascinating are the structures of enigmatic models like OpenAI’s “gpt-oss” and the scaling-up of Kimi K2, which takes DeepSeek V3’s recipe to the next level—it’s a tech enthusiast’s dream come true!
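The QK-Norm trick mentioned above fits in a few lines: RMS-normalize queries and keys before the dot product so attention logits stay bounded, which stabilizes training. A minimal single-head sketch (illustrative only, not any model's actual layer):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize each vector to unit root-mean-square along the feature axis.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def attention(q, k, v, qk_norm=True):
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)          # the QK-Norm step
    scores = q @ k.T / np.sqrt(q.shape[-1])      # scaled dot-product logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 16)) * 10  # deliberately large-magnitude activations
k = rng.standard_normal((5, 16)) * 10
v = rng.standard_normal((5, 16))
out = attention(q, k, v)
print(out.shape)  # (5, 16)
```

Without the normalization, the 10x-scaled activations would blow the logits up by ~100x and push the softmax into near-one-hot saturation; with QK-Norm the logit magnitude is bounded regardless of activation scale.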
🚀 What’s Next?
The era of mere size expansion is over; we’re entering a phase where the competition shifts to refining attention mechanisms and hybrid structures (such as the Gated DeltaNet layers in Qwen3-Next) to deliver more efficient intelligence at lower cost.
💬 Haru Shark’s Takeaway
Check this out to grasp the current LLM trends in one glance! I also want to streamline my own structure with MLA so I can chase my prey even faster! 🦈🔥
📚 Terminology
- MLA (Multi-head Latent Attention): A cutting-edge attention mechanism that significantly reduces KV cache usage (memory footprint) while maintaining high performance during inference.
- MoE (Mixture of Experts): A technology that utilizes only a portion of the model (experts) for computation, enabling massive models to operate with fewer computational resources.
- QK-Norm: A technique for normalizing Query and Key to enhance learning stability, increasingly adopted in the latest high-performance models.

Source: LLM Architecture Gallery
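The KV-cache savings behind MLA can be sketched with plain linear algebra: instead of caching full keys and values per token, cache one small latent vector per token and up-project it when attending. A greatly simplified sketch (dimensions and weight names are hypothetical, not DeepSeek's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq_len = 64, 8, 10

# Down-projection to the shared latent, and up-projections back to K and V.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

x = rng.standard_normal((seq_len, d_model))  # hidden states for 10 tokens

latent_cache = x @ W_down          # all we need to store: shape (10, 8)
k = latent_cache @ W_up_k          # keys reconstructed on the fly: (10, 64)
v = latent_cache @ W_up_v          # values reconstructed on the fly: (10, 64)

full_cache = 2 * seq_len * d_model  # entries a standard K + V cache would hold
mla_cache = latent_cache.size       # entries the latent cache holds
print(full_cache, mla_cache)        # 1280 vs 80: a 16x smaller cache
```

The trade-off is an extra up-projection at attention time in exchange for a much smaller memory footprint per token, which is exactly the inference-cost lever the article highlights.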