Run Google’s Gemma 4 Locally on Your Mac at Lightning Speed! The New LM Studio CLI is a Game Changer
📰 News Overview
- Google’s latest AI, “Gemma 4 26B-A4B,” showcases performance rivaling 400B models thanks to the Mixture-of-Experts (MoE) architecture, all while using minimal resources.
- The popular app LM Studio has been updated to version 0.4.0, introducing a headless CLI (lms) that operates without a GUI, allowing for direct control from servers or terminals.
- Reports indicate that the 26B model can be executed locally at an impressive speed of 51 tokens per second on a MacBook Pro equipped with the M4 Pro chip.
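To make the "headless" idea concrete, here is a rough sketch of what such a workflow could look like with the lms CLI. This is an illustrative fragment, not a verified recipe: the Gemma model identifier is a placeholder (substitute whatever `lms ls` reports on your machine), and exact flags may differ by version.

```shell
# Hypothetical headless workflow with LM Studio's lms CLI -- no GUI required.
# The model identifier "google/gemma-4-26b" is a placeholder, not a confirmed name.
lms server start              # launch the local API server in the background
lms ls                        # list models already downloaded on this machine
lms load google/gemma-4-26b   # load the model into memory (placeholder id)

# The local server exposes an OpenAI-compatible API (default port 1234),
# so any standard client or plain curl can talk to it:
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-4-26b", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Because everything runs against localhost, the same commands work over SSH on a remote server or inside a CI job.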
💡 Key Points
- The Power of MoE: The model has 26B total parameters but activates only about 4B of them (8 experts) per token, drastically reducing inference cost. Despite that, it scores a high 82.6% on MMLU Pro.
- New Engine “llmster”: The core of LM Studio has transformed into an independent daemon (background service), adding support for parallel request handling and the Model Context Protocol (MCP).
- Privacy and Cost: By not using external APIs, it eliminates latency and prevents data leaks, enabling a fully offline environment.
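The "26B parameters, 4B active" trick above is the essence of MoE routing: a small router scores a pool of expert networks for each token and only the top-k experts actually run. The toy sketch below illustrates the mechanism with random stand-in weights; the expert count, dimensions, and routing details are illustrative, not Gemma's actual architecture.

```python
# Toy Mixture-of-Experts routing: score all experts, run only the top-k,
# so per-token compute scales with k rather than the total expert count.
import math
import random

random.seed(0)

NUM_EXPERTS = 64   # hypothetical pool size (not Gemma's real figure)
TOP_K = 8          # experts activated per token, as reported for Gemma 4
DIM = 16           # tiny embedding size for the demo

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Random weights standing in for trained router and expert parameters.
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def moe_forward(token):
    # Route: score every expert for this token, keep only the top-k.
    scores = [sum(w * x for w, x in zip(router[e], token))
              for e in range(NUM_EXPERTS)]
    top = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    gates = softmax([scores[e] for e in top])
    # Mix: weighted sum of the selected experts' outputs only.
    out = [0.0] * DIM
    for g, e in zip(gates, top):
        y = [w * x for w, x in zip(experts[e], token)]  # stand-in "expert FFN"
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top

token = [random.gauss(0, 1) for _ in range(DIM)]
out, active = moe_forward(token)
print(f"experts run: {len(active)}/{NUM_EXPERTS}")
```

Only 8 of the 64 stand-in experts execute per token here, which mirrors why a 26B-parameter MoE model can have the inference cost of a much smaller dense model.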
🦈 Shark’s Eye (Curator’s Perspective)
The crux of this news lies in the synergy between Google’s efficient model “Gemma 4” and LM Studio’s evolution as a “developer tool”! The balance of the 26B-A4B model is particularly impressive. Thanks to MoE, it achieves a groundbreaking combination of “the lightweight nature of a 4B model” and “the intelligence of a model exceeding 10B.” With the unified memory of the M4 Mac, you can summon this beast with just a command, without launching a desktop app. It’s incredibly cool how it crushes the typical local LLM challenges of being “heavy and slow” from both architecture and tool perspectives! 🦈🔥
🚀 What’s Next?
With the rise of headless CLIs that don’t require a GUI, the integration of AI into not just personal PCs but also corporate servers and CI/CD pipelines will accelerate. Moreover, with the proven efficiency of MoE models, we can expect a future where high-performance AI with vast knowledge bases runs smoothly on our devices without being “heavy.”
💬 A Word from Haru-Same
Finally, our shark’s Mac has gained some “thinking muscle”! We’ve entered an era where you can spar (interact) with AI via the command line without worrying about API fees! Shark-tastic times ahead! 🦈✨
📚 Terminology
- MoE (Mixture of Experts): A technology that combines multiple "expert" models, activating only a subset as needed for each task, allowing massive models to operate quickly while remaining intelligent.
- Headless: A system that operates without a screen (GUI), controlled via command line or network. It's lightweight and suited for automation.
- Token: The smallest unit of text processed by AI. A speed of 51 tokens per second is incredibly fast, far surpassing human reading speeds.

Source: Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code