[AI Minor News Flash] Blazing Fast at Over 1,000 tok/s! The Reasoning LLM ‘Mercury 2’, Built on Diffusion Models, Is Redefining AI Generation
📰 News Summary
- World’s Fastest Reasoning LLM: Inception Labs has announced ‘Mercury 2’, a new model built on diffusion models.
- Astounding Throughput: It achieves a record-breaking 1,009 tokens per second on NVIDIA Blackwell GPUs, far surpassing traditional sequential decoding.
- High Compatibility and Features: It offers a 128K context window, native tool integration, JSON output support, and OpenAI API compatibility.
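Since Mercury 2 is advertised as OpenAI API compatible, calling it should look like any OpenAI-style chat completion. Here is a minimal sketch of what such a request body might look like; the model name `"mercury-2"` and the JSON-mode flag are assumptions for illustration, not confirmed API details.

```python
import json

def build_request(user_message: str, json_mode: bool = False) -> dict:
    """Build a hypothetical OpenAI-compatible chat completion request."""
    body = {
        "model": "mercury-2",  # assumed model identifier
        "messages": [{"role": "user", "content": user_message}],
    }
    if json_mode:
        # OpenAI-style switch for structured JSON output
        body["response_format"] = {"type": "json_object"}
    return body

request = build_request("Summarize today's AI news.", json_mode=True)
print(json.dumps(request, indent=2))
```

Because the request shape matches the OpenAI schema, existing client code should in principle only need the base URL and model name swapped.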
💡 Key Points
- Shift to an ‘Editor Style’: Unlike traditional LLMs that generate one token at a time, Mercury 2 uses parallel refinement to produce multiple tokens simultaneously, making it more than five times faster than sequential decoding.
- Balancing Reasoning and Speed: Mercury 2 delivers real-time responses even on tasks requiring advanced reasoning, easing the usual trade-off between reasoning depth and latency.
- Affordable Pricing: Priced at $0.25 per million input tokens and $0.75 per million output tokens, it is designed for large-scale production use.
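At the quoted rates, a back-of-the-envelope cost estimate is simple arithmetic (the example workload sizes below are made up for illustration):

```python
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD at the quoted rates: $0.25/M input, $0.75/M output."""
    return input_tokens / 1_000_000 * 0.25 + output_tokens / 1_000_000 * 0.75

# e.g. a heavy agent workload: 10M input tokens, 2M output tokens
print(f"${estimate_cost(10_000_000, 2_000_000):.2f}")  # → $4.00
```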
🦈 Shark’s Eye (Curator’s Perspective)
Finally, the very way LLMs “write” has evolved! Traditional AIs were shackled to linear sequential decoding, generating one token at a time from left to right, but Mercury 2 dives boldly into parallel generation with diffusion models, turning a rough draft into polished text in a single pass. How cool is that?
Especially impressive is crossing the 1,000 tokens-per-second mark on NVIDIA Blackwell! This has the potential to fundamentally transform how AI agents operate. With the ability to loop through thought processes multiple times in the background without keeping users waiting, we gain the “instantaneity” we’ve always wanted. It’s a groundbreaking step that tackles the biggest weakness of reasoning models, being “smart but slow”, through sheer architectural prowess! 🦈🔥
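To make the 1,009 tok/s figure concrete in wall-clock terms, here is a quick comparison against a sequential decoder; the 200 tok/s baseline is an assumed figure for illustration, not a number from the article.

```python
def generation_time(tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate `tokens` at a given throughput."""
    return tokens / tokens_per_second

response_tokens = 2_000  # a long background "thinking" pass by an agent
fast = generation_time(response_tokens, 1_009)  # Mercury 2 on Blackwell
slow = generation_time(response_tokens, 200)    # assumed sequential baseline
print(f"Mercury 2: {fast:.1f}s, sequential baseline: {slow:.1f}s")
```

At these numbers, an agent could run five refinement loops in roughly the time a sequential model spends on one, which is exactly the “instantaneity” argument above.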
🚀 What’s Next?
Reasoning-grade AI will soon become the standard in areas like voice interaction and video avatars, where even milliseconds of delay are unacceptable. Moreover, complex multi-hop retrieval-augmented generation (RAG) and autonomous agent loops will speed up dramatically, evolving AI interaction from a mere “tool” into an extension of our own thinking.
💬 A Word from Haru Shark
The era of typewriters is over! From now on, it’s all about “thinking in an instant and answering in an instant”—the kind of explosive responsiveness we sharks thrive on will become the new standard for AI! 🦈⚡️
📚 Terminology Explained
- Diffusion Model: A technique for restoring data from noise. While it is mainstream in image generation, Mercury 2 applies it to parallel text generation.
- Tokens per Second: A measure of how many tokens an AI can generate in one second. Higher numbers mean faster generation.
- AI Agent: An autonomous AI system that thinks independently and uses external tools to complete tasks based on user instructions.

Source: Mercury 2: The fastest reasoning LLM, powered by diffusion