[AI Minor News]

Blazing Fast at Over 1,000 tok/s! The Reasoning LLM 'Mercury 2', Built on Diffusion Models, Is Redefining AI Generation


Inception Labs has unveiled 'Mercury 2', the world’s fastest reasoning language model. Built on diffusion models, it achieves an astounding generation speed of over 1,000 tokens per second.

※ This article contains affiliate advertising.


📰 News Summary

  • World’s Fastest Reasoning LLM: Inception Labs has announced ‘Mercury 2’, a new model built on diffusion models.
  • Astounding Throughput: Achieves a record-breaking 1,009 tokens per second on NVIDIA Blackwell GPUs, far surpassing traditional sequential decoding methods.
  • High Compatibility and Rich Features: Offers a 128K context window, native tool integration, JSON output support, and compatibility with the OpenAI API.
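OpenAI-API compatibility means a request can be shaped exactly like a standard chat-completions call. The sketch below builds such a payload with the standard library only; the endpoint URL and the `mercury-2` model identifier are illustrative assumptions, not confirmed values from Inception Labs.

```python
import json

# Illustrative OpenAI-compatible chat request for Mercury 2.
# BASE_URL and the model name are assumptions for this sketch.
BASE_URL = "https://api.example.com/v1/chat/completions"

payload = {
    "model": "mercury-2",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Summarize diffusion LLMs in one sentence."}
    ],
    "response_format": {"type": "json_object"},  # JSON output support
    "max_tokens": 256,
}

# Serialize the body as it would be POSTed to the endpoint.
body = json.dumps(payload)
print(body)
```

Because the API surface mirrors OpenAI’s, existing client libraries should in principle work simply by pointing their base URL at the Mercury endpoint.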

💡 Key Points

  • Shift to ‘Editor Style’: Unlike traditional LLMs that generate one token at a time, Mercury 2 employs parallel refinement to generate multiple tokens simultaneously, achieving over five times the speed.
  • Balancing Reasoning and Speed: Mercury 2 delivers real-time responses even for tasks requiring advanced reasoning, easing the usual trade-off between reasoning depth and latency.
  • Affordable Pricing: Priced at $0.25 per million input tokens and $0.75 per million output tokens, it is designed for large-scale production use.
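At those rates, the per-request cost is simple arithmetic. A minimal sketch (the token counts below are made-up example values, not benchmarks):

```python
# Listed rates: $0.25 per 1M input tokens, $0.75 per 1M output tokens.
INPUT_RATE = 0.25 / 1_000_000
OUTPUT_RATE = 0.75 / 1_000_000

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at Mercury 2's listed pricing."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a RAG-style call with a large prompt and a short answer:
cost = estimate_cost(50_000, 2_000)
print(f"${cost:.4f}")  # prints $0.0140
```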

🦈 Shark’s Eye (Curator’s Perspective)

Finally, the very way LLMs “write” has evolved! Traditional AIs were shackled to linear “sequential decoding”, typing one token at a time from left to right, but Mercury 2 boldly dives into parallel generation with diffusion models, like turning a rough draft into polished work in one pass. How cool is that?

Especially impressive is crossing the 1,000 tokens/second mark on NVIDIA Blackwell! This has the potential to fundamentally transform how AI agents operate. With the ability to loop through thought processes multiple times in the background without keeping users waiting, we’re gaining the “instantaneity” we’ve always wanted. This is a groundbreaking step that tackles the biggest weakness of reasoning models, being “smart but slow”, through architectural prowess! 🦈🔥

🚀 What’s Next?

AI with “reasoning-grade” capabilities will soon become the standard in areas like voice interactions and video avatars, where milliseconds of delay are unacceptable. Moreover, complex multi-hop retrieval-augmented generation (RAG) and autonomous agent loops will speed up dramatically, evolving AI interaction from merely a “tool” into an extension of our thinking.

💬 A Word from Haru Shark

The era of typewriters is over! From now on, it’s all about “thinking in an instant and answering in an instant”—the kind of explosive responsiveness we sharks thrive on will become the new standard for AI! 🦈⚡️

📚 Terminology Explained

  • Diffusion Model: A generative technique that starts from noise and iteratively refines it into data. While mainstream in image generation, Mercury 2 applies it to parallel text generation.

  • Tokens per Second: A unit of measurement for how many tokens an AI can generate in one second. Higher numbers indicate faster generation speeds.

  • AI Agent: An autonomous AI system that thinks independently and uses external tools to complete tasks based on user instructions.

  • Source: Mercury 2: The fastest reasoning LLM, powered by diffusion
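To make the “refining noise into data” idea concrete for text, here is a toy sketch of masked-denoising-style parallel generation. This is an assumption-laden illustration of the general idea, not Mercury 2’s actual algorithm: a real model uses a trained network to choose tokens, while this sketch just samples from a tiny vocabulary.

```python
import random

random.seed(0)

VOCAB = ["fast", "model", "tokens", "parallel", "speed"]
MASK = "<mask>"

def refine(seq, fills_per_step=3):
    """Fill up to `fills_per_step` masked positions in one parallel step."""
    masked = [i for i, t in enumerate(seq) if t == MASK]
    for i in random.sample(masked, min(fills_per_step, len(masked))):
        seq[i] = random.choice(VOCAB)  # a real model would predict here
    return seq

# Start from pure "noise" (all positions masked) and refine in parallel.
seq = [MASK] * 8
steps = 0
while MASK in seq:
    seq = refine(seq)
    steps += 1

print(steps, seq)  # 8 positions resolved in 3 steps instead of 8
```

The key contrast with sequential decoding: eight positions are resolved in three refinement passes rather than eight one-token steps, which is the intuition behind the throughput gains.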

🦈 Haru Shark’s Handpicked AI Recommendations!
【Disclaimer】
This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
🦈