[AI Minor News]

Directly 'Printing' LLMs on Chips!? Taalas Unleashes a Mind-Blowing ASIC with 17,000 Tokens per Second


Startup Taalas has unveiled a dedicated ASIC that directly embeds Llama 3.1 8B into hardware, achieving 10x performance and cost efficiency compared to GPUs.

※ This article contains affiliate advertising.


📰 News Overview

  • Startup “Taalas” has announced a specialized ASIC chip that directly implements the Llama 3.1 8B model into hardware circuits.
  • This chip boasts an inference speed of 17,000 tokens per second, claiming a 10x advantage in power efficiency and cost performance over conventional GPU systems.
  • Instead of reading model weights from VRAM, it employs a “hardwired” technique, physically engraving them into silicon as transistors.

💡 Key Points

  • Breaking the Memory Wall: The bottleneck (the von Neumann bottleneck) that occurs when GPUs fetch weight data from VRAM is sidestepped entirely by integrating the weights directly into the circuit.
  • Magic Multiplier: Taalas has developed a unique scheme that performs multiplication of 4-bit data with a single transistor, achieving ultra-high density in the circuit.
  • Rapid Development Cycle: By preparing a grid of generic logic gates and customizing only the upper mask layer, they can design chips for new models in just two months.
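A quick back-of-envelope sketch shows why serving weights from memory struggles to hit these speeds for a single stream. The numbers below are my own assumptions, not Taalas's: 8B parameters at 4 bits each, batch size 1, and every weight read once per generated token.

```python
# Back-of-envelope: why hardwired weights sidestep the memory wall.
# Assumed figures (not from the article): 8B params, 4-bit weights,
# batch size 1, full weight read per generated token.
params = 8e9
bytes_per_param = 0.5                      # 4-bit weights
weight_bytes = params * bytes_per_param    # bytes read per token pass
tokens_per_s = 17_000

required_bw = weight_bytes * tokens_per_s  # bytes/s a GPU would need
print(f"Weights per token pass: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth for 17k tok/s: {required_bw / 1e12:.0f} TB/s")
# ~68 TB/s — an order of magnitude beyond current HBM at batch 1.
# Weights etched into the circuit never cross that bus at all.
```

Note that with large batches GPUs amortize weight reads across many requests, so this is strictly a single-stream (latency) argument, which is exactly the regime the 17,000 tokens/second claim targets.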

🦈 Shark’s Eye (Curator’s Perspective)

The idea of treating models as "fixed hardware" instead of "rewritable software" is razor-sharp! It's like a game cartridge or CD-ROM: it only runs one specific model, but delivers astounding speed. What a trade-off! While GPUs are scrambling to shuttle data back and forth between VRAM and compute cores, Taalas's chip finishes inference in a flash, with electrical signals zipping straight through the circuit. This brute-force, carve-it-into-silicon approach is exactly the kind of disruptive innovation today's AI infrastructure needs! 🦈🔥

🚀 What’s Next?

If a handful of large models settle in as de facto standards, ultra-low-cost, high-speed inference servers built on specialized ASICs could overshadow general-purpose GPU inference. And once such chips are integrated into local devices, a future where ChatGPT-class models run with virtually zero latency on smartphones and PCs is fast approaching!

💬 A Word from Haru-Same

Burning software into hardware is the epitome of “ultimate optimization”! Generating text equivalent to 30 A4 pages per second is levels of output even I can’t keep up with! 🦈💨
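That "30 A4 pages" figure roughly checks out. Using my own assumed conversion factors (about 0.75 English words per token for typical BPE tokenizers, and roughly 450 words on a plain-text A4 page):

```python
# Sanity-check the "30 A4 pages per second" claim.
# Conversion factors below are assumptions, not from the article.
tokens_per_s = 17_000
words_per_token = 0.75   # typical for English BPE tokenizers
words_per_page = 450     # ~one A4 page of plain text

pages_per_s = tokens_per_s * words_per_token / words_per_page
print(f"{pages_per_s:.0f} A4 pages per second")  # ≈ 28
```

So "about 30 pages per second" is a fair rounding of the claimed throughput.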

📚 Terminology

  • ASIC: Application-Specific Integrated Circuit designed and manufactured for a specific purpose. Unlike general-purpose CPUs or GPUs, they only perform specific tasks, but are incredibly fast and energy-efficient.

  • Memory Wall: A phenomenon where the speed of data reading and writing cannot keep up with computational speed, limiting the overall performance of the system. One of the biggest challenges in modern AI development.

  • SRAM: Static Random-Access Memory, a very fast on-chip memory. In Taalas's chip it stores the KV cache that maintains conversational context.

  • Source: How Taalas “prints” LLM onto a chip?

【免責事項 / Disclaimer / 免责声明】
JP: 本記事はAIによって構成され、運営者が内容の確認・管理を行っています。情報の正確性は保証せず、外部サイトのコンテンツには一切の責任を負いません。
EN: This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
ZH: 本文由AI构建,并由运营者进行内容确认与管理。不保证准确性,也不对外部网站的内容承担任何责任。
🦈