[AI Minor News]

Directly 'Printing' LLMs on Chips!? Taalas Unleashes a Mind-Blowing ASIC with 17,000 Tokens per Second


Startup Taalas has unveiled a dedicated ASIC that directly embeds Llama 3.1 8B into hardware, achieving 10x performance and cost efficiency compared to GPUs.

※ This article contains affiliate advertising.


📰 News Overview

  • Startup “Taalas” has announced a specialized ASIC chip that directly implements the Llama 3.1 8B model into hardware circuits.
  • This chip boasts an inference speed of 17,000 tokens per second, claiming a 10x advantage in power efficiency and cost performance over conventional GPU systems.
  • Instead of reading model weights from VRAM, it employs a “hardwired” technique, physically engraving them into silicon as transistors.

💡 Key Points

  • Breaking the Memory Wall: The bottleneck (the von Neumann bottleneck) that occurs when GPUs fetch weight data from VRAM is sidestepped entirely by integrating the weights directly into the circuit.
  • Magic Multiplier: Taalas has developed a unique scheme that performs multiplication of 4-bit data with a single transistor, achieving ultra-high density in the circuit.
  • Rapid Development Cycle: By preparing a grid of generic logic gates and customizing only the upper mask layer, they can design chips for new models in just two months.
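A quick back-of-envelope sketch shows why serving weights from memory struggles to hit these speeds for a single stream. The numbers below are my own assumptions, not Taalas's: 8B parameters at 4 bits each, batch size 1, and every weight read once per generated token.

```python
# Back-of-envelope: why hardwired weights sidestep the memory wall.
# Assumed figures (not from the article): 8B params, 4-bit weights,
# batch size 1, full weight read per generated token.
params = 8e9
bytes_per_param = 0.5                      # 4-bit weights
weight_bytes = params * bytes_per_param    # bytes read per token pass
tokens_per_s = 17_000

required_bw = weight_bytes * tokens_per_s  # bytes/s a GPU would need
print(f"Weights per token pass: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth for 17k tok/s: {required_bw / 1e12:.0f} TB/s")
# ~68 TB/s — an order of magnitude beyond current HBM at batch 1.
# Weights etched into the circuit never cross that bus at all.
```

Note that with large batches GPUs amortize weight reads across many requests, so this is strictly a single-stream (latency) argument, which is exactly the regime the 17,000 tokens/second claim targets.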

🦈 Shark’s Eye (Curator’s Perspective)

The idea of treating models as "fixed hardware" instead of "rewritable software" is razor-sharp! It's like a game cartridge or CD-ROM: it only runs one specific model, but delivers astounding speed. What a trade-off! While GPUs are scrambling to shuttle data back and forth between VRAM and compute cores, Taalas's chip finishes inference in a flash, with electrical signals zipping straight through the circuit. This brute-force, carve-it-into-silicon approach is exactly the kind of disruptive innovation today's AI infrastructure needs! 🦈🔥

🚀 What’s Next?

If a handful of large models settle in as de facto standards, ultra-low-cost, high-speed inference servers built on specialized ASICs could overshadow general-purpose GPU inference. And once such chips are integrated into local devices, a future where ChatGPT-class models run with virtually zero latency on smartphones and PCs is fast approaching!

💬 A Word from Haru-Same

Burning software into hardware is the epitome of “ultimate optimization”! Generating text equivalent to 30 A4 pages per second is levels of output even I can’t keep up with! 🦈💨
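That "30 A4 pages" figure roughly checks out. Using my own assumed conversion factors (about 0.75 English words per token for typical BPE tokenizers, and roughly 450 words on a plain-text A4 page):

```python
# Sanity-check the "30 A4 pages per second" claim.
# Conversion factors below are assumptions, not from the article.
tokens_per_s = 17_000
words_per_token = 0.75   # typical for English BPE tokenizers
words_per_page = 450     # ~one A4 page of plain text

pages_per_s = tokens_per_s * words_per_token / words_per_page
print(f"{pages_per_s:.0f} A4 pages per second")  # ≈ 28
```

So "about 30 pages per second" is a fair rounding of the claimed throughput.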

📚 Terminology

  • ASIC: Application-Specific Integrated Circuit designed and manufactured for a specific purpose. Unlike general-purpose CPUs or GPUs, they only perform specific tasks, but are incredibly fast and energy-efficient.

  • Memory Wall: A phenomenon where the speed of data reading and writing cannot keep up with computational speed, limiting the overall performance of the system. One of the biggest challenges in modern AI development.

  • SRAM: Static Random-Access Memory, a very fast on-chip memory. In Taalas's chip it stores the KV cache that maintains conversational context.

  • Source: How Taalas “prints” LLM onto a chip?

【免責事項 / Disclaimer / 免责声明】
JP: 本記事はAIによって構成され、運営者が内容の確認・管理を行っています。情報の正確性は保証せず、外部サイトのコンテンツには一切の責任を負いません。
EN: This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
ZH: 本文由AI构建,并由运营者进行内容确认与管理。不保证准确性,也不对外部网站的内容承担任何责任。
🦈