3 min read
[AI Minor News]

Learning AI on a Classic Machine from 1976!? The Assembly Language Transformer 'ATTN-11' is Making Waves!



Note: This article contains affiliate advertising.


📰 News Summary

  • Running on 1970s Hardware: The ‘ATTN-11’ project has been unveiled, which trains a single-layer, single-head Transformer on a 1976 PDP-11 minicomputer.
  • Implemented in Assembly Language: Developed from scratch in PDP-11 assembly language to push processing speed and memory efficiency to the max.
  • Learning to Reverse Numbers: The model learns to reverse a sequence of numbers with 100% accuracy in about 350 training steps (roughly 1.5 hours of wall-clock time).
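The reversal task above boils down to a simple input/target pairing. Here is a minimal sketch in C; `SEQ_LEN` and `make_example` are illustrative names, not taken from the ATTN-11 source:

```c
/* Toy task sketch: the model learns to map a digit sequence to its
 * reversal. The training target for each position is simply the input
 * read back-to-front. */
#define SEQ_LEN 8

static void make_example(const int *input, int *target) {
    for (int i = 0; i < SEQ_LEN; i++)
        target[i] = input[SEQ_LEN - 1 - i];  /* target = reversed input */
}
```

A task this regular is a good fit for a single attention head: each output position only needs to attend to one mirrored input position.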

💡 Key Points

  • Minimal Parameters: The model consists of an embedding layer, self-attention, residual connections, and an output projection, with a mere 1,216 parameters.
  • Utilizing Fixed-Point Arithmetic: In an environment with no floating-point unit, it uses Q8 fixed-point for the forward pass and higher-precision Q15 for the backward pass.
  • Challenging Memory Limitations: To fit within the precious 32KB of core memory, it avoids the memory-hungry Adam optimizer in favor of plain, memory-efficient SGD.
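As a rough illustration of the Q8/Q15 split described above, here is a minimal fixed-point sketch in C, assuming Q8 means 8 fractional bits and Q15 means 15 fractional bits in a 16-bit word (the actual ATTN-11 number layouts may differ):

```c
#include <stdint.h>

/* Illustrative fixed-point types: a 16-bit word storing value * 2^8
 * (Q8) or value * 2^15 (Q15). */
typedef int16_t q8_t;
typedef int16_t q15_t;

/* Multiply two Q8 numbers: widen to 32 bits, multiply, then shift
 * back down by the number of fractional bits to restore the scale. */
static q8_t q8_mul(q8_t a, q8_t b) {
    return (q8_t)(((int32_t)a * b) >> 8);
}

/* Same idea for Q15, with 15 fractional bits. */
static q15_t q15_mul(q15_t a, q15_t b) {
    return (q15_t)(((int32_t)a * b) >> 15);
}
```

Widening before the multiply is the standard trick to keep the intermediate product from overflowing; Q15's extra fractional bits give the backward pass finer-grained gradients at the cost of a smaller representable range.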

🦈 Shark’s Eye (Curator’s Perspective)

The “Ultimate Optimization” approach is a stark contrast to modern AI development that splurges on massive GPU resources! What’s particularly intriguing is the manual tuning of learning rates per layer. By assigning higher learning rates to the weights of the attention mechanism and lower rates to the output projection, it significantly reduces training time without the memory overhead of Adam. This is super efficient! The craftsmanship of cramming a Transformer into just 32KB—about the size of a single icon on a modern website—truly brings back the essence of computing and gets me excited!
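The per-layer learning-rate idea is easy to sketch with plain SGD: each parameter group gets its own step size, and no optimizer state (momentum, variance estimates) is kept. The rates and layer names below are hypothetical, not taken from the project:

```c
/* Plain SGD step: w -= lr * grad. Unlike Adam, there are no extra
 * per-parameter buffers, so memory cost is just the weights and
 * gradients themselves. */
static void sgd_step(float *w, const float *grad, int n, float lr) {
    for (int i = 0; i < n; i++)
        w[i] -= lr * grad[i];
}

/* Usage sketch with hypothetical per-layer rates:
 *   sgd_step(attn_weights, attn_grads, n_attn, 0.02f);   // higher LR
 *   sgd_step(proj_weights, proj_grads, n_proj, 0.005f);  // lower LR
 */
```

Hand-tuned per-layer rates recover some of the benefit of adaptive optimizers without their two extra state buffers per parameter, which matters when the whole model has to share 32KB.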

🚀 What’s Next?

  • This project reaffirms that AI can operate and even learn in extremely resource-constrained environments, such as legacy embedded systems, not just as giant models.
  • I believe it will attract attention among retro computing enthusiasts and low-level engineers as an educational resource to understand the essence of algorithms!

💬 HaruShark’s Take

It’s thrilling to see a Transformer running on a machine from the era when my shark ancestors were swimming around 50 years ago! AI written in assembly language truly embodies a “steel will”! 🦈🔥

📚 Terminology

  • PDP-11: A significant 16-bit minicomputer sold by Digital Equipment Corporation (DEC) in the 1970s, leaving a mark in computer history.

  • Fixed-Point Arithmetic: A method of representing numbers with a fixed number of fractional bits, used for fast calculations on older CPUs that lack floating-point hardware.

  • SGD (Stochastic Gradient Descent): The most basic method for updating weights in neural networks, highly suitable for resource-constrained environments like this project due to its very low memory consumption.

  • Source: Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer

【免責事項 / Disclaimer / 免责声明】
JP: 本記事はAIによって構成され、運営者が内容の確認・管理を行っています。情報の正確性は保証せず、外部サイトのコンテンツには一切の責任を負いません。
EN: This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
ZH: 本文由AI构建,并由运营者进行内容确认与管理。不保证准确性,也不对外部网站的内容承担任何责任。
🦈