3 min read
[AI Minor News]

Learning AI on a Classic Machine from 1976!? The Assembly Language Transformer 'ATTN-11' is Making Waves!



Note: This article contains affiliate advertising.


📰 News Summary

  • Running on 1970s Hardware: The ‘ATTN-11’ project has been unveiled, which trains a single-layer, single-head Transformer on a 1976 PDP-11 minicomputer.
  • Implemented in Assembly Language: Developed from scratch in PDP-11 assembly language to push processing speed and memory efficiency to the max.
  • Learning to Reverse Numbers: The model learns to reverse a sequence of numbers with 100% accuracy in about 350 training steps (roughly 1.5 hours of wall-clock time).
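The reversal task above boils down to a simple input/target pairing. Here is a minimal sketch in C; `SEQ_LEN` and `make_example` are illustrative names, not taken from the ATTN-11 source:

```c
/* Toy task sketch: the model learns to map a digit sequence to its
 * reversal. The training target for each position is simply the input
 * read back-to-front. */
#define SEQ_LEN 8

static void make_example(const int *input, int *target) {
    for (int i = 0; i < SEQ_LEN; i++)
        target[i] = input[SEQ_LEN - 1 - i];  /* target = reversed input */
}
```

A task this regular is a good fit for a single attention head: each output position only needs to attend to one mirrored input position.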

💡 Key Points

  • Minimal Parameters: The model consists of an embedding layer, self-attention, residual connections, and an output projection, with a mere 1,216 parameters.
  • Utilizing Fixed-Point Arithmetic: In an environment with no floating-point unit, it uses Q8 fixed-point for the forward pass and higher-precision Q15 for the backward pass.
  • Challenging Memory Limitations: To fit within the precious 32KB of core memory, it avoids the memory-hungry Adam optimizer in favor of plain, memory-efficient SGD.
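As a rough illustration of the Q8/Q15 split described above, here is a minimal fixed-point sketch in C, assuming Q8 means 8 fractional bits and Q15 means 15 fractional bits in a 16-bit word (the actual ATTN-11 number layouts may differ):

```c
#include <stdint.h>

/* Illustrative fixed-point types: a 16-bit word storing value * 2^8
 * (Q8) or value * 2^15 (Q15). */
typedef int16_t q8_t;
typedef int16_t q15_t;

/* Multiply two Q8 numbers: widen to 32 bits, multiply, then shift
 * back down by the number of fractional bits to restore the scale. */
static q8_t q8_mul(q8_t a, q8_t b) {
    return (q8_t)(((int32_t)a * b) >> 8);
}

/* Same idea for Q15, with 15 fractional bits. */
static q15_t q15_mul(q15_t a, q15_t b) {
    return (q15_t)(((int32_t)a * b) >> 15);
}
```

Widening before the multiply is the standard trick to keep the intermediate product from overflowing; Q15's extra fractional bits give the backward pass finer-grained gradients at the cost of a smaller representable range.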

🦈 Shark’s Eye (Curator’s Perspective)

The “Ultimate Optimization” approach is a stark contrast to modern AI development that splurges on massive GPU resources! What’s particularly intriguing is the manual tuning of learning rates per layer. By assigning higher learning rates to the weights of the attention mechanism and lower rates to the output projection, it significantly reduces training time without the memory overhead of Adam. This is super efficient! The craftsmanship of cramming a Transformer into just 32KB—about the size of a single icon on a modern website—truly brings back the essence of computing and gets me excited!
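The per-layer learning-rate idea is easy to sketch with plain SGD: each parameter group gets its own step size, and no optimizer state (momentum, variance estimates) is kept. The rates and layer names below are hypothetical, not taken from the project:

```c
/* Plain SGD step: w -= lr * grad. Unlike Adam, there are no extra
 * per-parameter buffers, so memory cost is just the weights and
 * gradients themselves. */
static void sgd_step(float *w, const float *grad, int n, float lr) {
    for (int i = 0; i < n; i++)
        w[i] -= lr * grad[i];
}

/* Usage sketch with hypothetical per-layer rates:
 *   sgd_step(attn_weights, attn_grads, n_attn, 0.02f);   // higher LR
 *   sgd_step(proj_weights, proj_grads, n_proj, 0.005f);  // lower LR
 */
```

Hand-tuned per-layer rates recover some of the benefit of adaptive optimizers without their two extra state buffers per parameter, which matters when the whole model has to share 32KB.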

🚀 What’s Next?

  • This project reaffirms that AI can operate and even learn in extremely resource-constrained environments, such as legacy embedded systems, not just as giant models.
  • I believe it will attract attention among retro computing enthusiasts and low-level engineers as an educational resource to understand the essence of algorithms!

💬 HaruShark’s Take

It’s thrilling to see a Transformer running on a machine from the era when my shark ancestors were swimming around 50 years ago! AI written in assembly language truly embodies a “steel will”! 🦈🔥

📚 Terminology

  • PDP-11: A significant 16-bit minicomputer sold by Digital Equipment Corporation (DEC) in the 1970s, leaving a mark in computer history.

  • Fixed-Point Arithmetic: A method of representing numbers with a fixed number of fractional bits, used for fast calculations on older CPUs that lack floating-point hardware.

  • SGD (Stochastic Gradient Descent): The most basic method for updating weights in neural networks, highly suitable for resource-constrained environments like this project due to its very low memory consumption.

  • Source: Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer

【免責事項 / Disclaimer / 免责声明】
JP: 本記事はAIによって構成され、運営者が内容の確認・管理を行っています。情報の正確性は保証せず、外部サイトのコンテンツには一切の責任を負いません。
EN: This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
ZH: 本文由AI构建,并由运营者进行内容确认与管理。不保证准确性,也不对外部网站的内容承担任何责任。
🦈