[AI Minor News]

Conversing with 19th Century Knowledge? Meet the Victorian AI "Mr. Chatterbox"!


※ This article contains affiliate advertising.


📰 News Overview

  • 19th Century Exclusive Training Data: Trained exclusively on 28,035 public domain books from the British Library published between 1837 and 1899.
  • Completely Clean Dataset: Contains no information from after 1899, with vocabulary and ideas rooted in 19th-century literature.
  • Small Parameter Count: Composed of about 340 million parameters, similar to GPT-2 Medium, with a lightweight model size of around 2.05GB.
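As a rough sanity check on those numbers, here is a minimal back-of-the-envelope sketch. The ratio of 4 bytes (fp32) or 2 bytes (fp16/bf16) per parameter is a standard assumption, not a figure from the article; real checkpoints also carry optimizer state, tokenizer files, and other metadata, which is plausibly why the reported 2.05GB exceeds the raw-weight estimate.

```python
# Back-of-the-envelope checkpoint-size estimate for a ~340M-parameter model.
# Assumption: raw weight size ≈ parameter count × bytes per parameter.
PARAMS = 340_000_000

def checkpoint_gb(params: int, bytes_per_param: int) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

print(f"fp16/bf16: {checkpoint_gb(PARAMS, 2):.2f} GB")  # ~0.68 GB
print(f"fp32:      {checkpoint_gb(PARAMS, 4):.2f} GB")  # ~1.36 GB
```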

💡 Key Points

🦈 Shark’s Eye (Curator’s Perspective)

This project serves as a witty and challenging response to the “data rights issues” plaguing the current AI landscape!

The decision to train on the British Library archive with "nanochat" is impressively specific. Since the model carries no memories of anything after 1899, it simply won't understand a chat about smartphones; its vocabulary is stuck in the era of "gentlemen and ladies," which is rock and roll in its own right! Simon Willison's follow-up initiative, using Claude Code to build a plugin that gets the model running locally in no time, highlights the rapid pace of modern AI development and shouldn't be overlooked!

🚀 What’s Next?

The potential for an “ethically pristine model” using only public domain data has been showcased. In the future, we might see the emergence of a “time-travel dialogue AI” that perfectly recreates specific historical contexts by integrating even more extensive historical archives.

💬 A Shark’s Take

Becoming a gentleman of the 19th century by tossing out modern knowledge? This shark feels like donning a top hat! Let’s enjoy some elegant conversation! 🦈🎩

📚 Terminology

  • Public Domain: Works whose copyrights have expired or been waived, allowing anyone to use or modify them freely.

  • Chinchilla’s Law: A scaling result (from DeepMind’s 2022 “Chinchilla” work) giving the compute-optimal number of training tokens for a given parameter count, roughly 20 tokens per parameter; it serves as a benchmark for efficient training.

  • Markov Chain: A probabilistic model where the probability of the next event depends only on the current state, often used for simple text generation.

  • Source: Mr. Chatterbox is a Victorian-era ethically trained model
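To make the Chinchilla entry above concrete, here is a quick arithmetic sketch. The 20-tokens-per-parameter ratio is a rule of thumb, not an exact constant, and nothing in the article states what token budget Mr. Chatterbox was actually trained on.

```python
# Chinchilla-style compute-optimal token budget: roughly 20 training tokens
# per parameter (rule of thumb; the exact ratio varies with the analysis).
TOKENS_PER_PARAM = 20  # assumed constant, not from the article

def optimal_tokens(params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Approximate compute-optimal training-token count for a model size."""
    return params * ratio

# For a ~340M-parameter model like the one described above:
print(f"{optimal_tokens(340_000_000) / 1e9:.1f}B tokens")  # 6.8B tokens
```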
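And to illustrate the Markov-chain entry, here is a toy word-level text generator in the spirit of that definition: the next word depends only on the current word. The corpus string and all names are invented for illustration.

```python
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain: dict, start: str, length: int = 8, seed: int = 0) -> str:
    """Walk the chain: each step depends only on the current word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: no word ever followed this one
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the gentleman raised his hat and the lady raised her parasol"
chain = build_chain(corpus)
print(generate(chain, "the"))
```

Unlike an LLM, this model has no notion of context beyond the current word, which is why Markov chains are a useful baseline when discussing what transformer models like Mr. Chatterbox add.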

【Disclaimer】
This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
🦈