Original Reddit post

Hey all - I built a small LLM experiment called Mr. Chatterbox, a chatbot trained entirely on books published during the Victorian era (1837–1899). It was trained on a subset of the BL Books dataset, then fine-tuned on a mix of corpus and synthetic data. I used nanochat for the initial training and the supervised fine-tuning rounds.

SFT consisted of two rounds: one round of two epochs on a large dataset (over 40,000 pairs) of corpus material and synthetic data, and a smaller round focused on specific cases like handling modern greetings, goodbyes, attempted prompt injections, etc.

The model is about 340 million parameters, and so far it's quite good at discussing Victorian topics (like Darwin, the railroads, etc.) while staying in an authentic Victorian voice. As a relatively small model, it can get confused and it definitely has some limitations. To address them, I'm thinking of implementing direct preference optimization to keep improving the model.

Anyway, I would love to know if others here have experience with this kind of thing, and to hear your impressions of the model!
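For anyone curious what the DPO step would involve: the core of it is a simple pairwise loss computed from log-probabilities of a preferred ("chosen") and dispreferred ("rejected") response under the policy and a frozen reference model (here, the SFT checkpoint). This is a minimal sketch of that loss in plain Python - the function name and the example numbers are mine, not from any particular library:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed token log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    beta controls how far the policy is allowed to drift from the
    reference; 0.1 is a commonly used default.
    """
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(logits)), written stably as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))

# If the policy already favors the chosen answer more strongly than
# the reference does, the loss is small:
low = dpo_loss(-10.0, -20.0, -12.0, -18.0)

# If it favors the rejected answer instead, the loss is larger,
# pushing probability mass toward the chosen response:
high = dpo_loss(-20.0, -10.0, -18.0, -12.0)
```

In practice you'd get these log-probs from two forward passes per pair (policy and reference) and average the loss over a batch; libraries like TRL wrap exactly this recipe.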

Originally posted by u/centerstate on r/ArtificialInteligence