If you know me, then you probably know that one of my biggest obsessions through elementary and middle school was the warrior cats book series by Erin Hunter. Part of the reason that I fell in love with this book series and the warrior cats universe is how much content there is to get through— with over 100 main-series books, novellas, and super-editions combined, there was a never-ending flow of content for little me to consume. I used to read every single new book as soon as it came out, and even though I haven’t been keeping up as closely since graduating middle school, the series still holds a very special place in my heart.
With the rise in popularity of ChatGPT, I thought it might be fun to see if I could train my own neural network with the specific task of generating text that mimics that of the warrior cats books. I watched Andrej Karpathy’s video on nanoGPT and decided to see if I could modify it to write a warrior cats book, because you can never have too many warrior cats books, of course. I set my expectations pretty low, considering that I was running everything on my 2016 MacBook Pro.
I first used just the very first installment of the warrior cats series, Into the Wild, as the input txt file, and tokenized the input at the character level, so that each individual character was its own token. My training set had 1,003,854 tokens, which left 111,540 tokens in my validation set for roughly a 90/10 split. I ran this for around 2000 iterations on my laptop, which produced some… interesting results…
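In case you're curious what character-level tokenization looks like, here's a small sketch in the spirit of nanoGPT's data prep (the repo's actual prepare script differs in details, and the sample string below is just a stand-in for the real input file):

```python
def build_char_tokenizer(text):
    """Map every distinct character in the text to its own integer token."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}  # char -> token id
    itos = {i: ch for ch, i in stoi.items()}      # token id -> char
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode, len(chars)

# Stand-in for reading the actual book text from a file
text = "Fire alone can save our Clan."
encode, decode, vocab_size = build_char_tokenizer(text)
ids = encode(text)

# Roughly 90/10 train/validation split, as described above
n = int(0.9 * len(ids))
train_ids, val_ids = ids[:n], ids[n:]
```

With the full book as input, this is how you end up with a little over a million tokens total: one token per character.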
There was also a fair share of gibberish:
Still, you could definitely see the influence of the characters and settings of the warrior cats books in the samples produced, so this was a decent start. I tried to improve the coherence of the generated text by appending the second book in the series, Fire and Ice, to the input file, and by slightly increasing the number of iterations to 2500. The training loop ran for around 17 minutes, and the samples produced looked ever so slightllyyyy better.
I was still seeing a pretty large number of unintelligible words, though, so I thought that OpenAI’s tiktoken tokenizer might be a better way of tokenizing the input text. Tiktoken uses byte pair encoding (BPE) to convert text into tokens; two significant advantages this provides are that it compresses the text and that it lets the model see common subwords when words are split into tokens. Instead of splitting a word like “meowed” into six individual tokens, it might split it into just two, “meow” and “ed”, which could help the model better pick up on such grammatical structures.
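The core idea behind BPE is simpler than it sounds: repeatedly merge the most frequent adjacent pair of tokens into a single new token. Here's a toy version to illustrate (this is just the idea, not tiktoken's actual implementation — tiktoken's merges were learned from GPT-2's training corpus, not from your input text):

```python
from collections import Counter

def bpe_train(tokens, num_merges):
    """Greedy BPE on a list of string tokens (start from single characters):
    repeatedly merge the most frequent adjacent pair into one token."""
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            # Merge every occurrence of the chosen pair, left to right
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# After five merges, every copy of "meowed" collapses into a single token
tokens = bpe_train(list("meowed meowed meowed"), num_merges=5)
```

Frequent words end up as one or two tokens, which is exactly the compression effect that shrank the dataset in the next step.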
In addition to changing the tokenization method, I also added a third book, Forest of Secrets, to the input text, and increased the number of iterations to 4500. Despite tripling the number of books represented in the input text, the number of tokens in the training set dropped to 763,576 with the new tokenization method. This took around 4 hours to train on my laptop, so I would recommend either using a GPU or simply letting it run on your laptop overnight :)
I woke up the next morning to a nice surprise: a new warrior cats book generated just for me 😎
Some other interesting excerpts:
“They are round,” Bluestar interrupted him. “You will take bringing back of them.”
“You may be long kittypet!” spat Graystripe, sounding strong and haunches on her eyes. “I’m better check the whole Clan you.”
“That’s all right,” Tigerclaw meowed. “The dogs is very battle.”
Even though I’m not sure that what was produced is quality literature, it certainly is interesting to read— maybe I’ll be able to improve upon the model one day and produce the 115th installment of the series that would make Erin Hunter proud.
Until next time!