[Day 199] Continuing with Build an LLM from scratch

 Hello :)
Today is Day 199!


A quick summary of today:
  • saw that all the chapters of the book Build an LLM from Scratch have been published, so I decided to continue with it (after a few months of waiting)


I like that even though we are in chapter 5 (out of 7), the author still reminds us, the learners, of the process of going from input text to LLM-generated text, making sure we are all on the same page.

The goal of this chapter is to train the model, because at the moment, when we try to generate some text, we get gibberish.
Right now, the untrained model is given "Every effort moves you", and it continues this with "rentingetic wasnم refres RexMeCHicular stren".
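(For reference, the generation itself is just a loop that keeps asking the model for the most likely next token and appends it to the input — something like the sketch below, which is my own simplification rather than the book's exact function.)

```python
import torch

@torch.no_grad()
def generate_greedy(model, token_ids, max_new_tokens, context_length):
    # token_ids: (batch, current_length) tensor of token IDs
    for _ in range(max_new_tokens):
        context = token_ids[:, -context_length:]           # crop to the supported context size
        logits = model(context)                            # (batch, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        token_ids = torch.cat([token_ids, next_token], dim=1)
    return token_ids
```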

By the way, the current (untrained) model has the following config: 
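(I did not copy the exact config here, so the values below are my recollection of the book's GPT_CONFIG_124M — treat them as an assumption; chapter 5 shortens the context length to 256 to keep training cheap.)

```python
# My recollection of the 124M-parameter GPT config (values assumed, not quoted)
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # BPE tokenizer vocabulary size
    "context_length": 256,   # shortened from 1024 for cheaper training
    "emb_dim": 768,          # embedding / hidden size
    "n_heads": 12,           # attention heads per block
    "n_layers": 12,          # transformer blocks
    "drop_rate": 0.1,        # dropout rate
    "qkv_bias": False,       # no bias in the query/key/value projections
}
```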
How can the model learn? Its weights need to be updated so that it starts to predict the target tokens. Here comes good ol' backpropagation, and it requires a loss function that calculates the difference between the desired and actual output (i.e. how far off the model is).
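In code, that loss is just cross-entropy over the next-token predictions — roughly like this sketch (shapes and values are made up for illustration, not the book's exact functions):

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 4, 50257

# Pretend these came from the model: one logit vector per input position
logits = torch.randn(batch_size, seq_len, vocab_size)
# The targets are the inputs shifted by one position (the "next" tokens)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# cross_entropy expects (N, C) logits and (N,) targets, so flatten batch & sequence
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(loss)  # high (~11 here) — random logits give the correct token ~1/50257 probability
```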

Perplexity is a metric often used together with cross-entropy loss to evaluate model performance in tasks like language modeling. It reflects the model's uncertainty in predicting the next token, with lower perplexity indicating better performance. Perplexity is calculated as torch.exp(loss). For example, a perplexity of 48,725 means the model is effectively uncertain about which of roughly 48,725 tokens in the vocabulary to generate next. This makes perplexity a more interpretable measure than raw loss, as it corresponds to the effective vocabulary size over which the model is uncertain.
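A tiny sketch of that calculation (the loss value is illustrative):

```python
import torch

loss = torch.tensor(10.794)   # illustrative cross-entropy loss of an untrained model
perplexity = torch.exp(loss)
print(perplexity)             # ~48,700: as uncertain as a uniform pick among ~48,700 tokens
```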


This is an interesting fact from the book:
"The cost of pretraining LLMs
To put the scale of our project into perspective, consider the training of the 7 billion parameter Llama 2 model, a relatively popular openly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs, processing 2 trillion tokens. At the time of writing, running an 8xA100 cloud server on AWS costs around $30 per hour. A rough estimate puts the total training cost of such an LLM at around $690,000 (calculated as 184,320 hours divided by 8, then multiplied by $30)." - wow

Starting to train ~ we can see the model output as training progresses. We give it "Every effort moves you", and by the 10th (and last) epoch the continuation starts to look like normal text.
A reminder that the training text is "The Verdict", a short story by Edith Wharton. The train loss starts at 10.058 and converges to 0.502.
It seems the model starts overfitting after the 2nd epoch, which is not that surprising given the small dataset.
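The training loop itself is nothing fancy — roughly this shape (my own minimal sketch, not the book's exact code):

```python
import torch
import torch.nn.functional as F

def train_simple(model, train_loader, optimizer, device, num_epochs):
    # Minimal sketch: one forward/backward pass per batch, average loss per epoch
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0.0
        for input_ids, target_ids in train_loader:
            input_ids, target_ids = input_ids.to(device), target_ids.to(device)
            optimizer.zero_grad()
            logits = model(input_ids)                       # (batch, seq_len, vocab_size)
            loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
            loss.backward()                                 # backpropagation
            optimizer.step()                                # weight update
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}: train loss {total_loss / len(train_loader):.3f}")
```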

Sampling strategies

Temperature - the lower the temperature, the 'less creative' (more deterministic) the model's output is
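A toy sketch of how that works (made-up logits):

```python
import torch

torch.manual_seed(123)
logits = torch.tensor([4.0, 2.0, 1.0])   # toy next-token logits

def sample_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution (more deterministic),
    # higher temperature flattens it (more random / "creative")
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

print(sample_with_temperature(logits, temperature=0.5))   # almost always token 0
print(sample_with_temperature(logits, temperature=2.0))   # more varied picks
```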


Top-k - sampling only from the k tokens with the highest logits when generating
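And a toy sketch of the top-k filtering step (again with made-up logits):

```python
import torch

logits = torch.tensor([4.0, 2.0, 1.0, 0.5, -1.0])   # toy next-token logits
k = 3

# Keep only the k highest logits; everything else gets -inf so its
# probability becomes zero after softmax
top_logits, _ = torch.topk(logits, k)
filtered = torch.where(logits < top_logits[-1], torch.tensor(float("-inf")), logits)
probs = torch.softmax(filtered, dim=-1)
print(probs)   # only the top-3 tokens have non-zero probability
```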


On another note ~ I added new clips to the Glaswegian dataset. 

We are now at ~1,000 clips and a total of 86 minutes of audio. Here is the link to the dataset on Hugging Face.

Another note ~ a lab mate told me about a banking AI competition.

The bank is Kookmin Bank (국민은행 in Korean); the competition started yesterday and the submission deadline is the 11th of August. We would have to develop some kind of project related to helping customers/workers using AI. We will discuss it more in the coming days/weeks. I am excited for it. Here is a link to the competition (it is in Korean).


That is all for today!

See you tomorrow :)
