[Day 66] Starting Build a LLM from scratch by Sebastian Raschka

 Hello :)
Today is Day 66!


A quick summary of today: I covered the chapters below from the book in the title:
  • Chapter 1: Understanding LLMs
  • Chapter 2: Working with text data


I am in a Discord server called DS/ML book club. One of the organisers is Sophia Yang from MistralAI, and I saw her post that the server is reading Build a Large Language Model (From Scratch) by Sebastian Raschka (it is still being written, but half of the chapters are available for purchase). Given that I just completed a very comprehensive NLP course by Stanford Uni, this felt like a good next stop to test my knowledge and understanding of language models and how they work.

So, below is an intro to the book, quick summary of chapter 1, and then a bit more in depth for chapter 2.


Chapter 2 (available upon purchase) is perhaps the more interesting part.

The book goes over the life of an LLM.

The content follows the outline below, but not all of it is written yet.

  1. Understanding LLMs
  2. Working with text data
  3. Coding attention mechanisms
  4. Implementing a GPT model from scratch to generate text
  5. Pretraining on unlabeled data (not published yet)
  6. Finetuning for text classification (not published yet)
  7. Finetuning with human feedback to follow instructions (not published yet)
  8. Using LLMs in practice (not published yet)


Today I covered chapters 1 and 2. Chapter 1 was just a quick introduction, and in chapter 2 I built a GPT tokenizer (similar to Andrej Karpathy's tutorial).


Chapter 1 - Understanding LLMs

This was mainly an introduction to LLMs, setting the stage. It covers:
  • The Transformer architecture
  • Types of prompting for LLMs
  • What kind of data was used for GPT-3
  • How text generation works

Chapter 2 - Working with Text Data

Training LLMs on huge amounts of text with nothing but next-word prediction can give surprisingly good results, but we can achieve more by finetuning them.

2.1 Understanding word embeddings

DNNs cannot process raw text as is, so we need a way to turn the text into numbers. The process of turning raw text/video/audio into continuous-valued vectors (a format that DNNs can work with) is called embedding.

One of the earliest algorithms that got popular is word2vec. The main idea behind it is that words that appear in similar contexts tend to have similar meanings too.

In the book's example, the word embeddings have only 2 dimensions, but the number of dimensions is a hyperparameter that can be tuned.

2.2 Tokenizing text

Splitting the input text into individual tokens is a crucial step in creating embeddings for an LLM. These tokens can be words, subwords, characters, punctuation, or special characters.
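A minimal sketch of word-and-punctuation tokenization, assuming a simple regex splitter. The function name `tokenize` and the sample sentence are my own for illustration; tokenizers used by real LLMs work on subwords instead.

```python
import re

def tokenize(text):
    # Split on punctuation, double dashes, and whitespace.
    # The capturing group keeps the delimiters in the result.
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Drop empty strings and pure-whitespace pieces.
    return [p.strip() for p in pieces if p.strip()]

tokens = tokenize("Hello, world. Is this-- a test?")
print(tokens)
```

Punctuation ends up as its own token here, which keeps the vocabulary smaller than treating "world." and "world" as different words.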

2.3 Converting tokens into token IDs

To map these tokens to token IDs, we first need to build a vocabulary that tells us how to map each unique word/character to a unique integer.

With a simple tokenizer built on top of this vocabulary, we can encode and decode a piece of text.
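Here is a rough sketch of what such a tokenizer could look like. It is not the exact code from the book: the class name `SimpleTokenizer`, the regex, and the tiny corpus are assumptions I made for illustration.

```python
import re

class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab                               # token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}    # ID -> token

    @staticmethod
    def split(text):
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        return [p.strip() for p in pieces if p.strip()]

    def encode(self, text):
        # Raises KeyError for tokens that are not in the vocabulary.
        return [self.str_to_int[t] for t in self.split(text)]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() put in front of punctuation.
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

# Build the vocabulary: every unique token gets a unique integer.
corpus = "The cat sat. The dog sat, too."
unique_tokens = sorted(set(SimpleTokenizer.split(corpus)))
vocab = {tok: i for i, tok in enumerate(unique_tokens)}

tokenizer = SimpleTokenizer(vocab)
ids = tokenizer.encode("The dog sat.")
print(ids)
print(tokenizer.decode(ids))
```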

2.4 Adding special context tokens

If we build the vocabulary from one piece of text and then try to encode a new text containing a word that was not in it, we get an error, because the tokenizer has no ID for that word. To get around this, we can add special tokens like <|unk|> to stand in for unknown words (there are others like <|bos|>, <|eos|>, <|endoftext|>, <|pad|>).
However, the GPT tokenizer does not use the unk token. Instead, it uses a byte pair encoding tokenizer, which breaks unknown words down into subwords.
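A small sketch of the <|unk|> idea, assuming a dictionary-based vocabulary. The helper names `split` and `encode` and the toy corpus are hypothetical, just to show the fallback.

```python
import re

def split(text):
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in pieces if p.strip()]

corpus = "The cat sat. The dog sat, too."
tokens = sorted(set(split(corpus)))
# Append the special tokens at the end of the vocabulary.
tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {tok: i for i, tok in enumerate(tokens)}

def encode(text, vocab):
    unk_id = vocab["<|unk|>"]
    # Any token missing from the vocabulary maps to the <|unk|> ID
    # instead of raising a KeyError.
    return [vocab.get(tok, unk_id) for tok in split(text)]

ids = encode("The bird sat.", vocab)
print(ids)  # "bird" was never seen, so it gets the <|unk|> ID
```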

2.5 Byte pair encoding

BPE (a more sophisticated tokenization method) is the tokenizer behind GPT-2, GPT-3, and the original ChatGPT model. (I looked into how it works in depth on Day 55 with Andrej Karpathy's tutorial.)
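To remind myself of the core idea, here is a toy byte-level BPE sketch: repeatedly find the most frequent adjacent pair and replace it with a new ID. This is only an illustration under my own assumptions (function names, 3 merges, a tiny string), not the actual GPT-2 tokenizer, which adds pre-splitting rules and a much larger merge table.

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count every adjacent pair of IDs and return the most common one.
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(ids, pair, new_id):
    # Replace each non-overlapping occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def decode(ids, merges):
    # Rebuild the byte string for every ID, in the order merges were learned.
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # start from raw bytes (IDs 0..255)
merges = {}
for step in range(3):
    pair = most_frequent_pair(ids)
    new_id = 256 + step
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(len(text), "->", len(ids), "tokens")
print(decode(ids, merges))
```

Because the starting vocabulary is all 256 byte values, any input can be encoded, which is why no <|unk|> token is needed.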

2.6 Data sampling with a sliding window

Now that the corpus is tokenized into integer token IDs, input-target pairs can be created for the LLM.

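In the book's figure, the input is shown in blue and the target in red: the target is the input shifted one token to the right. Sliding a window over the token IDs gives us many such pairs.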
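A minimal sketch of the sliding window, using plain lists instead of a PyTorch dataset. The function name `sliding_windows` and the parameter names `context_size` and `stride` are my own choices here.

```python
def sliding_windows(token_ids, context_size, stride):
    pairs = []
    # Stop early enough that the target window still fits in the data.
    for start in range(0, len(token_ids) - context_size, stride):
        inputs = token_ids[start : start + context_size]
        # The target is the same window shifted one position to the right.
        targets = token_ids[start + 1 : start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(10, 20))  # stand-in for real token IDs
pairs = sliding_windows(token_ids, context_size=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With a stride equal to the context size the windows do not overlap; a smaller stride produces more (overlapping) training examples from the same text.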

2.7 Creating token embeddings

The last step is to convert the token IDs into embedding vectors. 

Token IDs are discrete integers, so transforming them into continuous vectors (embeddings) is necessary for training with backprop: the embedding weights are what get updated.

Also, thanks to a side note in the book, I learned that embedding layers do the same thing as linear layers on one-hot encoded representations, but more efficiently. They achieve the same result: given an index, look up the corresponding row in a matrix and return it.
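To convince myself of that equivalence, here is a small pure-Python sketch (no PyTorch) with a randomly initialized weight matrix. The sizes and function names are made up for the example.

```python
import random

random.seed(0)
vocab_size, embed_dim = 6, 3
# One row of weights per token ID, as in an embedding layer.
weights = [[random.uniform(-1, 1) for _ in range(embed_dim)]
           for _ in range(vocab_size)]

def embedding_lookup(token_id):
    # Embedding layer: just pick the row at index token_id.
    return weights[token_id]

def one_hot_matmul(token_id):
    # Linear layer view: multiply a one-hot row vector by the weight matrix.
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * weights[i][j] for i in range(vocab_size))
            for j in range(embed_dim)]

print(embedding_lookup(4))
print(one_hot_matmul(4))
```

The matrix multiplication does `vocab_size * embed_dim` multiplications, nearly all of them by zero, while the lookup is a single index operation.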

2.8 Encoding word positions

In principle, token embeddings are a suitable input for an LLM. The catch is that self-attention in Transformers has no notion of position or order, and neither do these embeddings: if a token appears twice or more in a sentence, its embedding vector will be the same each time.

To make the embeddings 'position-aware', there are two categories: absolute positional embeddings and relative positional embeddings.
'Absolute positional embeddings are directly associated with specific positions in a sequence. For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location.'
'...the emphasis of relative positional embeddings is on the relative position or distance between tokens. This means the model learns the relationships in terms of "how far apart" rather than "at which exact position."'
The advantage of the relative approach is that the model can generalize better to sequences of different lengths.
Both work, and the choice often depends on the type of application being built.
OpenAI's GPT models use absolute positional embeddings that are optimized during training, and are not fixed or predefined like in the original Transformer.
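A minimal sketch of absolute positional embeddings in pure Python: one lookup table for tokens, one for positions, and the two vectors are added. The sizes are tiny and the weights are random stand-ins; in a real model both tables would be learned during training.

```python
import random

random.seed(0)
vocab_size, context_len, embed_dim = 10, 4, 3

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

tok_emb = rand_matrix(vocab_size, embed_dim)   # one row per token ID
pos_emb = rand_matrix(context_len, embed_dim)  # one row per position

def embed(token_ids):
    # Input embedding = token embedding + embedding of its position.
    return [[t + p for t, p in zip(tok_emb[tok_id], pos_emb[pos])]
            for pos, tok_id in enumerate(token_ids)]

out = embed([7, 2, 7, 5])
# Token 7 appears at positions 0 and 2 and now gets two different vectors.
print(out[0])
print(out[2])
```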

Here is a link to a colab where I practiced following the book.


A side note about something different. Yesterday I learned about probing from CS224N, but I was not exactly sure what it does. I also saw it in the papers Learning Transferable Visual Models From Natural Language Supervision and An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, and wanted to understand what it is about. I found this YouTube video and a John Hewitt blog post that explain it pretty well.

Probing is basically testing our models to see what kind of features are learned at some stage of the model. By examining the internal representations of the model, researchers can gain insights into how the model processes the input data and check whether the features we are interested in are encoded there.


That is all for today!

See you tomorrow :)
