[Day 52] Learning more about transformers with Andrej Karpathy

 Hello :)
Today is Day 52!


A quick summary of today:


Firstly, I will mention what I learned from the GPT from scratch tutorial.

Andrej Karpathy's videos introduced me to PyTorch for the first time. In the first video I saw, which was about building a neural network from scratch, I kind of got what was happening, but as he went deeper and deeper and started writing code at PyTorch's level (building a neural network at a lower level compared to TensorFlow), I felt like I got slapped in the face hahaha. TF is much higher level than PyTorch, and at first it felt weird to have to write the whole dataset creation, model, training, eval, etc. by myself. But with some practice it grew on me, and now PyTorch feels more comfortable than TF; it feels more 'down-to-earth' ('down-to-python') haha. 

So, on Day 46 I really tried to understand and learn transformers through KAIST professor Choi's lecture, and it was incredibly helpful. Today I was reminded that Andrej Karpathy has a lecture/video about building GPT from scratch, and I can use it to confirm whether I understand transformers, and even see an implementation from scratch from the man himself - Andrej Karpathy.

So, here are some of my notes from the GPT from scratch video. 

In the transformer there is a mathematical trick that makes self-attention efficient.

We can do it with a for loop, but this is inefficient. 

When we do self-attention, we want every token to attend to itself and to the tokens before it. A basic way to do it is to use for loops to average over the past context, as in the pic below.
 
In this pic, what we do is: in xbow, the 1st row is an avg of the 1st row in x, then the 2nd row is an avg of the first two rows (tokens), etc. But this is not what is used in transformers, because it is inefficient. 
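To make this concrete, here is a rough sketch of that for-loop version (my reconstruction in the spirit of the video; B, T, C are the usual batch/time/channels):

import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2                     # batch, time (number of tokens), channels
x = torch.randn(B, T, C)

# xbow[b, t] = mean of x[b, 0], ..., x[b, t]  ("bag of words" over the past)
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]            # (t+1, C): the token itself plus everything before it
        xbow[b, t] = torch.mean(xprev, dim=0)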

Instead,
in PyTorch there is the torch.tril method that creates a lower triangular matrix of a given size, (T, T) in our case (like var a in the picture),
and we can also normalize each row of that matrix, which is what we have in the printed a. With it we can get the same result as with the for loop, but by using matrix multiplication, which is more efficient. 
Here we create xbow2, which is the result of the matrix multiplication,
and when we compare xbow (from the for loop) with xbow2 (from the matrix multiplication method), we get the same matrices, as below. 
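Continuing the sketch above, the matrix-multiplication version could look roughly like this:

# lower triangular matrix whose rows are normalized to sum to 1
a = torch.tril(torch.ones(T, T))
a = a / a.sum(dim=1, keepdim=True)

xbow2 = a @ x                         # (T, T) @ (B, T, C) -> (B, T, C), broadcast over the batch
print(torch.allclose(xbow, xbow2))    # expected: True (same result as the for loop)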
What we want to do next is apply softmax, but before that, rather than 0s in the upper right triangle of the matrix, we want to fill it with negative infinities.
This is because when we pass -inf to softmax, it will give 0, which is what we want. 
Btw, in the first row, where we have one value and the rest is -inf, that row corresponds to the 1st token in the sequence, and we mask the future because we don't want the model to cheat and look into the future when it predicts; we reveal new info one token at a time. Then the 2nd row covers the 1st and 2nd tokens, etc. This is why we care about masking the upper right triangle of the matrix: to stop the model from cheating by looking at future tokens. We want it to predict the next token based only on the currently available information. 
Going back to softmax, after we have the above and we apply softmax to the rows as they are, we get:
each row of the lower triangular part normalized. 
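In code, that masked softmax version (again, roughly following the video) looks like:

import torch.nn.functional as F

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))   # future positions get -inf
wei = F.softmax(wei, dim=-1)                      # -inf -> 0, each row sums to 1
xbow3 = wei @ x                                   # same weighted average as before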

But by doing it this way, with a plain average, we don't learn much about each word. In a sentence some words are more important than others, and some words 'look' for other words. For example, in 'Ivan is the guy who ate the sandwich', the word 'guy' probably places higher priority on 'Ivan'. We want some tokens to find other tokens more interesting. 
That is where self-attention comes in. Every token emits a query (Q) and a key (K): essentially, Q is 'what am I looking for' and K is 'what do I contain'. So we take the dot product of my Q with all the other Ks, and that dot product becomes our wei (weights).

These are the raw weights:
Then we mask the future tokens.
And after applying softmax, we can now tell how much info to aggregate from each of the tokens in the past.
For example, in the last row of wei[0] we can tell that the last token 'likes' the 3rd one, hence it gave it 0.22 of the weight. 
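Put together, a single self-attention head, roughly as it appears in the lecture (head_size = 16 as in the video; the layer names are mine), could be sketched like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

head_size = 16
key   = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)                              # (B, T, head_size): "what do I contain"
q = query(x)                            # (B, T, head_size): "what am I looking for"
wei = q @ k.transpose(-2, -1)           # (B, T, T): raw affinities between tokens

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))   # mask the future
wei = F.softmax(wei, dim=-1)                      # each row becomes a distribution over the past

v = value(x)                            # (B, T, head_size)
out = wei @ v                           # (B, T, head_size): aggregate values, weighted by attention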

So far so good. And here is the self-attention formula from the paper.
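For reference, the formula from 'Attention Is All You Need' is: Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V, where d_k is the dimension of the keys.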
We have softmax, Q, K and V in our code, but we do not yet have the division by sqrt(d_k), where d_k is the head size. 
Since we have unit Gaussian inputs, the variance of wei will be on the order of the head size (in the code it was 16), and indeed below the variance is 15.9.
If we divide by sqrt(head_size), the variance comes back to around 1.
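A quick sketch of that check (the exact numbers will vary a bit with the random seed):

import torch

torch.manual_seed(1337)
B, T, head_size = 4, 8, 16
k = torch.randn(B, T, head_size)        # unit Gaussian keys
q = torch.randn(B, T, head_size)        # unit Gaussian queries

wei = q @ k.transpose(-2, -1)
print(wei.var())                        # roughly head_size (~16)

wei_scaled = wei * head_size**-0.5      # divide by sqrt(head_size)
print(wei_scaled.var())                 # roughly 1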
But why is that important? Because if we give softmax very big values, it starts returning tensors where one value is very close to 1 and the others are close to 0. 
Below, we see in the 1st case that if we give softmax values that are close to zero, we get a fairly diffuse output.
But in the 2nd cell, if we skyrocket the values, the result is close to a one-hot encoding. And why is that a problem? Every node would aggregate info from one single other node, so not much would be learned, which matters especially at the beginning of training. 
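This is easy to see with a toy example (the numbers below are my own, just for illustration):

import torch
import torch.nn.functional as F

logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(F.softmax(logits, dim=-1))        # fairly diffuse distribution
print(F.softmax(logits * 8, dim=-1))    # much sharper, approaching one-hot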

I am glad I saw Andrej Karpathy's video today. Also I saw that he even uploaded a new video about GPT's tokenizer. Definitely something on my to-do list. 


Also, I rewatched his three videos below. There are two more as part of this 'makemore' series, which I will review later. 
I noticed that compared to the 1st time I watched them, I now understand what is going on haha. I think the main problem was that TensorFlow does a lot of things for the dev, whereas with PyTorch I have to write things myself. Rather than taking notes and thinking 'wait, why? where? how?', rewatching these felt like 'oh yes, yes, this and that', affirming my knowledge and understanding. Simply put, I got some confirmation that I am learning :)

Nevertheless, I took some pics of something interesting related to parameter initialization. 
In this pic, the first epoch's loss is 27, and then it drops down rapidly. At init, the network assigns some characters a much higher probability than others, so it is very confident about some characters and not about others, and the model ends up being very confidently wrong, which results in the high loss. This can be caused by the activations in the last layer or in the first layers. If we init correctly, we can help training be more efficient and avoid hockey-stick loss curves.
Also, in the output of the cell below we get 3.29, which is the negative log likelihood we should expect in our case with 27 equally likely characters: -log(1/27) ≈ 3.29. Such a big loss on the 1st epoch may indicate incorrect param init, and if we init correctly, the initial loss should be close to what we would expect. 
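A one-liner to check that expectation:

import torch
# expected loss at init if all 27 characters were equally likely
print(-torch.log(torch.tensor(1/27.)))   # tensor(3.2958)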

Also, with regard to vanishing gradients, he made these nice graphs:

The 1st one shows the activations' distributions across layers.
The 2nd one shows the gradients' distributions across layers. 
What we see is that the distributions change as we go through the model. 
In the forward pass' case, the model uses tanh activations, and tanh squashes the activations so they start reaching towards -1 and 1. If we zoom in on the pic, we can see that in the 1st layer (blue) we have a roughly normal distribution, but as we go deeper, the distributions become more saturated, which can result in vanishing gradients in the backward pass. 
In the backward pass' case, we can indeed see vanishing gradients: in the first layer to receive gradients (which in this case is the last layer, layer 4), the gradients have a normal distribution, but as we go backwards through the layers, the distributions become squashed down and more saturated.
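As a toy illustration (my own sketch, not the code from the video): stacking tanh layers with a naively large weight init pushes most activations into the flat tails of tanh, which is exactly where gradients die.

import torch

torch.manual_seed(0)
h = torch.randn(1000, 100)                 # unit Gaussian inputs
for layer in range(4):
    W = torch.randn(100, 100)              # naive init; std 1 is far too large here
    h = torch.tanh(h @ W)
    saturated = (h.abs() > 0.97).float().mean().item()
    # where |tanh| is near 1, the local gradient (1 - tanh^2) is near 0 -> gradients vanish
    print(f"layer {layer}: fraction of saturated tanh outputs = {saturated:.2f}")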

Actually, I started watching the 'become a backprop ninja' video by Andrej, and in it he mentioned an article, 'Yes you should understand backprop', where I saw the word 'eigenvalue', which was kind of cool. I had learned about eigenvalues and how to calculate them before, and now I saw them come up in the context of matrices here. 

I will need to refer to this video/his code later when I want to build some kind of GPT myself.

Another note: I found two courses that have videos/notes + assignments:
Definitely, I need to do these 2 courses.

That is all for today!
See you tomorrow :)
