[Day 55] Learning about tokenization in LLMs
Hello :)
Today is Day 55!
A quick summary of today:
- continuing on the Andrej Karpathy streak from the last few days, today I watched his latest video about tokenization in LLMs
- I used a pokemon name dataset on yesterday's MLP code (with the manual backprop) and posted it on Kaggle for easier access
So... tokenization in language models.
Using this nice website we can see an example below.
Using the above picture as reference, tokenization is how we turn a piece of text into a sequence of numbers. In the above pic, each coloured piece is a different token from the GPT-2 tokenizer, and on the right is the number (token id) for each piece.
There are different ways to turn a piece of text into numbers, and one of the common ones is UTF-8 encoding, which turns each character into 1 to 4 bytes, so there are only 256 possible base byte values.
As seen in the example pic, using UTF-8 we can encode this piece of text.
In addition to UTF-8, there are UTF-16 and UTF-32, but apparently those encodings 'overshoot' and are wasteful for our task. As we can see in the pic below, a bunch of 0s get added for simple (e.g. English) text, which is unnecessary.
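To make the byte picture concrete, here is a small sketch using only Python's built-in `str.encode`. The sample string and the helper name `byte_report` are my own choices for illustration, not what the picture shows.

```python
def byte_report(text):
    """Return {encoding: (number of bytes, number of zero bytes)} for `text`."""
    report = {}
    # The "-le" variants skip the byte order mark, so only the text itself is counted.
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = text.encode(encoding)
        report[encoding] = (len(raw), raw.count(0))
    return report

sample = "hi 안녕"  # sample string chosen for illustration
for encoding, (n_bytes, n_zeros) in byte_report(sample).items():
    print(f"{encoding:10s} {n_bytes:3d} bytes, {n_zeros:2d} of them zero")
```

For mostly-ASCII text the wider encodings pad every character with zero bytes, which is the 'overshoot' mentioned above.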
Ok, we can use UTF-8, but using the raw bytes directly results in very long sequences, and transformers can only attend over a limited context length (for computational reasons), so we want to squash these long sequences into something shorter. Instead of feeding in the raw bytes, we can use byte pair encoding (BPE), which repeatedly replaces the most frequent pair of adjacent tokens with a new token.
There is also the compression ratio, which measures by how much we reduced the length of our text: the number of raw bytes divided by the number of tokens after the merges.
In the GPT-2 paper, they mention that the same word ends up as several different tokens when it sits next to punctuation or other special symbols (e.g. "dog.", "dog!", "dog?"). So what they do is use a regex pattern, as below, to split the text into chunks first so that merges never cross chunk boundaries.
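A minimal sketch of the merge loop and the compression ratio, assuming a byte-level BPE that starts from the 256 base byte values. The toy text, the number of merges and the helper names are my own choices, so treat it as an illustration rather than the exact code from the video.

```python
def get_stats(ids):
    """Count how often each adjacent pair of ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "the cat sat on the mat. the cat sat on the hat."  # toy text (assumption)
raw = list(text.encode("utf-8"))
ids = list(raw)
num_merges = 10  # arbitrary choice for the demo

for step in range(num_merges):
    stats = get_stats(ids)
    if not stats:  # nothing left to merge
        break
    pair = max(stats, key=stats.get)    # most frequent adjacent pair
    ids = merge(ids, pair, 256 + step)  # new ids start after the 256 byte values

ratio = len(raw) / len(ids)
print(f"{len(raw)} bytes -> {len(ids)} tokens, compression ratio {ratio:.2f}x")
```

More merges mean shorter sequences but a bigger vocabulary, so the number of merges is a hyperparameter to tune.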
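Here is a rough sketch of that pre-splitting step. The real GPT-2 pattern relies on `\p{L}` and `\p{N}`, which need the third-party `regex` module, so this version is only an approximation built with the standard `re` module; `APPROX_SPLIT` and `pre_split` are names I made up.

```python
import re

# GPT-2's pattern (needs the third-party `regex` module, shown for reference only):
#   's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
# Stdlib approximation: [^\W\d_] stands in for letters and \d for digits.
APPROX_SPLIT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"  # common English contractions
    r"| ?[^\W\d_]+"             # optional space + run of letters
    r"| ?\d+"                   # optional space + run of digits
    r"| ?(?:[^\s\w]|_)+"        # optional space + run of punctuation/symbols
    r"|\s+(?!\S)|\s+"           # whitespace runs
)

def pre_split(text):
    """Split text into chunks; BPE merges would then run inside each chunk only."""
    return APPROX_SPLIT.findall(text)

print(pre_split("dog. dog! dog?"))
# ['dog', '.', ' dog', '!', ' dog', '?']
```

Because letters and punctuation land in separate chunks, "dog" can never be merged with the "." or "?" that follows it.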
Special tokens
Special tokens are used to add structure to token streams. Popular ones are:
<|endoftext|> <|eos|> <|pad|> <|bos|> <|eol|> <|math|> <|doc|> <|reward|>
For example, <|endoftext|> tells the model that the text that comes after it has no relation to the text before it. But if we pass them to ChatGPT they trigger weird behaviour.
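A hedged sketch of how a tokenizer might treat special tokens differently from ordinary text. The id 50256 is the one GPT-2 uses for <|endoftext|>; everything else (the `SPECIAL_TOKENS` registry, the `encode_with_specials` function, using raw UTF-8 bytes in place of real BPE) is a simplification I made up, not how any particular product works.

```python
import re

# Hypothetical registry; 50256 is GPT-2's id for <|endoftext|>.
SPECIAL_TOKENS = {"<|endoftext|>": 50256}

def encode_with_specials(text, allow_special=False):
    """Encode text as raw UTF-8 bytes, optionally mapping special tokens to one id."""
    if not allow_special:
        # The special string is treated like any other text: many ordinary ids.
        return list(text.encode("utf-8"))
    pattern = "(" + "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
    ids = []
    for chunk in re.split(pattern, text):  # the capture group keeps the special tokens
        if chunk in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[chunk])
        else:
            ids.extend(chunk.encode("utf-8"))
    return ids

print(encode_with_specials("hi<|endoftext|>yo", allow_special=True))
print(len(encode_with_specials("hi<|endoftext|>yo")))
```

Whether user-typed text is allowed to produce the single special id is a design choice of the application, which might be part of why pasting these strings into a chat behaves oddly.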
Actually, this last token, <|reward|>, I found in this presentation from Andrej Karpathy on Microsoft open dev day.
I passed it to ChatGPT and it does not see it, and it also does not react to any of the other special tokens. In other, more complicated prompts it will ignore them, which may cause other weird behaviour (which is interesting, and kind of funny: can ChatGPT be 'broken' by using some complicated prompt with these special tokens?).
After the lecture finished, I went to my personal Colab notebook and started playing around with tokenization on random text. I also did one of the tests from Andrej Karpathy's GitHub, which runs the Wikipedia BPE example and checks that we get the correct output representation.
And indeed we get the right answer.
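For reference, the Wikipedia BPE example can be reproduced with a small self-contained sketch like the one below. It follows the same idea as the test (train 3 merges on "aaabdaaabac", then encode and decode), but the function names and details are my own reconstruction, not a copy of the repo's code.

```python
def get_stats(ids):
    """Count adjacent pairs of ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace each occurrence of `pair` with `new_id`, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` entries."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)  # ties go to the pair seen first
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing mergeable is left
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    """Rebuild the byte string for each id, then decode it as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order == training order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

text = "aaabdaaabac"
merges = train(text, 256 + 3)  # 256 base bytes + 3 merges
ids = encode(text, merges)
print(ids)                  # [258, 100, 258, 97, 99]
print(decode(ids, merges))  # aaabdaaabac
```

Because ties are broken by first occurrence, the intermediate merges can differ from the ones in the Wikipedia article, but the final sequence has the same "XdXac" shape.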
Andrej Karpathy also gave some questions to which tokenization was the answer, and I took note of his answers.
- Why can't LLM spell words? **Tokenization**.
  - Some tokens are very long, and if such a token is given as input (e.g. .DefaultCellStyle), GPT gets confused because it sees one token rather than the individual letters.
- Why can't LLM do super simple string processing tasks like reversing a string? **Tokenization**.
  - Same as above.
- Why is LLM worse at non-English languages (e.g. Japanese)? **Tokenization**.
  - Less training data, and the tokenizer is also not trained well on languages other than English (안녕하세요 is 3 tokens, Hello is 1 token).
- Why is LLM bad at simple arithmetic? **Tokenization**.
  - Numbers are split quite arbitrarily. For example, one four-digit number may be split like 3 and 215, while another is split like 32 and 15.
- Why did GPT-2 have more than necessary trouble coding in Python? Partly **Tokenization**.
  - The way spaces (e.g. Python indentation) are tokenized: in GPT-2 each space was its own token, which wastes context.
- What is this weird warning I get about a "trailing whitespace"? **Tokenization**.
  - The model has rarely seen a space token by itself, so when doing text completion we can get the warning if we leave whitespace after the given text and ":".
- Why does the LLM break if I ask it about "SolidGoldMagikarp"? **Tokenization**.
  - https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
- Why should I prefer to use YAML over JSON with LLMs? **Tokenization**.
  - YAML is more efficient in tokens than JSON.
- What is the real root of suffering? **Tokenization**. xD
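Two of the points above can be roughly illustrated without a real tokenizer by looking at raw sizes before any merges. This is only a proxy (actual token counts depend on the tokenizer), and the record and the hand-written YAML string are made-up examples.

```python
import json

def utf8_len(text):
    """Number of UTF-8 bytes, i.e. the sequence length before any BPE merges."""
    return len(text.encode("utf-8"))

# Non-English text starts from more bytes, so it needs more merges to get short.
for word in ("Hello", "안녕하세요"):
    print(f"{word}: {len(word)} characters, {utf8_len(word)} bytes")

# The same (hypothetical) record in JSON and in hand-written YAML.
record = {"name": "pikachu", "type": "electric", "level": 25}
as_json = json.dumps(record)
as_yaml = "name: pikachu\ntype: electric\nlevel: 25"
print(f"JSON: {len(as_json)} characters, YAML: {len(as_yaml)} characters")
```

JSON spends characters on quotes, braces and commas, which is where the extra tokens tend to come from.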
Now, for the pokemon name generator based on the manual backprop code.
I did not change much from the original code; the main change, as I mentioned yesterday, is that I added notes while doing the backprop so that I can fully understand what is happening. Today I decided to adapt it to a new dataset (Andrej used human names, I decided to use pokemon names) and also play around with the model's hyperparams to see how it reacts.
The full notebook (which is not different from the google colab link I shared yesterday) is here.
But the most interesting part is the generated names:
- libico abesawing cabezllish
- kirzons argola drapiad
- xutdi swirlixs toldan
- simiroty ledics bruxishoinpaul
- ferrorua nidoran♂ hicmy
- hippoono typcanth skorudi
- koffing charmer roserire
- tornu venonat girdreec
- derno herdeer beynode
- snobbull eevile wlabin
- scazarill slowperno savimzar
- cramsampardos electakron
Not bad haha. This is using a context length of 5, an embedding size of 10 and a hidden layer with 128 units. (full info in the Kaggle link)
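To show what a context length of 5 means for the training data, here is a plain-Python sketch of how (context, next character) pairs can be built from names. The function name, the tiny name list and the use of "." as the start/end marker are assumptions for illustration (and "." would clash with any name that contains a dot); the real notebook works with tensors.

```python
def build_dataset(names, block_size=5):
    """Turn names into (context of `block_size` ids, next id) training pairs."""
    chars = sorted(set("".join(names)))
    stoi = {ch: i + 1 for i, ch in enumerate(chars)}
    stoi["."] = 0  # id 0 marks both padding at the start and the end of a name
    X, Y = [], []
    for name in names:
        context = [0] * block_size         # start with an all-padding context
        for ch in name + ".":
            ix = stoi[ch]
            X.append(list(context))        # the 5 previous characters
            Y.append(ix)                   # the character to predict
            context = context[1:] + [ix]   # slide the window by one
    return X, Y, stoi

names = ["koffing", "eevee"]  # toy sample, not the full dataset
X, Y, stoi = build_dataset(names)
itos = {i: ch for ch, i in stoi.items()}
for context, target in list(zip(X, Y))[:4]:
    print("".join(itos[i] for i in context), "->", itos[target])
```

A longer context gives the model more characters to condition on, at the cost of a larger input to the first layer (block_size times the embedding size).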
That is all for today!
See you tomorrow :)