[Day 55] Learning about tokenization in LLMs

[Image: example of a piece of text split into GPT-2 tokens, with each token's number shown on the right]
 Hello :) Today is Day 55! A quick summary of today: continuing the Andrej Karpathy streak from the last few days, today I watched his latest video, about tokenization in LLMs. I also used a Pokemon name dataset on yesterday's MLP code (with the manual backprop) and posted it on Kaggle for easier access.

So... tokenization in language models. Using this nice website we can see an example below.

Using the above picture as reference, tokenization is how we turn a piece of text into a representation through numbers. In the pic, each coloured piece is a different token from GPT-2's vocabulary, and on the right is each token's number. There are different ways to turn a piece of text into numbers, and one of the common ones is UTF-8 encoding, which gives a base vocabulary of 256 possible byte values. As seen in the example pic, using UTF-8 we can encode this piece of text. In addition to UTF-8, there are UTF-16 and UTF-32, but apparently those encodings 'overshoot' and are too much for our task. And as we can see in t
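To make the UTF-8 point concrete, here is a minimal Python sketch of my own (not code from the video; the sample string is just something I picked for illustration) showing how text becomes bytes, and why UTF-16/UTF-32 overshoot:

# Minimal sketch of byte-level text encoding, assuming Python 3.
text = "hello 안녕"  # hypothetical sample string

# UTF-8 turns any string into a sequence of bytes, i.e. integers in 0..255,
# so the base vocabulary for a byte-level tokenizer is just 256 tokens.
utf8_tokens = list(text.encode("utf-8"))
print(utf8_tokens)                 # ASCII chars -> 1 byte each, Korean chars -> 3 bytes each
print(len(utf8_tokens))            # 12

# UTF-16 and UTF-32 'overshoot': the same text takes far more bytes,
# many of them zeros for the ASCII part, wasting sequence length.
print(len(text.encode("utf-16")))  # 18 (includes a 2-byte BOM)
print(len(text.encode("utf-32")))  # 36 (includes a 4-byte BOM)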