New (2h13m 😅) lecture: "Let's build the GPT Tokenizer" Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why, ideally, someone out there finds a way to delete this stage entirely.
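For intuition, here is a minimal sketch of byte-level BPE in Python. It is toy code under simplifying assumptions (no regex pre-splitting, no special tokens), not the actual GPT tokenizer, and the function names are illustrative. "Training" repeatedly counts the most frequent adjacent pair of token ids and replaces it with a new id until the target vocab size is reached; encode() replays the learned merges on new text, and decode() maps ids back to bytes.

```python
def get_pair_counts(ids):
    # count occurrences of each adjacent pair of token ids
    counts = {}
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` with the single token `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    ids = list(text.encode("utf-8"))       # start from raw bytes (ids 0..255)
    merges = {}                             # (id, id) -> new token id, in learned order
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break                           # nothing left to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        counts = get_pair_counts(ids)
        # apply the earliest-learned merge that is present in the sequence
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    # rebuild the byte string for each token; merges dict preserves insertion order
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges = train("aaabdaaabac", 300)
ids = encode("aaabdaaabac", merges)
assert decode(ids, merges) == "aaabdaaabac"
```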
@karpathy Why do we call this "training" a tokenizer? My understanding is that there are no parameters being learned. Unless we consider the vocab at each time step to be the parameters.
@karpathy This will land on my "I saw these lectures about LLMs and..."-list of links for sure. Thanks for doing this Andrej. You're an awesome dude.