Input text: "The transformer model processes"
Tokenization
Text is broken into smaller units called tokens that the model can process.
What is a Token?
Tokens can be full words, subwords, or even punctuation. For example:
"empowers" → ["em", "powers"]
Vocabulary
GPT-2 uses a fixed vocabulary of 50,257 tokens.
This vocabulary is learned once, by running byte-pair encoding over a training corpus, and then reused unchanged for every input.
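In practice, a fixed vocabulary acts as a lookup table from token strings to integer IDs. The fragment below is a sketch with made-up entries and IDs; GPT-2's real table has 50,257 entries with different assignments.

```python
# Hypothetical fragment of a token-to-ID table; real GPT-2 IDs differ.
# Note the leading spaces: GPT-2 folds word boundaries into its tokens.
vocab = {"The": 0, " transformer": 1, " model": 2, " processes": 3}
VOCAB_SIZE = 50_257  # size of GPT-2's actual vocabulary

# The same table is reused for every input sequence.
ids = [vocab[t] for t in ["The", " transformer", " model", " processes"]]
print(ids)  # → [0, 1, 2, 3]
```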
Why Tokenization Matters
Tokens are the bridge between raw text and math: they allow language to be converted into vectors that neural networks can process.
Next Step
Each token will be mapped to an ID, which is then used to retrieve its embedding vector.
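That ID-to-vector step can be sketched as plain row indexing into an embedding table. The dimensions below are toy values (GPT-2 small actually uses 768-dimensional embeddings), and the random initialization stands in for weights that are learned during training.

```python
import random

VOCAB_SIZE = 50_257  # GPT-2's vocabulary size
EMBED_DIM = 8        # toy dimension; GPT-2 small uses 768

random.seed(0)
# One vector per vocabulary entry; real values are learned, not random.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

token_id = 1234  # hypothetical token ID
vector = embedding_table[token_id]  # embedding lookup is just row indexing
print(len(vector))  # → 8
```

Because lookup is indexing, every occurrence of the same token ID retrieves the identical vector before any position information is added.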