Tokenization
What
Splitting text into tokens — the atomic units a model actually processes. Every NLP model starts here. The choice of tokenizer determines what the model can see and how efficiently it represents text.
Levels of tokenization
| Level | Example (“unhappiness”) | Pros | Cons |
|---|---|---|---|
| Character | u, n, h, a, p, p, i, n, e, s, s | Tiny vocabulary, handles anything | Very long sequences, no word meaning |
| Word | unhappiness | Intuitive | Huge vocabulary, can’t handle unseen words |
| Subword | un, happiness | Balanced vocab/length | Requires training the tokenizer |
Subword tokenization won. Every modern model uses it.
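To make the length tradeoff concrete, a toy comparison of the three levels (the subword split here is illustrative, not produced by a real tokenizer):

```python
text = "unhappiness is temporary"

chars = list(text.replace(" ", ""))    # character-level: one token per letter
words = text.split()                   # word-level: one token per word
subwords = ["un", "happiness", "is", "tempor", "ary"]  # hypothetical subword split

print(len(chars), len(words), len(subwords))  # 22 3 5
```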
BPE (Byte-Pair Encoding)
Used by the GPT family. Start with individual characters (GPT-2 and later use raw bytes as the base alphabet), then repeatedly merge the most frequent adjacent pair into a new token.
```
Step 0: ['l', 'o', 'w', 'e', 'r']
Step 1: merge 'l'+'o' → 'lo': ['lo', 'w', 'e', 'r']
Step 2: merge 'lo'+'w' → 'low': ['low', 'e', 'r']
...continue until target vocab size
```
Common words become single tokens. Rare words split into meaningful pieces.
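The merge loop above can be sketched in a few lines. This is a minimal character-level version with no byte fallback or frequency thresholds; `bpe_merges` is a hypothetical helper name, not a library function:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges from a list of words. Returns (merges, final word splits)."""
    words = Counter(tuple(w) for w in corpus)  # each word starts as a tuple of chars
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the chosen merge everywhere it occurs
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges, words

# the 'low'/'lower' walkthrough from above:
merges, words = bpe_merges(["low"] * 5 + ["lower"] * 2, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```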
WordPiece
Used by BERT. Similar to BPE but merges based on likelihood improvement (not raw frequency). Prefixes subword continuations with ##:
```
"tokenization" → ["token", "##ization"]
```
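At inference time WordPiece segments each word greedily, longest match first. A sketch against a toy vocabulary (real BERT falls back to `[UNK]` exactly like this when no piece matches):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark continuations of a word
            if piece in vocab:
                cur = piece
                break
            end -= 1  # no match: try a shorter prefix
        if cur is None:
            return ["[UNK]"]  # no piece matched at all
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece_tokenize("tokenization", {"token", "##ization"}))
# ['token', '##ization']
```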
SentencePiece
Language-agnostic — treats the input as a raw stream of Unicode characters (whitespace included), with no language-specific pre-tokenization, and learns a subword model (BPE or unigram LM) directly on it. Spaces are kept as a visible marker (▁), which makes tokenization fully reversible. Works for any language without word-boundary rules. Used by T5, LLaMA.
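The whitespace-marker convention is what makes decoding lossless. A sketch of the convention itself, not of the actual `sentencepiece` library:

```python
def sp_pretokenize(text):
    """Replace spaces with the visible marker '▁' before subword segmentation."""
    return "▁" + text.replace(" ", "▁")

def sp_decode(tokens):
    """Reverse the convention: concatenate pieces, turn markers back into spaces."""
    return "".join(tokens).replace("▁", " ").lstrip()

print(sp_pretokenize("Hello world"))          # ▁Hello▁world
print(sp_decode(["▁Hello", "▁wor", "ld"]))    # Hello world
```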
Vocabulary size tradeoffs
- Smaller vocab (32k): more tokens per text, longer sequences, but each token is more common and better learned
- Larger vocab (100k+): shorter sequences, but many rare tokens with poor representations
- GPT-2: 50,257 tokens. LLaMA: 32,000. GPT-4: ~100k
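Vocabulary size also directly sets the size of the embedding table. A quick back-of-envelope, using GPT-2 small's 768-dimensional embeddings for illustration:

```python
def embedding_params(vocab_size, d_model):
    # one d_model-dimensional vector per token in the vocabulary
    return vocab_size * d_model

# GPT-2 small: 50,257 tokens x 768 dims ≈ 38.6M parameters in the embedding table alone
print(embedding_params(50257, 768))  # 38597376
```

Doubling the vocabulary doubles this table, which is one reason vocab sizes are chosen carefully rather than simply maximized.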
Special tokens
Models use reserved tokens with specific meanings:
- `[CLS]` — classification token (BERT: start of input)
- `[SEP]` — separator between segments
- `[PAD]` — padding for batch alignment
- `[MASK]` — masked token for MLM training
- `<|endoftext|>` — end of sequence (GPT)
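`[PAD]` exists purely for batching: sequences of different lengths are padded to a common length, and an attention mask tells the model which positions are real. A minimal sketch (`pad_id=0` matches BERT's `[PAD]` id):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length; mask: 1 = real token, 0 = padding."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, mask

ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```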
Python example
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tok("Tokenization is fundamental to NLP")
print(tok.convert_ids_to_tokens(tokens["input_ids"]))
# ['[CLS]', 'token', '##ization', 'is', 'fundamental', 'to', 'nl', '##p', '[SEP]']

# decode back
print(tok.decode(tokens["input_ids"]))
# [CLS] tokenization is fundamental to nlp [SEP]
```
Links
- Text Preprocessing — tokenization is the first preprocessing step
- Embeddings — tokens get mapped to vectors
- Language Models — tokenizer choice affects model behavior
- NLP Roadmap