Tokenization

What

Splitting text into tokens — the atomic units a model actually processes. Every NLP model starts here. The choice of tokenizer determines what the model can see and how efficiently it represents text.

Levels of tokenization

| Level     | Example ("unhappiness")         | Pros                              | Cons                                     |
|-----------|---------------------------------|-----------------------------------|------------------------------------------|
| Character | u, n, h, a, p, p, i, n, e, s, s | Tiny vocabulary, handles anything | Very long sequences, no word meaning     |
| Word      | unhappiness                     | Intuitive                         | Huge vocabulary, can't handle unseen words |
| Subword   | un, happiness                   | Balanced vocab/length             | Requires training the tokenizer          |

Subword tokenization won. Every modern model uses it.

BPE (Byte-Pair Encoding)

Used by GPT models. Start with individual characters (bytes in GPT-2 and later, so any input is representable), then repeatedly merge the most frequent adjacent pair into a new token.

Step 0: ['l', 'o', 'w', 'e', 'r']
Step 1: merge 'l'+'o' → 'lo':  ['lo', 'w', 'e', 'r']
Step 2: merge 'lo'+'w' → 'low': ['low', 'e', 'r']
...continue until target vocab size

Common words become single tokens. Rare words split into meaningful pieces.
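The merge loop above can be sketched in a few lines of Python. This is a minimal training loop over a toy word list; real implementations add byte-level input handling, tie-breaking rules, and end-of-word markers:

```python
from collections import Counter

def bpe_train(words, num_merges):
    # Represent each word as a tuple of symbols (initially characters).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_train(["lower", "lowest", "low"], 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

On this toy corpus the first two merges produce the `lo` and `low` tokens, matching the worked example above.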

WordPiece

Used by BERT. Similar to BPE but merges based on likelihood improvement (not raw frequency). Prefixes subword continuations with ##:

"tokenization" → ["token", "##ization"]

SentencePiece

Language-agnostic: treats input as a raw character stream with no language-specific pre-tokenization, replacing spaces with a marker symbol (▁, U+2581) rather than relying on word-boundary rules. This makes it work for any language, including those without whitespace between words, and makes detokenization lossless. It wraps BPE or unigram-LM segmentation underneath. Used by T5, LLaMA.
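The space-marker convention can be illustrated with a toy sketch (illustrative only, not the SentencePiece library's API):

```python
def sp_normalize(text: str) -> str:
    # SentencePiece-style normalization: mark word boundaries with ▁
    # instead of splitting on whitespace.
    return "▁" + text.replace(" ", "▁")

def sp_restore(pieces: list[str]) -> str:
    # Detokenization is lossless: concatenate pieces, turn ▁ back into spaces.
    return "".join(pieces).replace("▁", " ").lstrip()

print(sp_normalize("hello world"))              # ▁hello▁world
print(sp_restore(["▁hello", "▁wor", "ld"]))     # hello world
```

Because the marker travels with the pieces, the original string is recoverable exactly, with no word-boundary heuristics at decode time.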

Vocabulary size tradeoffs

  • Smaller vocab (32k): more tokens per text, longer sequences, but each token is more common and better learned
  • Larger vocab (100k+): shorter sequences, but many rare tokens with poor representations
  • GPT-2: 50,257 tokens. LLaMA: 32,000. GPT-4: ~100k

Special tokens

Models use reserved tokens with specific meanings:

  • [CLS] — classification token (BERT: start of input)
  • [SEP] — separator between segments
  • [PAD] — padding for batch alignment
  • [MASK] — masked token for MLM training
  • <|endoftext|> — end of sequence (GPT)

Python example

from transformers import AutoTokenizer
 
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
 
# tokenize; BERT automatically wraps the input in [CLS] ... [SEP]
tokens = tok("Tokenization is fundamental to NLP")
print(tok.convert_ids_to_tokens(tokens["input_ids"]))
# ['[CLS]', 'token', '##ization', 'is', 'fundamental', 'to', 'nl', '##p', '[SEP]']
 
# decode back
print(tok.decode(tokens["input_ids"]))
# [CLS] tokenization is fundamental to nlp [SEP]