Tokenization
What
Splitting text into tokens — the atomic units a model actually processes. Every NLP model starts here. The choice of tokenizer determines what the model can see and how efficiently it represents text.
Levels of tokenization
| Level | Example (“unhappiness”) | Pros | Cons |
|---|---|---|---|
| Character | u, n, h, a, p, p, i, n, e, s, s | Tiny vocabulary, handles anything | Very long sequences, no word meaning |
| Word | unhappiness | Intuitive | Huge vocabulary, can’t handle unseen words |
| Subword | un, happiness | Balanced vocab/length | Requires training the tokenizer |
Subword tokenization won. Every modern model uses it.
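To make the length tradeoff concrete, a toy comparison of the three levels (the subword split here is illustrative, not produced by a real tokenizer):

```python
text = "unhappiness is temporary"

chars = list(text.replace(" ", ""))    # character-level: one token per letter
words = text.split()                   # word-level: one token per word
subwords = ["un", "happiness", "is", "tempor", "ary"]  # hypothetical subword split

print(len(chars), len(words), len(subwords))  # 22 3 5
```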
BPE (Byte-Pair Encoding)
Used by the GPT family. Start with individual characters (GPT-2 and later use raw bytes as the base alphabet), then repeatedly merge the most frequent adjacent pair into a new token.
```
Step 0: ['l', 'o', 'w', 'e', 'r']
Step 1: merge 'l'+'o' → 'lo': ['lo', 'w', 'e', 'r']
Step 2: merge 'lo'+'w' → 'low': ['low', 'e', 'r']
...continue until target vocab size
```
Common words become single tokens. Rare words split into meaningful pieces.
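The merge loop above can be sketched in a few lines. This is a minimal character-level version with no byte fallback or frequency thresholds; `bpe_merges` is a hypothetical helper name, not a library function:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges from a list of words. Returns (merges, final word splits)."""
    words = Counter(tuple(w) for w in corpus)  # each word starts as a tuple of chars
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the chosen merge everywhere it occurs
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges, words

# the 'low'/'lower' walkthrough from above:
merges, words = bpe_merges(["low"] * 5 + ["lower"] * 2, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```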
WordPiece
Used by BERT. Similar to BPE but merges based on likelihood improvement (not raw frequency). Prefixes subword continuations with ##:
```
"tokenization" → ["token", "##ization"]
```
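At inference time WordPiece segments each word greedily, longest match first. A sketch against a toy vocabulary (real BERT falls back to `[UNK]` exactly like this when no piece matches):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark continuations of a word
            if piece in vocab:
                cur = piece
                break
            end -= 1  # no match: try a shorter prefix
        if cur is None:
            return ["[UNK]"]  # no piece matched at all
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece_tokenize("tokenization", {"token", "##ization"}))
# ['token', '##ization']
```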
SentencePiece
Language-agnostic — treats the input as a raw stream of Unicode characters (whitespace included), with no language-specific pre-tokenization, and learns a subword model (BPE or unigram LM) directly on it. Spaces are kept as a visible marker (▁), which makes tokenization fully reversible. Works for any language without word-boundary rules. Used by T5, LLaMA.
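The whitespace-marker convention is what makes decoding lossless. A sketch of the convention itself, not of the actual `sentencepiece` library:

```python
def sp_pretokenize(text):
    """Replace spaces with the visible marker '▁' before subword segmentation."""
    return "▁" + text.replace(" ", "▁")

def sp_decode(tokens):
    """Reverse the convention: concatenate pieces, turn markers back into spaces."""
    return "".join(tokens).replace("▁", " ").lstrip()

print(sp_pretokenize("Hello world"))          # ▁Hello▁world
print(sp_decode(["▁Hello", "▁wor", "ld"]))    # Hello world
```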
Vocabulary size tradeoffs
- Smaller vocab (32k): more tokens per text, longer sequences, but each token is more common and better learned
- Larger vocab (100k+): shorter sequences, but many rare tokens with poor representations
- GPT-2: 50,257 tokens. LLaMA: 32,000. GPT-4: ~100k
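Vocabulary size also directly sets the size of the embedding table. A quick back-of-envelope, using GPT-2 small's 768-dimensional embeddings for illustration:

```python
def embedding_params(vocab_size, d_model):
    # one d_model-dimensional vector per token in the vocabulary
    return vocab_size * d_model

# GPT-2 small: 50,257 tokens x 768 dims ≈ 38.6M parameters in the embedding table alone
print(embedding_params(50257, 768))  # 38597376
```

Doubling the vocabulary doubles this table, which is one reason vocab sizes are chosen carefully rather than simply maximized.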
Special tokens
Models use reserved tokens with specific meanings:
- `[CLS]` — classification token (BERT: start of input)
- `[SEP]` — separator between segments
- `[PAD]` — padding for batch alignment
- `[MASK]` — masked token for MLM training
- `<|endoftext|>` — end of sequence (GPT)
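`[PAD]` exists purely for batching: sequences of different lengths are padded to a common length, and an attention mask tells the model which positions are real. A minimal sketch (`pad_id=0` matches BERT's `[PAD]` id):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length; mask: 1 = real token, 0 = padding."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, mask

ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```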
Python example
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tok("Tokenization is fundamental to NLP")
print(tok.convert_ids_to_tokens(tokens["input_ids"]))
# ['[CLS]', 'token', '##ization', 'is', 'fundamental', 'to', 'nl', '##p', '[SEP]']

# decode back
print(tok.decode(tokens["input_ids"]))
# [CLS] tokenization is fundamental to nlp [SEP]
```
Links
- Text Preprocessing — tokenization is the first preprocessing step
- Embeddings — tokens get mapped to vectors
- Language Models — tokenizer choice affects model behavior
- NLP Roadmap