Text Preprocessing

What

Turning raw text into a format models can work with.

Steps

Tokenization

Split text into units (words, subwords, characters).

# Simple word tokenization
text = "Hello, world! How are you?"
tokens = text.lower().split()  # ['hello,', 'world!', 'how', 'are', 'you?']
 
# Subword tokenization (what modern models use)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("unhappiness")  # WordPiece splits rare words into pieces; continuations are marked with '##'

Cleaning

import re
text = re.sub(r'<[^>]+>', '', text)     # remove HTML
text = re.sub(r'[^\w\s]', '', text)     # remove punctuation
text = text.lower().strip()

Stopword removal (classical NLP)

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# Remove common words: "the", "is", "at", "which"...
# Not needed for transformer models — they handle context
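A minimal sketch of the removal itself, using a small hand-picked stopword set for illustration (in practice you'd use the full sklearn or NLTK list):

```python
# Illustrative stopword set — real lists (sklearn's ENGLISH_STOP_WORDS) have ~300 entries
STOPWORDS = {"the", "is", "at", "which", "on", "a", "an"}

words = "the cat sat on the mat".split()
filtered = [w for w in words if w not in STOPWORDS]  # ['cat', 'sat', 'mat']
```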

Stemming/Lemmatization (classical NLP)

  • Stemming: “running” → “run” (crude, fast)
  • Lemmatization: “better” → “good” (uses dictionary, accurate)
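To show what "crude, fast" means, here is a toy suffix-stripping stemmer (a simplified sketch in the spirit of Porter stemming, not the real algorithm — production code would use nltk's PorterStemmer or WordNetLemmatizer):

```python
def simple_stem(word):
    # Strip one common suffix, keeping at least a 3-letter stem
    for suffix in ("ing", "edly", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant: "runn" -> "run"
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] in "bdgklmnprt":
        word = word[:-1]
    return word

simple_stem("running")  # 'run'
simple_stem("books")    # 'book'
```

Note the stemmer only chops strings — it could never map "better" to "good"; that requires a dictionary, which is why lemmatization is more accurate.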

Modern approach

For transformer-based models, you usually only need the matching tokenizer — aggressive cleaning (lowercasing, stopword removal, stemming) can even hurt, because the model learns those patterns from context. Heavy preprocessing matters mainly for classical methods (BoW, TF-IDF).
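The full classical pipeline can be sketched end-to-end — clean, tokenize, count — as a bag-of-words using only the stdlib (a minimal illustration; sklearn's CountVectorizer/TfidfVectorizer do this for real workloads):

```python
import re
from collections import Counter

def bag_of_words(text):
    # Clean + tokenize: lowercase, keep only alphabetic runs
    tokens = re.findall(r"[a-z']+", text.lower())
    # Count occurrences — the BoW representation
    return Counter(tokens)

bag_of_words("The cat sat. The cat ran!")
# Counter({'the': 2, 'cat': 2, 'sat': 1, 'ran': 1})
```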