Text Preprocessing
What
Turning raw text into a format models can work with.
Steps
Tokenization
Split text into units (words, subwords, characters).
# Simple word tokenization
text = "Hello, world! How are you?"
tokens = text.lower().split() # ['hello,', 'world!', 'how', 'are', 'you?']
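As the output above shows, splitting on whitespace leaves punctuation attached to the words. A minimal sketch of one common alternative, a regex that emits punctuation as separate tokens, is given here; the helper name `simple_word_tokenize` is hypothetical and not part of any library.

```python
import re

def simple_word_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_word_tokenize("Hello, world! How are you?")
print(tokens)  # ['hello', ',', 'world', '!', 'how', 'are', 'you', '?']
```

Whether punctuation should be kept as tokens or dropped depends on the downstream task, so treat this as one option rather than the default.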
# Subword tokenization (what modern models use)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("unhappiness") # WordPiece pieces; continuation pieces start with '##' (exact split depends on the vocabulary)
Cleaning
import re
text = re.sub(r'<[^>]+>', '', text) # remove HTML
text = re.sub(r'[^\w\s]', '', text) # remove punctuation
text = text.lower().strip()
Stopword removal (classical NLP)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# Remove common words: "the", "is", "at", "which"...
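The filtering step itself can be sketched as a simple membership test. To keep the example self-contained, it uses a tiny hand-written stopword set (an assumption for illustration only); the `ENGLISH_STOP_WORDS` set imported above could be passed in its place. The helper name `remove_stopwords` is hypothetical.

```python
# Tiny illustrative stopword set (an assumption for this sketch);
# a full list such as sklearn's ENGLISH_STOP_WORDS can be swapped in.
STOPWORDS = frozenset({"the", "is", "at", "which", "a", "on"})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

filtered = remove_stopwords(["the", "cat", "is", "on", "the", "mat"])
print(filtered)  # ['cat', 'mat']
```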
# Not needed for transformer models: they handle context
Stemming/Lemmatization (classical NLP)
- Stemming: “running” → “run” (crude, fast)
- Lemmatization: “better” → “good” (uses dictionary, accurate)
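The difference between the two can be illustrated with a toy sketch. Both functions below are hypothetical stand-ins written for this note, not a real stemmer or lemmatizer: `toy_stem` applies a few suffix rules without any dictionary, while `toy_lemmatize` looks words up in a small hand-written table.

```python
def toy_stem(word):
    """Strip a few common suffixes by rule, with no dictionary (crude, fast)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least 3 characters remain, so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

# Tiny illustrative lemma table (an assumption for this sketch);
# real lemmatizers rely on a full dictionary.
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse"}

def toy_lemmatize(word):
    """Map a word to its dictionary form, falling back to the word itself."""
    return LEMMAS.get(word, word)

print(toy_stem("running"))      # run
print(toy_stem("better"))       # better (no rule applies)
print(toy_lemmatize("better"))  # good
```

The rules cannot relate "better" to "good" because the two share no surface form, which is why lemmatization needs a dictionary.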
Modern approach
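For transformer-based models, the model's own tokenizer is usually all the preprocessing you need: the model was trained on text in that form, and steps such as stopword removal or stemming can discard information it uses. Heavier preprocessing is mainly for classical methods (BoW, TF-IDF).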
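To show why classical methods depend on preprocessing, here is a minimal bag-of-words sketch in plain Python. The function name `bag_of_words` and the sample documents are made up for illustration; in practice a library vectorizer would normally do this.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: every distinct token across all documents, in sorted order.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each distinct surface form gets its own column, so without cleaning, "Cat," and "cat" would be counted as unrelated terms. That is the problem the steps above address for these models.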