# Text Classification

## What
Assign a label (or labels) to text: sentiment, topic, spam, intent, language, toxicity, etc.
## Multi-class vs Multi-label
- Multi-class: exactly one label per document (sentiment: positive/negative/neutral)
- Multi-label: multiple labels can apply (a news article can be both “politics” and “economy”)
- Hierarchical: labels have parent-child relationships (Sports → Football → Premier League)
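In the multi-label case, label sets are typically binarized into one indicator column per label, so that any per-label classifier (or a one-vs-rest wrapper) can be trained on top. A minimal scikit-learn sketch with toy labels:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy multi-label data: each document can carry several topic labels
labels = [["politics", "economy"], ["sports"], ["economy"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary column per label, classes sorted

print(mlb.classes_)  # ['economy' 'politics' 'sports']
print(Y)
# [[1 1 0]
#  [0 0 1]
#  [1 0 0]]
```

The resulting binary matrix can be fed to `OneVsRestClassifier` or any estimator that supports multi-label targets.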
## Approaches (simplest → most powerful)

### 1. TF-IDF + classical ML
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    TfidfVectorizer(max_features=10000, sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight='balanced')
)
model.fit(texts_train, labels_train)
```

Good for: quick baseline, imbalanced data (`class_weight='balanced'`), when interpretability is needed.
### 2. Fine-tuned transformer
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("This movie was fantastic!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
```

For custom classes:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize and train
train_encodings = tokenizer(texts_train, truncation=True, padding=True, max_length=512)
# ... prepare Dataset, then train with Trainer
```

### 3. Zero-shot with LLM
No training data needed — describe the task in a prompt:
```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "I need to cancel my flight",
    candidate_labels=["travel", "finance", "tech", "customer service"]
)
# {'labels': ['travel', 'customer service', 'finance', 'tech'],
#  'scores': [0.85, 0.08, 0.05, 0.02]}
```

For better zero-shot: use larger models (GPT-4, Claude) or fine-tuned NLI models like MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli-ling-wanli.
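The same zero-shot pipeline also covers the multi-label setting: passing `multi_label=True` scores each candidate label independently (as an entailment probability) instead of normalizing scores across labels, so several labels can score high at once. A sketch (actual scores will vary):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "I need to cancel my flight and get a refund to my card",
    candidate_labels=["travel", "finance", "tech"],
    multi_label=True,  # each label is scored independently, no softmax across labels
)
print(result["labels"], result["scores"])
```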
## When to use what
| Approach | Data needed | Quality | Speed | Best for |
|---|---|---|---|---|
| TF-IDF + LR | 100s | Good baseline | Very fast | Spam, topics, simple categories |
| Fine-tuned BERT | 1,000s | Best for specific task | Fast inference, slow training | Sentiment, intent, domain-specific |
| Zero-shot LLM | None | Good general, inconsistent | Slow, expensive | Rapid prototyping, open-ended categories |
Rule: Start with TF-IDF + LR as a baseline. If F1 > 0.85, you may not need deep learning. If lower, try fine-tuned transformers.
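Checking that rule is cheap: cross-validated macro F1 for the baseline takes a few lines. A sketch on toy data (swap in your own texts and labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data so the sketch runs end to end
texts = ["cheap pills buy now", "meeting at noon tomorrow"] * 20
labels = ["spam", "ham"] * 20

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, texts, labels,
                         scoring="f1_macro",
                         cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())  # mean macro F1 across folds
```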
## Class imbalance
Real-world text classification is almost always imbalanced (spam is 95% of email, fraud is 0.1% of transactions):
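What scikit-learn's `'balanced'` mode does can be inspected directly: each class is weighted by `n_samples / (n_classes * n_class_samples)`, so rare classes get proportionally larger weights. A sketch with a toy 95/5 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # 95% of one class, 5% of the other

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
# weights follow n_samples / (n_classes * n_class_samples):
# class 0 -> 100 / (2 * 95), class 1 -> 100 / (2 * 5)
```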
```python
# Option 1: class weights
LogisticRegression(class_weight='balanced')

# Option 2: oversampling
from imblearn.over_sampling import RandomOverSampler

# Option 3: stratified splits — always use this
from sklearn.model_selection import StratifiedKFold

# Option 4: focal loss for transformers
# Adjust loss to focus on hard/misclassified examples
```

## Evaluation
Multi-class:
```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))
# Precision, recall, F1 per class
# Macro F1 (unweighted mean), weighted F1, micro F1
```

Multi-label:
```python
from sklearn.metrics import classification_report

# Convert to binary vectors (one-hot per label)
print(classification_report(y_true_multi, y_pred_multi, target_names=label_names))
```

Threshold tuning: For probability outputs, don’t assume 0.5 is optimal:
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_true, y_proba)
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores)]
```

## Dataset size guidance
| Task complexity | Data needed | Notes |
|---|---|---|
| Binary sentiment | 1,000–10,000 | Sentiment lexicons help with little data |
| Multi-class topics | 5,000–50,000 | More labels = more data needed |
| Intent classification | 100–1,000 per intent | Few-shot works well for intents |
| Domain-specific (medical, legal) | 10,000+ | Domain shift is severe |
| Zero/few-shot | 0–100 | Prompt engineering matters more |
## Practical tips
- **Clean your labels**: Inter-annotator agreement (Cohen’s κ > 0.8). Label noise is the ceiling for model performance.
- **Stratified splits**: Use `StratifiedKFold` to maintain class distribution across train/val/test.
- **Text preprocessing**: For BERT-level models, minimal preprocessing (just lowercase, clean whitespace). For TF-IDF, more preprocessing helps.
- **Domain adaptation**: A model trained on news won’t work for clinical notes. Fine-tune on in-domain data, even if small.
- **Error analysis**: Look at false positives and false negatives — systematic errors reveal data or label issues, not model issues.
- **Calibration**: For production, calibrate probabilities (temperature scaling) — raw logits are often overconfident.
## Key papers
- BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2019) — arXiv:1810.04805
- Rethinking the Inception Architecture (Szegedy et al., 2016) — label smoothing
- Revisiting Few-Shot NER (Fritzler et al., 2020) — few-shot NER
## Links
- Bag of Words and TF-IDF
- BERT and Masked Language Models
- Evaluation Metrics
- Prompt Engineering — zero-shot classification