Text Classification

What

Assign a label (or labels) to text: sentiment, topic, spam, intent, language, toxicity, etc.

Multi-class vs Multi-label

  • Multi-class: exactly one label per document (sentiment: positive/negative/neutral)
  • Multi-label: multiple labels can apply (a news article can be both “politics” and “economy”)
  • Hierarchical: labels have parent-child relationships (Sports → Football → Premier League)
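For multi-label problems, labels are usually encoded as binary indicator vectors before training. A minimal sketch with scikit-learn, using hypothetical topic labels:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical documents, each carrying one or more topic labels
doc_labels = [["politics", "economy"], ["sports"], ["politics"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(doc_labels)
print(mlb.classes_)  # classes come out sorted: ['economy' 'politics' 'sports']
print(Y)
# [[1 1 0]
#  [0 0 1]
#  [0 1 0]]
```

The resulting indicator matrix can then be fed to OneVsRestClassifier or any estimator that accepts multi-label targets.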

Approaches (simplest → most powerful)

1. TF-IDF + classical ML

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
 
model = make_pipeline(
    TfidfVectorizer(max_features=10000, sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight='balanced')
)
model.fit(texts_train, labels_train)

Good for: quick baselines, imbalanced data (class_weight='balanced'), and cases where interpretability is needed.

2. Fine-tuned transformer

from transformers import pipeline
 
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("This movie was fantastic!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

For custom classes:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
 
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
 
# Tokenize and train
train_encodings = tokenizer(texts_train, truncation=True, padding=True, max_length=512)
# ... prepare Dataset, then train with Trainer
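The elided step above wraps the tokenized encodings in a torch Dataset so Trainer can batch them. A minimal sketch (the class name TextDataset is ours, not a transformers API):

```python
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Pairs tokenizer output with labels in the dict format Trainer expects."""
    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of lists from tokenizer(...)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
```

Then pass train_dataset=TextDataset(train_encodings, labels_train), together with a TrainingArguments instance, to Trainer and call trainer.train().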

3. Zero-shot with LLM

No training data needed — describe the task in a prompt:

from transformers import pipeline
 
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "I need to cancel my flight",
    candidate_labels=["travel", "finance", "tech", "customer service"]
)
# {'labels': ['travel', 'customer service', 'finance', 'tech'],
#  'scores': [0.85, 0.08, 0.05, 0.02]}

For better zero-shot: use larger models (GPT-4, Claude) or fine-tuned models like MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli-ling-wanli.

When to use what

Approach | Data needed | Quality | Speed | Best for
---------|-------------|---------|-------|---------
TF-IDF + LR | 100s | Good baseline | Very fast | Spam, topics, simple categories
Fine-tuned BERT | 1,000s | Best for specific task | Fast inference, slow training | Sentiment, intent, domain-specific
Zero-shot LLM | None | Good general, inconsistent | Slow, expensive | Rapid prototyping, open-ended categories

Rule: Start with TF-IDF + LR as a baseline. If F1 > 0.85, you may not need deep learning. If lower, try fine-tuned transformers.

Class imbalance

Real-world text classification is almost always imbalanced (spam is 95% of email, fraud is 0.1% of transactions):

# Option 1: class weights
LogisticRegression(class_weight='balanced')
 
# Option 2: oversampling (duplicate minority-class examples in the training set)
from imblearn.over_sampling import RandomOverSampler
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
 
# Option 3: stratified splits (always use these)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
 
# Option 4: focal loss for transformers
# Adjust the loss to focus on hard/misclassified examples
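Option 4 can be sketched numerically. Focal loss (Lin et al., 2017) multiplies cross-entropy by (1 - p_t)^gamma, so confident, correct predictions contribute little to the loss. Shown here in NumPy for clarity; in a transformers setup you would typically override Trainer.compute_loss with a torch equivalent:

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Mean focal loss.

    probs  : (n, n_classes) predicted class probabilities
    labels : (n,) integer class indices
    gamma  : focusing parameter; gamma=0 recovers (alpha-scaled) cross-entropy
    alpha  : class-balancing weight
    """
    p_t = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))

probs = np.array([[0.95, 0.05],   # easy example, contributes almost nothing
                  [0.55, 0.45]])  # harder example dominates the loss
labels = np.array([0, 0])
print(focal_loss(probs, labels))
```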

Evaluation

Multi-class:

from sklearn.metrics import classification_report, confusion_matrix
 
print(classification_report(y_true, y_pred))
# Precision, recall, F1 per class
# Macro F1 (unweighted mean), weighted F1, micro F1
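The choice of averaging matters on imbalanced data; a quick illustration with toy labels:

```python
from sklearn.metrics import f1_score

# Imbalanced toy example: class 1 is rare and the model misses half of it
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(f1_score(y_true, y_pred, average='macro'))  # ~0.80, exposes the rare-class miss
print(f1_score(y_true, y_pred, average='micro'))  # 0.90, dominated by the majority class
```

Report macro F1 when performance on rare classes matters as much as on frequent ones.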

Multi-label:

from sklearn.metrics import classification_report
 
# Convert to binary vectors (one-hot per label)
print(classification_report(y_true_multi, y_pred_multi, target_names=label_names))

Threshold tuning: For probability outputs, don’t assume 0.5 is optimal:

import numpy as np
from sklearn.metrics import precision_recall_curve
 
precisions, recalls, thresholds = precision_recall_curve(y_true, y_proba)
# precision_recall_curve returns one more precision/recall point than thresholds,
# so drop the last point before matching them up
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores)]
y_pred = (y_proba >= best_threshold).astype(int)

Dataset size guidance

Task complexity | Data needed | Notes
----------------|-------------|------
Binary sentiment | 1,000–10,000 | Sentiment lexicons help with little data
Multi-class topics | 5,000–50,000 | More labels = more data needed
Intent classification | 100–1,000 per intent | Few-shot works well for intents
Domain-specific (medical, legal) | 10,000+ | Domain shift is severe
Zero/few-shot | 0–100 | Prompt engineering matters more

Practical tips

  1. Clean your labels: Inter-annotator agreement (Cohen’s κ > 0.8). Label noise is the ceiling for model performance.

  2. Stratified splits: Use StratifiedKFold to maintain class distribution across train/test/val.

  3. Text preprocessing: For BERT-level models, minimal preprocessing (just lowercase, clean whitespace). For TF-IDF, more preprocessing helps.

  4. Domain adaptation: A model trained on news won’t work for clinical notes. Fine-tune on in-domain data, even if small.

  5. Error analysis: Look at false positives and false negatives — systematic errors reveal data or label issues, not model issues.

  6. Calibration: For production, calibrate probabilities (temperature scaling) — raw logits are often overconfident.
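Tip 6 can be sketched as follows. Temperature scaling divides the logits by a single scalar T, fitted on a held-out set to minimize negative log-likelihood; a grid search is used here for simplicity (in practice you might use an optimizer such as scipy's minimize_scalar), and the toy logits are ours:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(logits)."""
    p = softmax(logits)[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(p + 1e-12)))

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.1, 10.0, 500)):
    """Pick T > 0 minimizing validation NLL of logits / T."""
    losses = [nll(val_logits / t, val_labels) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Overconfident toy model: huge logit margins, but 30% of labels disagree
rng = np.random.default_rng(0)
y = rng.integers(0, 3, 300)
logits = np.eye(3)[y] * 8.0 + rng.normal(0, 0.5, (300, 3))
flip = rng.random(300) < 0.3
y = np.where(flip, (y + 1) % 3, y)

T = fit_temperature(logits, y)
print(T)  # T > 1 here: probabilities are softened toward the true accuracy
```

Because T only rescales the logits, the argmax (and hence accuracy) is unchanged; only the confidence estimates move.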

Key papers

  • BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2019) — arXiv:1810.04805
  • Rethinking the Inception Architecture for Computer Vision (Szegedy et al., 2016) — arXiv:1512.00567 — label smoothing
  • Revisiting Few-Shot NER (Fritzler et al., 2020) — few-shot NER