Train-Test Split

What

Divide your dataset into separate portions for training and evaluation. The model never sees test data during training.

Why

If you evaluate on the same data you trained on, you’re measuring memorization, not generalization. The test set simulates “new, unseen data.”

Standard splits

from sklearn.model_selection import train_test_split
 
# Simple split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
# With validation set (for tuning) — fix random_state so the split is reproducible
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Result: 70% train, 15% val, 15% test
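A quick sanity check on a toy dataset (the 1000-sample arrays below are illustrative, not from the text) confirms the 70/15/15 proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 samples, 5 features (illustrative only)
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First split peels off 30% as a temporary holdout
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Second split divides the holdout evenly into val and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```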

Stratified split (for imbalanced classes)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # preserves class proportions
)
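To see what stratification buys you, here is a sketch with deliberately imbalanced toy labels (90% class 0, 10% class 1 — the data is made up for illustration). Both splits end up with the same class ratio as the full dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 900 zeros, 100 ones (illustrative only)
y = np.array([0] * 900 + [1] * 100)
X = np.arange(1000).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The minority-class fraction is preserved in both splits
print(y_train.mean(), y_test.mean())  # 0.1 0.1
```

Without `stratify=y`, a small test set can end up with far fewer (or zero) minority-class examples, making the evaluation metric noisy or undefined.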

Data leakage — the silent killer

Leakage happens when information from the test set influences training. Your metrics look great; production performance is terrible.

Common causes:

  • Scaling/encoding fit on all data instead of train only
  • Time-series data split randomly instead of by time
  • Target encoding using the full dataset
  • Duplicate rows appearing in both train and test
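The first cause above is the easiest to fix: fit preprocessing statistics on the training split only, then apply them to the test split. A minimal sketch with `StandardScaler` (the random data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data for illustration
X = np.random.rand(500, 3)
y = np.random.randint(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Correct: learn mean/std from the training split only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...then reuse those training statistics on the test split
X_test_scaled = scaler.transform(X_test)

# Wrong (leakage): scaler.fit(X) on the full dataset before splitting —
# test-set statistics would then shape the training features.
```

Wrapping the scaler and model in a sklearn `Pipeline` enforces this automatically, including inside cross-validation, which is the safest default.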