Train-Test Split
What
Divide your dataset into separate portions for training and evaluation. The model never sees test data during training.
Why
If you evaluate on the same data you trained on, you’re measuring memorization, not generalization. The test set simulates “new, unseen data.”
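The memorization-vs-generalization gap is easy to see with an overfit model. A minimal sketch, using a synthetic dataset and an unconstrained decision tree (both assumed here for illustration, not part of the notes above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unconstrained tree can memorize the training set perfectly
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)  # ~1.0: memorization
test_acc = model.score(X_test, y_test)     # lower: actual generalization
```

Evaluating on X_train alone would report near-perfect accuracy; the held-out test score is the honest number.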
Standard splits
from sklearn.model_selection import train_test_split
# Simple split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
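The random_state argument makes the split reproducible: the same seed always produces the same partition. A quick check on toy data (array contents assumed for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, assumed for illustration
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed -> identical split on every run
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
```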
# With validation set (for tuning)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
# Result: 70% train, 15% val, 15% test

Stratified split (for imbalanced classes)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y # preserves class proportions
)

Data leakage: the silent killer
Leakage = information from the test set sneaking into training. Your metrics look great; production performance is terrible.
Common causes:
- Scaling/encoding fit on all data instead of train only
- Time-series data split randomly instead of by time
- Target encoding using the full dataset
- Duplicate rows appearing in both train and test
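The first cause is the most common, and a Pipeline fixes it structurally: the scaler is fit inside .fit(), so it only ever sees training rows. A sketch, with the dataset and model choice assumed for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, assumed for illustration
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Leaky: scaler computes mean/std over ALL rows, including test
# scaler = StandardScaler().fit(X)   # DON'T do this

# Safe: pipeline fits the scaler on X_train only, then transforms
# X_test with the train-set statistics at predict/score time
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```

The same pattern protects cross-validation: passing the pipeline to cross_val_score refits the scaler inside each fold, so no fold's evaluation data influences its preprocessing.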