Train-Test Split

What

Divide your dataset into separate portions for training and evaluation. The model never sees test data during training.

Why

If you evaluate on the same data you trained on, you’re measuring memorization, not generalization. The test set simulates “new, unseen data.”

Standard splits

from sklearn.model_selection import train_test_split
 
# Simple split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
# With validation set (for tuning) — fix random_state so the split is reproducible
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Result: 70% train, 15% val, 15% test
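A quick sanity check on a toy dataset (the 1000-sample arrays below are illustrative, not from the text) confirms the 70/15/15 proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 samples, 5 features (illustrative only)
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First split peels off 30% as a temporary holdout
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Second split divides the holdout evenly into val and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```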

Stratified split (for imbalanced classes)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # preserves class proportions
)
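To see what stratification buys you, here is a sketch with deliberately imbalanced toy labels (90% class 0, 10% class 1 — the data is made up for illustration). Both splits end up with the same class ratio as the full dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 900 zeros, 100 ones (illustrative only)
y = np.array([0] * 900 + [1] * 100)
X = np.arange(1000).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The minority-class fraction is preserved in both splits
print(y_train.mean(), y_test.mean())  # 0.1 0.1
```

Without `stratify=y`, a small test set can end up with far fewer (or zero) minority-class examples, making the evaluation metric noisy or undefined.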

Data leakage — the silent killer

Leakage happens when information from the test set influences training. Your metrics look great; production performance is terrible.

Common causes:

  • Scaling/encoding fit on all data instead of train only
  • Time-series data split randomly instead of by time
  • Target encoding using the full dataset
  • Duplicate rows appearing in both train and test
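The first cause above is the easiest to fix: fit preprocessing statistics on the training split only, then apply them to the test split. A minimal sketch with `StandardScaler` (the random data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data for illustration
X = np.random.rand(500, 3)
y = np.random.randint(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Correct: learn mean/std from the training split only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...then reuse those training statistics on the test split
X_test_scaled = scaler.transform(X_test)

# Wrong (leakage): scaler.fit(X) on the full dataset before splitting —
# test-set statistics would then shape the training features.
```

Wrapping the scaler and model in a sklearn `Pipeline` enforces this automatically, including inside cross-validation, which is the safest default.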