# Cross-Validation Done Right

**Goal:** Implement K-Fold CV from scratch, understand stratification, detect data leakage, and use nested CV for honest model comparison.

**Prerequisites:** Cross-Validation, Train-Test Split, Evaluation Metrics, Bias-Variance Tradeoff
## Why Train/Test Split Isn't Enough

A single 80/20 split is noisy: you might get lucky or unlucky with which samples land in the test set. Cross-validation averages over multiple splits for a more reliable estimate.
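To see the noise directly, here is a quick sketch that repeats a single 80/20 split with twenty different seeds (using the same Iris-plus-logistic-regression setup as the rest of this section) and reports the spread:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Repeat a single 80/20 split with different seeds and watch the score move
scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(f"Single-split scores range from {min(scores):.3f} to {max(scores):.3f}")
print(f"Spread (std): {np.std(scores):.3f}")
```

The spread across seeds is exactly the noise that averaging over K folds smooths out.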
## K-Fold from Scratch
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def kfold_split(n_samples, k=5, shuffle=True, seed=42):
    """Generate k train/test index pairs."""
    indices = np.arange(n_samples)
    if shuffle:
        np.random.RandomState(seed).shuffle(indices)
    fold_size = n_samples // k
    for i in range(k):
        test_start = i * fold_size
        # The last fold absorbs the remainder when n_samples % k != 0
        test_end = test_start + fold_size if i < k - 1 else n_samples
        test_idx = indices[test_start:test_end]
        train_idx = np.concatenate([indices[:test_start], indices[test_end:]])
        yield train_idx, test_idx
```
```python
# Test on Iris
X, y = load_iris(return_X_y=True)
scores = []
for fold, (train_idx, test_idx) in enumerate(kfold_split(len(X), k=5)):
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    scores.append(score)
    print(f"Fold {fold}: {score:.4f}")

print(f"\nMean: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```

## Stratified K-Fold
Regular K-Fold can put all samples of a rare class into one fold. Stratification ensures each fold has approximately the same class distribution as the full dataset:
```python
def stratified_kfold(y, k=5, seed=42):
    """Stratified K-Fold: preserves class distribution in each fold."""
    rng = np.random.RandomState(seed)
    classes = np.unique(y)
    fold_indices = [[] for _ in range(k)]
    for cls in classes:
        cls_indices = np.where(y == cls)[0]
        rng.shuffle(cls_indices)
        # Round-robin assignment spreads each class evenly across folds
        for i, idx in enumerate(cls_indices):
            fold_indices[i % k].append(idx)
    for i in range(k):
        test_idx = np.array(fold_indices[i])
        train_idx = np.concatenate([fold_indices[j] for j in range(k) if j != i])
        yield train_idx, test_idx
```
```python
# Verify class distribution is preserved
print("Class distribution per fold:")
for fold, (train_idx, test_idx) in enumerate(stratified_kfold(y, k=5)):
    counts = np.bincount(y[test_idx])
    print(f"  Fold {fold}: {counts}")
```

### When stratification matters
```python
# Imbalanced dataset: ~90% class 0, ~10% class 1
from sklearn.datasets import make_classification

X_imb, y_imb = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)
print(f"Class distribution: {np.bincount(y_imb)}")

# Without stratification, some folds may have 0 positive samples
for fold, (tr, te) in enumerate(kfold_split(len(X_imb), k=5)):
    print(f"  Regular fold {fold}: test positives = {y_imb[te].sum()}")

# With stratification, every fold gets its share
for fold, (tr, te) in enumerate(stratified_kfold(y_imb, k=5)):
    print(f"  Stratified fold {fold}: test positives = {y_imb[te].sum()}")
```

## Data Leakage Demo
Leakage = information from the test set leaks into training. The model looks great during CV but fails in production.
### Leakage Example: Scaling Before Splitting
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# WRONG: fit the scaler on ALL data, then cross-validate
X_scaled_wrong = StandardScaler().fit_transform(X)
scores_wrong = cross_val_score(LogisticRegression(max_iter=200), X_scaled_wrong, y, cv=5)

# RIGHT: scale inside each fold via a pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scores_right = cross_val_score(pipe, X, y, cv=5)

print(f"Leaked:  {scores_wrong.mean():.4f} ± {scores_wrong.std():.4f}")
print(f"Correct: {scores_right.mean():.4f} ± {scores_right.std():.4f}")
```

On Iris the difference is small. On messy real data with outliers, leakage can inflate scores by 5-15%.
### Common Leakage Sources
```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# 1. Feature selection on the full data: information about test labels leaks
# WRONG:
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)  # uses ALL labels!
cross_val_score(LogisticRegression(max_iter=200), X_selected, y, cv=5)

# RIGHT: put the selector in a pipeline so it only ever sees training folds
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),
    ("model", LogisticRegression(max_iter=200)),
])
cross_val_score(pipe, X, y, cv=5)

# 2. Target encoding on the full data: target values leak
# WRONG: encode using all target values
# RIGHT: encode inside each fold using only train targets (sketched below)

# 3. Time series split by random shuffle: future data leaks into the past
# WRONG: random K-Fold on time series
# RIGHT: use TimeSeriesSplit (or a rolling window)
```

**Rule:** Anything that uses information from the test set, including statistics such as the mean, std, feature importances, or target values, must be computed inside the CV loop on training data only.
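To make leakage source 2 concrete, here is a minimal sketch of fold-safe mean target encoding. The `cat` feature and `target` array are made-up stand-ins, and the unsmoothed encoding is illustrative rather than a production recipe:

```python
import numpy as np

rng = np.random.RandomState(0)
cat = rng.randint(0, 3, size=150)          # hypothetical categorical feature
target = (rng.rand(150) < 0.3).astype(int)  # hypothetical binary target

for train_idx, test_idx in kfold_split(len(cat), k=5):
    # Compute category -> mean(target) using TRAIN rows only
    means = {c: target[train_idx][cat[train_idx] == c].mean()
             for c in np.unique(cat[train_idx])}
    global_mean = target[train_idx].mean()
    # Apply to test rows; unseen categories fall back to the global mean
    encoded_test = np.array([means.get(c, global_mean) for c in cat[test_idx]])
```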
## Nested Cross-Validation

**Problem:** if you use the same CV scores both to select hyperparameters and to estimate performance, the estimate is optimistically biased: you picked the configuration that happened to look best on those exact folds.

**Solution:** two loops, an inner one for tuning and an outer one for evaluation.
```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Nested CV
outer_scores = []
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Inner loop: tune hyperparameters on training data only
    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
    grid = GridSearchCV(LogisticRegression(max_iter=200), param_grid,
                        cv=inner_cv, scoring="accuracy")
    grid.fit(X_train, y_train)

    # Outer loop: evaluate the tuned model on the held-out test fold
    score = grid.score(X_test, y_test)
    outer_scores.append(score)
    print(f"Fold {fold}: best_C={grid.best_params_['C']}, test_acc={score:.4f}")

print(f"\nNested CV estimate: {np.mean(outer_scores):.4f} ± {np.std(outer_scores):.4f}")
```

### Non-nested (biased) comparison
```python
# This is optimistically biased: the same data is used for tuning AND evaluation
grid_biased = GridSearchCV(LogisticRegression(max_iter=200),
                           {"C": [0.01, 0.1, 1, 10, 100]},
                           cv=5, scoring="accuracy")
grid_biased.fit(X, y)
print(f"Non-nested best score: {grid_biased.best_score_:.4f}")
print(f"Nested estimate: {np.mean(outer_scores):.4f}")
print("(Non-nested is almost always higher — that's the bias)")
```

## Time Series Cross-Validation
Never shuffle time series: a shuffled split lets the model train on the future and predict the past. Use expanding or sliding windows instead:
```python
def time_series_cv(n_samples, n_splits=5, min_train=50):
    """Expanding-window CV for time series: train always precedes test."""
    test_size = (n_samples - min_train) // n_splits
    for i in range(n_splits):
        train_end = min_train + i * test_size
        test_end = train_end + test_size
        if test_end > n_samples:
            break
        yield np.arange(train_end), np.arange(train_end, test_end)

# Visualize the splits
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

fig, ax = plt.subplots(figsize=(12, 3))
n = 200
for fold, (train_idx, test_idx) in enumerate(time_series_cv(n, n_splits=5, min_train=50)):
    ax.barh(fold, len(train_idx), left=0, color="steelblue", height=0.8)
    ax.barh(fold, len(test_idx), left=len(train_idx), color="coral", height=0.8)
ax.set_xlabel("Sample index"); ax.set_ylabel("Fold")
ax.set_title("Time series CV — train always before test")
ax.legend(handles=[mpatches.Patch(color="steelblue", label="Train"),
                   mpatches.Patch(color="coral", label="Test")])
plt.show()
```

## CV Summary Table
| Scenario | Method | Why |
|---|---|---|
| Balanced classification | Stratified K-Fold | Preserve class ratios |
| Imbalanced classification | Stratified K-Fold | Prevent empty folds |
| Regression | K-Fold | No classes to stratify |
| Time series | TimeSeriesSplit | Prevent future leakage |
| Small dataset | Leave-One-Out | Maximum training data |
| Hyperparameter tuning + evaluation | Nested CV | Unbiased estimate |
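In practice you rarely need to hand-roll these splitters; scikit-learn ships one for each row of the table. Here is a minimal sketch of the built-in equivalents, assuming the synthetic `X`, `y` from the nested CV section above (note that `LeaveOneOut` fits one model per sample, so it is slow on large data, and `TimeSeriesSplit` is only meaningful when rows are in time order):

```python
from sklearn.model_selection import (KFold, StratifiedKFold, TimeSeriesSplit,
                                     LeaveOneOut, cross_val_score)

model = LogisticRegression(max_iter=200)
splitters = [
    ("KFold", KFold(n_splits=5, shuffle=True, random_state=42)),
    ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=42)),
    ("TimeSeriesSplit", TimeSeriesSplit(n_splits=5)),  # API demo only on this data
    ("LeaveOneOut", LeaveOneOut()),
]
for name, cv in splitters:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name:>16}: {scores.mean():.4f} (n_folds={len(scores)})")
```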
## Exercises

- **Repeated K-Fold:** Run 5-fold CV 10 times with different random seeds. Plot the distribution of scores. How much variance is there?
- **Leave-One-Out:** Implement LOO CV (n folds, each with one test sample). Compare the estimate with 5-fold on Iris. LOO has lower bias but higher variance.
- **Group K-Fold:** If the data has groups (e.g., multiple measurements per patient), all data from one group must land in the same fold. Implement this.
- **Leakage detector:** Create a function that takes a pipeline and checks whether scaling/encoding happens before or after the split, flagging potential leakage.
**Next:** 11 - Handling Class Imbalance — what to do when 95% of your data is one class.