Cross-Validation Done Right

Goal: Implement K-Fold CV from scratch, understand stratification, detect data leakage, and use nested CV for honest model comparison.

Prerequisites: Cross-Validation, Train-Test Split, Evaluation Metrics, Bias-Variance Tradeoff


Why Train/Test Split Isn’t Enough

A single 80/20 split is noisy — you might get lucky or unlucky. Cross-validation averages over multiple splits for a more reliable estimate.
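To see this noise directly, here is a small sketch (setup assumed: Iris and logistic regression, the same tools used later in this lesson) that scores the same model on 20 different random 80/20 splits:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
scores = []
for seed in range(20):
    # Same model, same data, different random split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(f"min={min(scores):.3f}  max={max(scores):.3f}  "
      f"spread={max(scores) - min(scores):.3f}")
```

The spread between the best and worst split is pure evaluation noise; averaging over several folds is exactly what cross-validation adds.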


K-Fold from Scratch

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
 
def kfold_split(n_samples, k=5, shuffle=True, seed=42):
    """Generate k train/test index pairs."""
    indices = np.arange(n_samples)
    if shuffle:
        np.random.RandomState(seed).shuffle(indices)
 
    fold_size = n_samples // k
    for i in range(k):
        test_start = i * fold_size
        test_end = test_start + fold_size if i < k - 1 else n_samples
        test_idx = indices[test_start:test_end]
        train_idx = np.concatenate([indices[:test_start], indices[test_end:]])
        yield train_idx, test_idx
 
# Test
X, y = load_iris(return_X_y=True)
scores = []
for fold, (train_idx, test_idx) in enumerate(kfold_split(len(X), k=5)):
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    scores.append(score)
    print(f"Fold {fold}: {score:.4f}")
 
print(f"\nMean: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
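As an optional sanity check, the hand-rolled splitter (restated below so the block runs standalone) can be verified against the contract that sklearn's `KFold` also obeys: every sample lands in exactly one test fold, and train and test indices never overlap.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_split(n_samples, k=5, shuffle=True, seed=42):
    # Restates the splitter defined above so this block runs standalone
    indices = np.arange(n_samples)
    if shuffle:
        np.random.RandomState(seed).shuffle(indices)
    fold_size = n_samples // k
    for i in range(k):
        test_start = i * fold_size
        test_end = test_start + fold_size if i < k - 1 else n_samples
        test_idx = indices[test_start:test_end]
        train_idx = np.concatenate([indices[:test_start], indices[test_end:]])
        yield train_idx, test_idx

n = 150
all_test = []
for train_idx, test_idx in kfold_split(n, k=5):
    # Train and test must be disjoint within each fold
    assert len(np.intersect1d(train_idx, test_idx)) == 0
    all_test.extend(test_idx)
assert sorted(all_test) == list(range(n))  # every sample tested exactly once

# sklearn's KFold satisfies the same contract
sk_test = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(np.zeros(n)):
    sk_test.extend(test_idx)
assert sorted(sk_test) == list(range(n))
print("both splitters cover each sample exactly once")
```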

Stratified K-Fold

Regular K-Fold can put all samples of a rare class in one fold. Stratification ensures each fold has the same class distribution:

def stratified_kfold(y, k=5, seed=42):
    """Stratified K-Fold — preserves class distribution in each fold."""
    rng = np.random.RandomState(seed)
    classes = np.unique(y)
    fold_indices = [[] for _ in range(k)]
 
    for cls in classes:
        cls_indices = np.where(y == cls)[0]
        rng.shuffle(cls_indices)
        for i, idx in enumerate(cls_indices):
            fold_indices[i % k].append(idx)
 
    for i in range(k):
        test_idx = np.array(fold_indices[i])
        train_idx = np.concatenate([fold_indices[j] for j in range(k) if j != i])
        yield train_idx, test_idx
 
# Verify class distribution is preserved
print("Class distribution per fold:")
for fold, (train_idx, test_idx) in enumerate(stratified_kfold(y, k=5)):
    counts = np.bincount(y[test_idx])
    print(f"  Fold {fold}: {counts}")

When stratification matters

# Imbalanced dataset
from sklearn.datasets import make_classification
X_imb, y_imb = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)
print(f"Class distribution: {np.bincount(y_imb)}")
 
# Without stratification — some folds may have 0 positive samples
for fold, (tr, te) in enumerate(kfold_split(len(X_imb), k=5)):
    print(f"  Regular fold {fold}: test positives = {y_imb[te].sum()}")
 
# With stratification — consistent
for fold, (tr, te) in enumerate(stratified_kfold(y_imb, k=5)):
    print(f"  Stratified fold {fold}: test positives = {y_imb[te].sum()}")

Data Leakage Demo

Leakage occurs when information from the test set reaches the model during training. The model looks great during CV but fails in production.

Leakage Example: Scaling Before Splitting

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
 
# WRONG: scale on ALL data, then cross-validate
X_scaled_wrong = StandardScaler().fit_transform(X)
scores_wrong = cross_val_score(LogisticRegression(max_iter=200), X_scaled_wrong, y, cv=5)
 
# RIGHT: scale inside each fold
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scores_right = cross_val_score(pipe, X, y, cv=5)
 
print(f"Leaked:  {scores_wrong.mean():.4f} ± {scores_wrong.std():.4f}")
print(f"Correct: {scores_right.mean():.4f} ± {scores_right.std():.4f}")

On Iris the difference is small. On messy real data with outliers, leakage can inflate scores by 5-15%.

Common Leakage Sources

# 1. Feature selection on full data → information about test labels leaks
# WRONG:
from sklearn.feature_selection import SelectKBest, f_classif
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)  # uses ALL labels!
cross_val_score(LogisticRegression(max_iter=200), X_selected, y, cv=5)
 
# RIGHT: pipeline
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),
    ("model", LogisticRegression(max_iter=200)),
])
cross_val_score(pipe, X, y, cv=5)
# 2. Target encoding on full data → target values leak
# WRONG: encode using all target values
# RIGHT: encode inside each fold using only train targets
 
# 3. Time series split by random shuffle → future data leaks into past
# WRONG: random K-Fold on time series
# RIGHT: use TimeSeriesSplit (or rolling window)

Rule: Anything that uses information from the test set — including statistics like mean, std, feature importances, or target values — must be computed inside the CV loop on training data only.
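To make the rule concrete, here is a sketch on invented data (50 samples, 1000 pure-noise features, random labels) showing how feature selection on the full data manufactures signal out of nothing:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X_noise = rng.randn(50, 1000)       # pure noise, no real signal
y_noise = rng.randint(0, 2, 50)     # random labels

# WRONG: selection sees all labels, then CV "confirms" the fake signal
X_sel = SelectKBest(f_classif, k=10).fit_transform(X_noise, y_noise)
leaked = cross_val_score(LogisticRegression(max_iter=200), X_sel, y_noise, cv=5).mean()

# RIGHT: selection happens inside each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression(max_iter=200))
honest = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print(f"leaked={leaked:.3f}  honest={honest:.3f}")
```

Since the features are noise, any honest estimate should sit near chance (0.5), while the leaked version typically reports something far better: the failure mode described above, in its purest form.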


Nested Cross-Validation

Problem: if you use CV to both select hyperparameters and estimate performance, you’re optimistically biased.

Solution: two loops — inner for tuning, outer for evaluation.

from sklearn.model_selection import GridSearchCV, StratifiedKFold
 
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
 
# Nested CV
outer_scores = []
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
 
for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
 
    # Inner loop: tune hyperparameters on training data
    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
    grid = GridSearchCV(LogisticRegression(max_iter=200), param_grid,
                        cv=inner_cv, scoring="accuracy")
    grid.fit(X_train, y_train)
 
    # Outer loop: evaluate on held-out test
    score = grid.score(X_test, y_test)
    outer_scores.append(score)
    print(f"Fold {fold}: best_C={grid.best_params_['C']}, test_acc={score:.4f}")
 
print(f"\nNested CV estimate: {np.mean(outer_scores):.4f} ± {np.std(outer_scores):.4f}")
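For reference, the same nested scheme can be written more compactly by passing the `GridSearchCV` object itself to `cross_val_score`, which refits the inner search on every outer training fold (a sketch using the same grid as above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# The inner estimator is itself a tuned search
inner = GridSearchCV(LogisticRegression(max_iter=200),
                     {"C": [0.01, 0.1, 1, 10, 100]},
                     cv=StratifiedKFold(3, shuffle=True, random_state=42))
outer = StratifiedKFold(5, shuffle=True, random_state=42)
nested = cross_val_score(inner, X, y, cv=outer)  # one score per outer fold
print(f"Nested CV: {nested.mean():.4f} ± {nested.std():.4f}")
```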

Non-nested (biased) comparison

# This is optimistically biased — same data used for tuning AND evaluation
grid_biased = GridSearchCV(LogisticRegression(max_iter=200),
                           {"C": [0.01, 0.1, 1, 10, 100]},
                           cv=5, scoring="accuracy")
grid_biased.fit(X, y)
print(f"Non-nested best score: {grid_biased.best_score_:.4f}")
print(f"Nested estimate:       {np.mean(outer_scores):.4f}")
print("(Non-nested is almost always higher — that's the bias)")

Time Series Cross-Validation

Never shuffle time series. Use expanding or sliding windows:

def time_series_cv(n_samples, n_splits=5, min_train=50):
    """Expanding window CV for time series."""
    test_size = (n_samples - min_train) // n_splits
    for i in range(n_splits):
        train_end = min_train + i * test_size
        test_end = train_end + test_size
        if test_end > n_samples:
            break
        yield np.arange(train_end), np.arange(train_end, test_end)
 
# Visualize splits
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
 
fig, ax = plt.subplots(figsize=(12, 3))
n = 200
for fold, (train_idx, test_idx) in enumerate(time_series_cv(n, n_splits=5, min_train=50)):
    ax.barh(fold, len(train_idx), left=0, color="steelblue", height=0.8)
    ax.barh(fold, len(test_idx), left=len(train_idx), color="coral", height=0.8)
ax.set_xlabel("Sample index"); ax.set_ylabel("Fold")
ax.set_title("Time series CV — train always before test")
ax.legend(handles=[mpatches.Patch(color="steelblue", label="Train"),
                   mpatches.Patch(color="coral", label="Test")])
plt.show()

CV Summary Table

| Scenario | Method | Why |
|---|---|---|
| Balanced classification | Stratified K-Fold | Preserve class ratios |
| Imbalanced classification | Stratified K-Fold | Prevent empty folds |
| Regression | K-Fold | No classes to stratify |
| Time series | TimeSeriesSplit | Prevent future leakage |
| Small dataset | Leave-One-Out | Maximum training data |
| Hyperparameter tuning + evaluation | Nested CV | Unbiased estimate |
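The built-in sklearn splitters corresponding to the rows above can be exercised in a few lines (toy data assumed: 12 samples, 4 groups of 3):

```python
import numpy as np
from sklearn.model_selection import (KFold, StratifiedKFold, TimeSeriesSplit,
                                     LeaveOneOut, GroupKFold)

X = np.zeros((12, 2))
y = np.array([0, 1] * 6)
groups = np.repeat(np.arange(4), 3)   # e.g. 4 patients, 3 measurements each

splitters = {
    "KFold": KFold(n_splits=3).split(X),
    "StratifiedKFold": StratifiedKFold(n_splits=3).split(X, y),
    "TimeSeriesSplit": TimeSeriesSplit(n_splits=3).split(X),
    "LeaveOneOut": LeaveOneOut().split(X),
    "GroupKFold": GroupKFold(n_splits=4).split(X, y, groups),
}
for name, splits in splitters.items():
    n_folds = sum(1 for _ in splits)
    print(f"{name}: {n_folds} folds")
```

Note that `GroupKFold` is the built-in answer to exercise 3 below, and `LeaveOneOut` to exercise 2; implementing them by hand first is still worthwhile.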

Exercises

  1. Repeated K-Fold: Run 5-fold CV 10 times with different random seeds. Plot the distribution of scores. How much variance is there?

  2. Leave-One-Out: Implement LOO CV (n folds, each with 1 test sample). Compare the estimate with 5-fold on Iris. LOO has lower bias but higher variance.

  3. Group K-Fold: If data has groups (e.g., multiple measurements per patient), all data from one group must be in the same fold. Implement this.

  4. Leakage detector: Create a function that takes a pipeline and checks if scaling/encoding happens before or after the split. Flag potential leakage.


Next: 11 - Handling Class Imbalance — what to do when 95% of your data is one class.