Cross-Validation

What

Instead of relying on a single train/test split, split the data k ways and average the results. This gives a more reliable estimate of model performance than any one split.

K-Fold Cross-Validation

Split data into k folds. Train on k-1 folds, evaluate on the remaining one. Repeat k times.

from sklearn.model_selection import cross_val_score

# model: any scikit-learn estimator; X, y: feature matrix and labels
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
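To make the mechanics explicit, the same loop can be written by hand with KFold. The estimator and the synthetic dataset below are assumptions purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, assumed here only so the sketch runs end to end
X, y = make_classification(n_samples=100, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                  # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # evaluate on held-out fold

print(f"Mean: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

cross_val_score does exactly this loop internally; writing it out is only useful when you need per-fold control (custom metrics, saving each fold's model, etc.).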

Variants

Method               When to use
K-Fold (k=5 or 10)   Default, general purpose
Stratified K-Fold    Classification with imbalanced classes
Leave-One-Out        Very small datasets
Time Series Split    Temporal data (train on past, test on future)
Group K-Fold         When samples are grouped (e.g., same patient)
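Each variant is a splitter object you pass as cv. A sketch for two of them, on a synthetic imbalanced dataset (the data and estimator are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

# Synthetic imbalanced dataset, assumed for illustration (90% / 10% classes)
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

# Stratified K-Fold preserves the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Time Series Split: every training set strictly precedes its test set
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()  # past → future only
```

The same pattern works for GroupKFold, except its split() also takes a groups array so samples from one group never land in both train and test.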

What it gives you

  • Mean score: expected performance on unseen data
  • Std deviation: how stable the model is across different data splits
  • High std = model performance depends heavily on which data it sees
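Two hypothetical sets of fold scores (the numbers are made up for illustration) show why the std matters even when the means are close:

```python
import numpy as np

# Hypothetical per-fold accuracies, invented to contrast stability
stable   = np.array([0.82, 0.81, 0.83, 0.82, 0.82])
unstable = np.array([0.95, 0.60, 0.88, 0.55, 0.92])

for name, s in [("stable", stable), ("unstable", unstable)]:
    print(f"{name}: mean={s.mean():.3f}, std={s.std():.3f}")
```

Both models have a similar mean, but the unstable one's performance swings wildly with the split, so its mean is a much less trustworthy estimate.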