Cross-Validation
What
Instead of relying on a single train/test split, split the data k different ways, evaluate on each split, and average the results. This gives a more reliable estimate of model performance than any one split.
K-Fold Cross-Validation
Split the data into k folds. Train on k-1 folds, evaluate on the remaining one. Repeat k times, so each fold serves as the test set exactly once.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X, y: feature matrix and labels; LogisticRegression is just an example estimator
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
```

Variants
| Method | When to use |
|---|---|
| K-Fold (k=5 or 10) | Default, general purpose |
| Stratified K-Fold | Classification with imbalanced classes |
| Leave-One-Out | Very small datasets |
| Time Series Split | Temporal data (train on past, test on future) |
| Group K-Fold | When samples are grouped (e.g., same patient) |
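To use a variant, pass a splitter object as the `cv` argument instead of an integer. A minimal sketch with Stratified K-Fold on an imbalanced problem (the synthetic dataset and `LogisticRegression` are illustrative choices, not part of the notes above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced binary problem (roughly 90% / 10% class split)
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Stratified K-Fold keeps the class ratio roughly equal in every fold,
# so no fold ends up with only the majority class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")
print(len(scores))  # one score per fold
```

The same pattern works for `TimeSeriesSplit` or `GroupKFold` (the latter also takes a `groups` argument in `cross_val_score`).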
What it gives you
- Mean score: expected performance on unseen data
- Std deviation: how stable the model is across different data splits
- High std = model performance depends heavily on which data it sees
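Under the hood, `cross_val_score` is essentially the following loop; a minimal sketch with synthetic data (dataset and model are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)             # fresh model each fold
    model.fit(X[train_idx], y[train_idx])                 # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # score on the held-out fold

scores = np.array(scores)
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
# A large std relative to the mean means the score depends heavily
# on which rows landed in the training set — an unstable model.
```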