# Model Monitoring
## What
Track model performance in production. Models degrade over time as the real world changes. If you’re not monitoring, you’re guessing.
## What to monitor
- Data drift: input distribution shifts from training data
- Concept drift: the relationship between inputs and outputs changes
- Performance metrics: accuracy, latency, error rates
- Prediction distribution: are outputs still reasonable?
- Infrastructure: memory usage, GPU utilization, request throughput
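One way to make these signals concrete is a per-window snapshot record that your monitoring job emits every few minutes. This is a sketch, not a standard schema — every field name here is illustrative. Note that `accuracy` is optional because ground-truth labels usually arrive with a delay:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MonitoringSnapshot:
    """One record per time window, covering the signal categories above."""
    window_start: str                 # ISO timestamp for the window
    feature_means: dict               # data drift: summary stats of inputs
    accuracy: Optional[float]         # performance: None until labels arrive
    latency_p99_ms: float             # performance: tail latency
    error_rate: float                 # performance: fraction of failed requests
    score_histogram: list             # prediction distribution: binned output counts
    gpu_utilization: float            # infrastructure: 0.0 to 1.0
```

Persisting these snapshots gives you the time series that drift tests and alerting run against.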
## Drift detection
Statistical tests to catch distribution shifts before they tank your model:
| Test | What it does | Best for |
|---|---|---|
| KS test (Kolmogorov-Smirnov) | Compares two distributions | Numerical features, univariate |
| PSI (Population Stability Index) | Measures shift in binned distributions | Categorical/binned features |
| Chi-squared | Compares observed category frequencies to expected | Categorical features |
| MMD (Maximum Mean Discrepancy) | Kernel-based multivariate test | High-dimensional data |
Practical approach: compute reference statistics on your training set. For each batch of production data, run KS/PSI against the reference. Alert when the p-value drops below a significance threshold (e.g., 0.05) or PSI exceeds 0.2.
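The batch check above can be sketched in a few lines with `scipy.stats.ks_2samp` plus a hand-rolled PSI. The thresholds (0.05, 0.2) and bin count are the conventional defaults, not the only reasonable choices:

```python
import numpy as np
from scipy import stats

def psi(reference, production, bins=10):
    """Population Stability Index between two numeric samples,
    binned on the reference distribution's quantiles."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Floor the percentages so empty bins don't produce log(0) or division by zero
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

def check_drift(reference, production, alpha=0.05, psi_threshold=0.2):
    """Run KS and PSI against the reference; return (drifted, details)."""
    ks_stat, p_value = stats.ks_2samp(reference, production)
    psi_value = psi(reference, production)
    drifted = p_value < alpha or psi_value > psi_threshold
    return drifted, {"ks_p_value": float(p_value), "psi": psi_value}
```

Run this per numeric feature per batch; a one-standard-deviation shift in the mean is comfortably caught by both tests at typical batch sizes.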
## Deployment strategies
- Shadow deployment: new model runs alongside production, receives same traffic, but predictions are not served. Compare outputs to catch issues before they hit users
- Canary deployment: route a small % of traffic (e.g., 5%) to the new model. Monitor metrics, then gradually increase if healthy
- A/B testing: split traffic between models, measure business metrics (click-through, revenue), decide with statistical significance
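For canary (and A/B) routing, a common trick is to hash a stable identifier rather than randomize per request, so each user consistently sees the same model. A minimal sketch, assuming a string user id:

```python
import hashlib

def route_to_canary(user_id: str, canary_pct: float = 5.0) -> bool:
    """Deterministically route a fixed % of users to the canary model.

    Hashing the user id (instead of rolling a die per request) keeps each
    user pinned to one model, which also keeps A/B metrics clean.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < canary_pct * 100
```

Ramping the canary is then just raising `canary_pct`; users already in the canary stay there because their bucket doesn't change.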
## Alerting strategies
Not every drift needs a page at 3am. Tier your alerts:
- P1 (immediate): model returning errors, latency spikes, prediction distribution collapses
- P2 (same day): significant data drift detected, performance below threshold
- P3 (weekly review): gradual drift trends, feature importance shifts
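A tiering policy like this is easy to encode as a small classifier over the monitoring signals. The thresholds below are illustrative placeholders — tune them to your own error budget:

```python
def classify_alert(error_rate, latency_p99_ms, psi_value):
    """Map raw monitoring signals to an alert tier (thresholds are illustrative)."""
    if error_rate > 0.05 or latency_p99_ms > 2000:
        return "P1"  # page on-call immediately
    if psi_value > 0.2 or error_rate > 0.01:
        return "P2"  # file a ticket for same-day follow-up
    if psi_value > 0.1:
        return "P3"  # fold into the weekly review
    return None      # healthy, no alert
```

Keeping the rules in one pure function makes the policy reviewable and trivially unit-testable.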
## When to retrain
- Performance drops below a threshold
- Significant data drift detected (KS/PSI alerts)
- On a regular schedule (weekly, monthly)
- When new labeled data becomes available
- After a known external event (policy change, seasonality)
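The triggers above combine naturally into one decision function that a scheduler can call daily. All thresholds here are illustrative defaults; returning the list of reasons makes the retrain audit trail free:

```python
from datetime import datetime, timedelta

def should_retrain(accuracy, psi_value, last_trained, new_labels,
                   acc_floor=0.90, psi_threshold=0.2,
                   max_age=timedelta(days=30), min_new_labels=10_000):
    """Combine the retrain triggers into one decision (thresholds illustrative)."""
    reasons = []
    if accuracy < acc_floor:
        reasons.append("performance below threshold")
    if psi_value > psi_threshold:
        reasons.append("data drift detected")
    if datetime.now() - last_trained > max_age:
        reasons.append("scheduled retrain overdue")
    if new_labels >= min_new_labels:
        reasons.append("enough new labeled data")
    return bool(reasons), reasons
```

External events (policy changes, seasonality) are hard to detect automatically, so they usually stay a manual override on top of a function like this.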
## Monitoring stack
Tools to consider: Evidently AI (open-source drift detection), Prometheus + Grafana (infra metrics), custom dashboards for prediction distributions. The simplest version: log predictions to a database, run daily drift checks as a cron job.
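The "simplest version" is genuinely small. A sketch using SQLite (table name and schema are illustrative): log each prediction with its date, then a daily cron job pulls that day's scores and feeds them to your drift test of choice (e.g., `scipy.stats.ks_2samp` against a held-out reference sample):

```python
import sqlite3
from datetime import date

def log_prediction(conn, features_hash, score):
    """Append one prediction to the log (called from the serving path)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions (day TEXT, features_hash TEXT, score REAL)"
    )
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?, ?)",
        (date.today().isoformat(), features_hash, score),
    )
    conn.commit()

def daily_scores(conn, day):
    """Fetch one day's prediction scores for the cron-driven drift check."""
    rows = conn.execute(
        "SELECT score FROM predictions WHERE day = ?", (day,)
    ).fetchall()
    return [r[0] for r in rows]
```

When the daily check fires too often or the volume outgrows SQLite, that's the signal to graduate to Evidently/Prometheus.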
## Links
- MLOps Roadmap — the bigger picture
- Model Serving — what you’re monitoring
- ML Pipelines — automated retraining when drift detected
- Feature Stores — consistent features reduce training-serving skew