Model Monitoring

What

Track model performance in production. Models degrade over time as the real world changes. If you’re not monitoring, you’re guessing.

What to monitor

  • Data drift: input distribution shifts from training data
  • Concept drift: the relationship between inputs and outputs changes
  • Performance metrics: accuracy, latency, error rates
  • Prediction distribution: are outputs still reasonable?
  • Infrastructure: memory usage, GPU utilization, request throughput
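A minimal sketch of collecting these signals in-process with a rolling window (class and field names are illustrative, not from any particular library):

```python
from collections import deque

class RequestMonitor:
    """Rolling window of per-request signals: latency, errors, predictions."""

    def __init__(self, window=1000):
        self.latencies = deque(maxlen=window)
        self.errors = deque(maxlen=window)
        self.predictions = deque(maxlen=window)

    def record(self, latency_ms, prediction, error=False):
        self.latencies.append(latency_ms)
        self.errors.append(1 if error else 0)
        self.predictions.append(prediction)

    def snapshot(self):
        """Summary stats over the current window, ready to export."""
        n = len(self.latencies)
        return {
            "p95_latency_ms": sorted(self.latencies)[int(0.95 * (n - 1))],
            "error_rate": sum(self.errors) / n,
            "mean_prediction": sum(self.predictions) / n,
        }
```

In practice you would export `snapshot()` to your metrics backend on a timer instead of computing it per request.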

Drift detection

Statistical tests to catch distribution shifts before they tank your model:

| Test | What it does | Best for |
| --- | --- | --- |
| KS test (Kolmogorov-Smirnov) | Compares two distributions | Numerical features, univariate |
| PSI (Population Stability Index) | Measures shift in binned distributions | Categorical/binned features |
| Chi-squared | Tests independence of categorical distributions | Categorical features |
| MMD (Maximum Mean Discrepancy) | Kernel-based multivariate test | High-dimensional data |

Practical approach: compute reference statistics on your training set. For each batch of production data, run KS/PSI against the reference. Alert when the KS p-value drops below your significance threshold or PSI > 0.2.
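A sketch of the PSI half of that check, assuming numeric features and decile bins taken from the reference data (the 1e-6 floor and bin count are illustrative choices):

```python
import numpy as np

def psi(reference, production, bins=10):
    """Population Stability Index; bin edges come from reference quantiles."""
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    # Clip production values into the reference range so nothing falls outside
    production = np.clip(production, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_frac = np.histogram(production, bins=edges)[0] / len(production)
    ref_frac = np.clip(ref_frac, 1e-6, None)    # avoid log(0) on empty bins
    prod_frac = np.clip(prod_frac, 1e-6, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)
print(psi(reference, rng.normal(0.0, 1.0, 10_000)))  # near zero: no drift
print(psi(reference, rng.normal(0.5, 1.0, 10_000)))  # large; compare to the 0.2 rule
```

For the KS side, `scipy.stats.ks_2samp(reference, production)` gives the p-value directly.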

Deployment strategies

  • Shadow deployment: new model runs alongside production, receives same traffic, but predictions are not served. Compare outputs to catch issues before they hit users
  • Canary deployment: route a small % of traffic (e.g., 5%) to the new model. Monitor metrics, then gradually increase if healthy
  • A/B testing: split traffic between models, measure business metrics (click-through, revenue), decide with statistical significance
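One way to implement the canary split is deterministic hash-based routing; a sketch (the `route` helper is hypothetical):

```python
import hashlib

def route(request_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically route a request to 'canary' or 'stable'."""
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    bucket = (h % 10_000) / 10_000  # roughly uniform in [0, 1)
    return "canary" if bucket < canary_fraction else "stable"
```

Hashing a stable id (user id, session id) instead of rolling a random number keeps each user on the same model across requests, which makes metric comparisons cleaner. Ramping the canary is just raising `canary_fraction`.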

Alerting strategies

Not every drift needs a page at 3am. Tier your alerts:

  • P1 (immediate): model returning errors, latency spikes, prediction distribution collapses
  • P2 (same day): significant data drift detected, performance below threshold
  • P3 (weekly review): gradual drift trends, feature importance shifts
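The tiers above can be encoded as a small triage function; thresholds here are illustrative defaults, not recommendations from the text:

```python
def alert_tier(error_rate, p95_latency_ms, psi, accuracy,
               latency_slo_ms=200, accuracy_floor=0.90):
    """Map monitoring signals to an alert tier; None means no alert."""
    if error_rate > 0.05 or p95_latency_ms > 5 * latency_slo_ms:
        return "P1"  # page immediately
    if psi > 0.2 or accuracy < accuracy_floor:
        return "P2"  # same-day investigation
    if psi > 0.1:
        return "P3"  # fold into the weekly drift review
    return None
```

Checking P1 conditions first matters: a model that is both erroring and drifting should page, not wait for the weekly review.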

When to retrain

  • Performance drops below a threshold
  • Significant data drift detected (KS/PSI alerts)
  • On a regular schedule (weekly, monthly)
  • When new labeled data becomes available
  • After a known external event (policy change, seasonality)
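These triggers compose naturally into one decision function; a sketch with illustrative thresholds (the function name and defaults are not from any standard tool):

```python
from datetime import datetime, timedelta

def retrain_reasons(now, last_trained, accuracy, psi, new_labels,
                    accuracy_floor=0.90, psi_alert=0.2,
                    max_age=timedelta(days=30), min_new_labels=5_000,
                    external_event=False):
    """Return the list of retrain triggers that fired; empty means keep the model."""
    reasons = []
    if accuracy < accuracy_floor:
        reasons.append("performance below threshold")
    if psi > psi_alert:
        reasons.append("data drift")
    if now - last_trained > max_age:
        reasons.append("scheduled retrain")
    if new_labels >= min_new_labels:
        reasons.append("new labeled data")
    if external_event:
        reasons.append("external event")
    return reasons
```

Returning the reasons rather than a bare boolean makes the retrain decision auditable: log them alongside the new model version.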

Monitoring stack

Tools to consider: Evidently AI (open-source drift detection), Prometheus + Grafana (infra metrics), custom dashboards for prediction distributions. The simplest version: log predictions to a database, run daily drift checks as a cron job.
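That simplest version can be sketched end to end with the standard library; here the daily check is a crude z-test on the day's mean prediction, standing in for the fuller KS/PSI checks (schema and thresholds are illustrative):

```python
import sqlite3
import statistics

def init_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS predictions (ts TEXT, prediction REAL)")
    return conn

def log_prediction(conn, ts, prediction):
    """Called from the serving path for every prediction."""
    conn.execute("INSERT INTO predictions VALUES (?, ?)", (ts, prediction))

def daily_drift_check(conn, day, ref_mean, ref_std, z_threshold=3.0):
    """Run from cron: compare one day's mean prediction to reference stats."""
    rows = conn.execute(
        "SELECT prediction FROM predictions WHERE ts LIKE ?", (day + "%",)
    ).fetchall()
    if not rows:
        return None
    day_mean = statistics.fmean(r[0] for r in rows)
    z = abs(day_mean - ref_mean) / (ref_std / len(rows) ** 0.5)
    return {"day": day, "mean": day_mean, "z": z, "alert": z > z_threshold}
```

Swap SQLite for your warehouse and the z-test for the KS/PSI checks above, and this is a serviceable first monitoring stack.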