# ML Pipelines

## What

Automate the end-to-end ML workflow: data loading → preprocessing → training → evaluation → deployment.

## Why pipelines matter
- Reproducibility: same code, same data, same result every time
- Prevent data leakage: transformations fit only on training data, automatically applied to test
- Deployment simplicity: one object to serialize, load, and serve
- Fewer bugs: no manual step-by-step transforms to forget or reorder
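The leakage point is worth seeing concretely: because the scaler is a pipeline step, cross-validation re-fits it on each training fold only. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# The scaler lives inside the pipeline, so cross_val_score re-fits it on each
# training fold only -- test-fold statistics never leak into preprocessing.
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling X once up front, before splitting, would leak test-fold means and variances into training.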
## sklearn Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier()),
])

pipe.fit(X_train, y_train)
pipe.predict(X_test)  # scales + predicts in one call
```
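A further payoff of the single pipeline object: hyperparameters of any step can be tuned through it using sklearn's `step__param` naming convention. A sketch with synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Step name + "__" + parameter name reaches inside the pipeline; preprocessing
# is re-fit on each CV fold, so the search itself is leakage-free.
grid = GridSearchCV(pipe, {"model__n_estimators": [50, 100]}, cv=3)
grid.fit(X, y)
```

The same convention tunes preprocessing steps too, e.g. `preprocess__num__with_mean`.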
## ColumnTransformer for mixed types

Real data has numbers and categories mixed together. ColumnTransformer handles this:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
```
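Fitted, this whole pipeline is one serializable object, which is the deployment win from the list above. A sketch using joblib and a tiny hypothetical DataFrame (column names match the example; the data is made up):

```python
import io

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [40_000, 85_000, 52_000, 91_000],
    "city": ["NYC", "LA", "NYC", "SF"],
    "gender": ["F", "M", "F", "M"],
})
y = [0, 1, 0, 1]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
])
pipe = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
pipe.fit(df, y)

# One artifact carries preprocessing + model; serving code only loads and predicts.
# (BytesIO stands in for a file on disk or in object storage.)
buf = io.BytesIO()
joblib.dump(pipe, buf)
buf.seek(0)
restored = joblib.load(buf)
```

In production you would `joblib.dump(pipe, "model.joblib")` at train time and `joblib.load` it in the serving process.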
## Full ML pipeline stages

Data ingestion → Validation → Preprocessing → Training → Evaluation → Model registry → Deployment → Monitoring
## Beyond sklearn
sklearn pipelines are great for training. Production needs more:
| Tool | What it does | When to use |
|---|---|---|
| Airflow | DAG-based workflow orchestration | Complex multi-step pipelines, scheduling |
| Prefect | Modern Airflow alternative, Python-native | Simpler setup, better DX |
| Dagster | Data-aware orchestration with asset lineage | When data lineage matters |
| Kubeflow | ML pipelines on Kubernetes | Already on K8s, need GPU scheduling |
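All of these tools share one core idea: tasks form a DAG and run in dependency order. A toy illustration using Python's stdlib `graphlib` (the real orchestrators add scheduling, retries, and observability on top):

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on,
# mirroring the pipeline stages listed above.
dag = {
    "validate": {"ingest"},
    "preprocess": {"validate"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

# static_order() yields stages so that every dependency runs first.
order = list(TopologicalSorter(dag).static_order())
# order == ["ingest", "validate", "preprocess", "train", "evaluate", "deploy"]
```

An orchestrator is essentially this loop plus persistence: it records which stages succeeded, retries failures, and resumes from the last good stage.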
## CI/CD for ML
ML pipelines need the same rigor as software:
- Test data assumptions: schema checks, distribution tests before training
- Automated retraining: trigger on data drift or schedule
- Model validation gates: new model must beat current production model on held-out set
- Artifact versioning: version data, code, and model together (DVC + Git)
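As a sketch of the first bullet, a hand-rolled schema check that a CI step could run before training (column names and ranges are hypothetical; libraries like Great Expectations or pandera do this at scale):

```python
import pandas as pd

def check_schema(df: pd.DataFrame) -> list[str]:
    """Collect violations instead of raising, so CI can report them all at once."""
    problems = []
    expected = {"age": "int64", "income": "int64", "city": "object"}
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # A basic range check; distribution tests (e.g. KS test vs. a reference
    # sample) would slot in here too.
    if "age" in df.columns and not df["age"].between(0, 120).all():
        problems.append("age outside [0, 120]")
    return problems

good = pd.DataFrame({"age": [30, 45], "income": [50_000, 80_000], "city": ["NYC", "LA"]})
bad = good.drop(columns=["income"])
```

CI would fail the build whenever `check_schema` returns a non-empty list.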
The pattern: Git push → CI runs tests → trains model → compares to production → auto-deploys if better.
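The "compares to production" gate can be as simple as a thresholded metric comparison. A sketch (the 0.01 margin is an arbitrary assumption; pick one suited to your metric's noise):

```python
def should_promote(candidate_score: float, production_score: float,
                   min_gain: float = 0.01) -> bool:
    # Require a meaningful margin over production, not just a tie,
    # so noise in the held-out metric doesn't churn deployments.
    return candidate_score >= production_score + min_gain

should_promote(0.92, 0.90)  # clearly better: promote
should_promote(0.90, 0.90)  # no improvement: keep the production model
```

In practice the scores would come from evaluating both models on the same held-out set inside the CI job.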
## Links
- MLOps Roadmap — the bigger picture
- Experiment Tracking — logging what each pipeline run produces
- Model Serving — what happens after the pipeline
- Feature Stores — centralized feature management
- Model Monitoring — watching the deployed model