ML Pipelines

What

Automate the end-to-end ML workflow: data loading, preprocessing, training, evaluation, and deployment.

Why pipelines matter

  • Reproducibility: same code, same data, same result every time
  • Prevent data leakage: transformations fit only on training data, automatically applied to test
  • Deployment simplicity: one object to serialize, load, and serve
  • Fewer bugs: no manual step-by-step transforms to forget or reorder

sklearn Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
 
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier()),
])
 
pipe.fit(X_train, y_train)
pipe.predict(X_test)  # scales + predicts in one call
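The leakage bullet above becomes concrete here: because the scaler is a pipeline step, cross-validation refits it on each training fold only, so the test fold never influences the fit. A minimal sketch (the synthetic `make_classification` data is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# The scaler is refit on each training fold; the held-out fold never leaks in.
scores = cross_val_score(pipe, X, y, cv=5)
```

A useful side effect: step hyperparameters are addressable as "<step>__<param>" (e.g. `model__n_estimators`) in GridSearchCV, so tuning covers the whole pipeline, preprocessing included.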

ColumnTransformer for mixed types

Real data has numbers and categories mixed together. ColumnTransformer handles this:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
 
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
])
 
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
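A usage sketch for the mixed-type pipeline above, fit on a toy DataFrame (the column values are invented for the demo):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

# Toy frame with the expected mixed columns (values made up)
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [40_000, 85_000, 52_000, 91_000],
    "city": ["NYC", "LA", "NYC", "SF"],
    "gender": ["F", "M", "F", "M"],
})
y = [0, 1, 0, 1]

pipe.fit(df, y)
preds = pipe.predict(df)  # categories unseen at fit time are ignored, not errors
```

The `handle_unknown="ignore"` setting is what keeps serving safe: a city that never appeared in training encodes to all zeros instead of crashing the pipeline.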

Full ML pipeline stages

Data ingestion -> Validation -> Preprocessing -> Training ->
Evaluation -> Model registry -> Deployment -> Monitoring
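The stage chain above can be sketched as plain functions, each consuming the previous stage's output (the stage bodies here are placeholders, not a real implementation):

```python
# Placeholder stages: a failure anywhere (e.g. validation) stops the
# run before training ever starts.
def ingest():
    return {"rows": [1, 2, 3]}            # stand-in for real data loading

def validate(data):
    assert data["rows"], "empty dataset"  # schema/sanity gate
    return data

def preprocess(data):
    return {"rows": [r * 2 for r in data["rows"]]}

def train(data):
    return {"model": sum(data["rows"])}   # stand-in for a fitted model

def run_pipeline():
    data = ingest()
    data = validate(data)
    data = preprocess(data)
    return train(data)

artifact = run_pipeline()
```

Orchestrators like the tools below formalize exactly this idea: stages become tasks, the data handoffs become a DAG, and retries/scheduling come for free.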

Beyond sklearn

sklearn pipelines are great for training. Production needs more:

Tool      | What it does                                 | When to use
Airflow   | DAG-based workflow orchestration             | Complex multi-step pipelines, scheduling
Prefect   | Modern Airflow alternative, Python-native    | Simpler setup, better DX
Dagster   | Data-aware orchestration with asset lineage  | When data lineage matters
Kubeflow  | ML pipelines on Kubernetes                   | Already on K8s, need GPU scheduling

CI/CD for ML

ML pipelines need the same rigor as software:

  • Test data assumptions: schema checks, distribution tests before training
  • Automated retraining: trigger on data drift or schedule
  • Model validation gates: new model must beat current production model on held-out set
  • Artifact versioning: version data, code, and model together (DVC + Git)

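The validation-gate bullet might look like this in code (the function name and `margin` parameter are invented for the sketch):

```python
def should_promote(candidate_score, production_score, margin=0.0):
    """Promote the candidate only if it beats the current production
    model on the held-out set, optionally by a required margin."""
    return candidate_score > production_score + margin

# Gate check: candidate at 0.91 vs production at 0.89
promote = should_promote(0.91, 0.89)
```

A nonzero margin guards against promoting on noise: small held-out sets can flip a 0.001 "improvement" from run to run.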
The pattern: Git push -> CI runs tests -> trains model -> compares to production -> auto-deploys if better.