ML Pipelines

What

Automate the end-to-end ML workflow: data loading, preprocessing, training, evaluation, and deployment.

Why pipelines matter

  • Reproducibility: same code, same data, same result every time
  • Prevent data leakage: transformations fit only on training data, automatically applied to test
  • Deployment simplicity: one object to serialize, load, and serve
  • Fewer bugs: no manual step-by-step transforms to forget or reorder

sklearn Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
 
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier()),
])
 
pipe.fit(X_train, y_train)
pipe.predict(X_test)  # scales + predicts in one call
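The leakage bullet above becomes concrete here: because the scaler is a pipeline step, cross-validation refits it on each training fold only, so the test fold never influences the fit. A minimal sketch (the synthetic `make_classification` data is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# The scaler is refit on each training fold; the held-out fold never leaks in.
scores = cross_val_score(pipe, X, y, cv=5)
```

A useful side effect: step hyperparameters are addressable as "<step>__<param>" (e.g. `model__n_estimators`) in GridSearchCV, so tuning covers the whole pipeline, preprocessing included.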

ColumnTransformer for mixed types

Real data has numbers and categories mixed together. ColumnTransformer handles this:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
 
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
])
 
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
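A usage sketch for the mixed-type pipeline above, fit on a toy DataFrame (the column values are invented for the demo):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

# Toy frame with the expected mixed columns (values made up)
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [40_000, 85_000, 52_000, 91_000],
    "city": ["NYC", "LA", "NYC", "SF"],
    "gender": ["F", "M", "F", "M"],
})
y = [0, 1, 0, 1]

pipe.fit(df, y)
preds = pipe.predict(df)  # categories unseen at fit time are ignored, not errors
```

The `handle_unknown="ignore"` setting is what keeps serving safe: a city that never appeared in training encodes to all zeros instead of crashing the pipeline.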

Full ML pipeline stages

Data ingestion -> Validation -> Preprocessing -> Training ->
Evaluation -> Model registry -> Deployment -> Monitoring
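The stage chain above can be sketched as plain functions, each consuming the previous stage's output (the stage bodies here are placeholders, not a real implementation):

```python
# Placeholder stages: a failure anywhere (e.g. validation) stops the
# run before training ever starts.
def ingest():
    return {"rows": [1, 2, 3]}            # stand-in for real data loading

def validate(data):
    assert data["rows"], "empty dataset"  # schema/sanity gate
    return data

def preprocess(data):
    return {"rows": [r * 2 for r in data["rows"]]}

def train(data):
    return {"model": sum(data["rows"])}   # stand-in for a fitted model

def run_pipeline():
    data = ingest()
    data = validate(data)
    data = preprocess(data)
    return train(data)

artifact = run_pipeline()
```

Orchestrators like the tools below formalize exactly this idea: stages become tasks, the data handoffs become a DAG, and retries/scheduling come for free.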

Beyond sklearn

sklearn pipelines are great for training. Production needs more:

Tool      | What it does                                 | When to use
Airflow   | DAG-based workflow orchestration             | Complex multi-step pipelines, scheduling
Prefect   | Modern Airflow alternative, Python-native    | Simpler setup, better DX
Dagster   | Data-aware orchestration with asset lineage  | When data lineage matters
Kubeflow  | ML pipelines on Kubernetes                   | Already on K8s, need GPU scheduling

CI/CD for ML

ML pipelines need the same rigor as software:

  • Test data assumptions: schema checks, distribution tests before training
  • Automated retraining: trigger on data drift or schedule
  • Model validation gates: new model must beat current production model on held-out set
  • Artifact versioning: version data, code, and model together (DVC + Git)

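The validation-gate bullet might look like this in code (the function name and `margin` parameter are invented for the sketch):

```python
def should_promote(candidate_score, production_score, margin=0.0):
    """Promote the candidate only if it beats the current production
    model on the held-out set, optionally by a required margin."""
    return candidate_score > production_score + margin

# Gate check: candidate at 0.91 vs production at 0.89
promote = should_promote(0.91, 0.89)
```

A nonzero margin guards against promoting on noise: small held-out sets can flip a 0.001 "improvement" from run to run.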
The pattern: Git push -> CI runs tests -> trains model -> compares to production -> auto-deploys if better.