# Experiment Tracking
## What
A system that records every training run — hyperparameters, metrics over time, model artifacts, code version, dataset version, and hardware used. Experiment tracking is the lab notebook of ML: without it, you cannot reproduce results, compare approaches, or explain why a model behaves the way it does.
## Why It Matters
“Which combination of learning rate, batch size, and architecture gave me that 94% accuracy last Tuesday?” Without tracking, you will never know. Concrete problems it solves:
- Reproducibility: re-run any past experiment with identical settings
- Comparison: overlay training curves of 20 runs to pick the best one
- Debugging: a model regressed — diff its config against the last good run
- Collaboration: teammates can see what you tried, what worked, what didn’t
- Auditing: regulated industries require proof of how a model was trained
- Model selection: promote the best run’s artifact to the model registry for deployment
## How It Works
### The Tracking Loop
Every training script follows the same pattern:
1. Start a run (gets a unique run ID)
2. Log parameters (lr, batch_size, architecture, etc.)
3. Train the model, logging metrics at each epoch
4. Log final metrics and artifacts (model weights, plots, configs)
5. End the run
The tracking server stores this as structured data you can query, sort, and visualize.
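The five steps above can be sketched as a toy, in-memory tracking client. This `Run` class is hypothetical — not any real library's API — but it shows the structured data a real tracking server stores and queries:

```python
import time
import uuid

class Run:
    """Minimal in-memory stand-in for a tracking client (illustration only)."""
    def __init__(self, name):
        self.run_id = uuid.uuid4().hex      # step 1: unique run ID
        self.name = name
        self.params = {}                    # step 2: hyperparameters
        self.metrics = []                   # step 3: (name, value, step) rows
        self.artifacts = []                 # step 4: file paths
        self.ended = False

    def log_params(self, params):
        self.params.update(params)

    def log_metric(self, name, value, step=0):
        self.metrics.append({"name": name, "value": value,
                             "step": step, "time": time.time()})

    def log_artifact(self, path):
        self.artifacts.append(path)

    def end(self):
        self.ended = True                   # step 5

run = Run("mlp-baseline")
run.log_params({"lr": 1e-3, "batch_size": 64})
for epoch in range(3):
    run.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)
run.log_artifact("model.pt")
run.end()

# Because the data is structured, queries are trivial:
best = min(run.metrics, key=lambda m: m["value"])
print(best["step"])  # → 2 (epoch with lowest loss)
```

Real trackers (MLflow, W&B) follow exactly this shape, plus persistence and a UI on top.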
### What to Track
| Category | Examples | Why |
|---|---|---|
| Hyperparameters | lr, batch_size, optimizer, dropout, architecture | Reproduce and compare runs |
| Metrics (per epoch) | train_loss, val_loss, val_accuracy, val_f1 | Training curves, early stopping |
| Final metrics | test_accuracy, test_f1, inference_latency | Model selection |
| Artifacts | model weights, ONNX export, confusion matrix plot | Deploy or analyze later |
| Data version | dataset hash, DVC version, row count | Detect data drift |
| Code version | git commit SHA, diff | Exact code that produced this model |
| Environment | Python version, package versions, GPU type | Reproduce environment |
| Timing | total training time, time per epoch | Cost estimation |
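As a concrete example of the "data version" and "code version" rows, two small stdlib helpers (illustrative names, not part of any tracking library) can compute values worth logging with every run:

```python
import hashlib
import subprocess

def dataset_fingerprint(path, chunk_size=1 << 20):
    """SHA-256 of a dataset file, shortened to a loggable fingerprint."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()[:12]

def git_commit_sha():
    """Commit SHA of the current checkout; None if not in a git repo."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
```

Logged once per run (e.g. as tags or params), these two values answer "which data and which code produced this model?"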
### Key Tools
| Tool | Hosting | Strengths |
|---|---|---|
| MLflow | Self-hosted or Databricks | Open-source, local-first, model registry built in |
| Weights & Biases (W&B) | Cloud (free tier) | Best visualization, automatic system metrics, sweeps |
| Neptune | Cloud | Good for teams, flexible metadata |
| ClearML | Self-hosted or cloud | Full pipeline orchestration beyond tracking |
| TensorBoard | Local | Lightweight, built into TensorFlow/PyTorch |
Rule of thumb: MLflow for local or self-hosted workflows, W&B for the best UI and team collaboration, TensorBoard if you just need quick training curves.
### Experiment Organization
Structure your experiments hierarchically:
- Project: the ML problem (e.g., “fraud-detection”)
- Experiment: a hypothesis or approach (e.g., “transformer-vs-xgboost”)
- Run: a single training execution with specific hyperparameters
Tag runs with metadata, e.g. `{"model_type": "resnet18", "dataset": "v2.3", "gpu": "A100"}`. This makes filtering and comparison easy.
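A sketch of why uniform tag keys pay off: with consistent tags, filtering runs becomes a one-liner. The `matches` helper here is hypothetical — real trackers expose the same idea as query strings in their UI or search API:

```python
# Uniform tag keys across runs make ad-hoc filtering trivial.
runs = [
    {"model_type": "resnet18", "dataset": "v2.3", "gpu": "A100"},
    {"model_type": "xgboost",  "dataset": "v2.3", "gpu": "CPU"},
    {"model_type": "resnet18", "dataset": "v2.2", "gpu": "A100"},
]

def matches(tags, **filters):
    """True if every filter key/value appears in the run's tags."""
    return all(tags.get(k) == v for k, v in filters.items())

latest_resnets = [r for r in runs
                  if matches(r, model_type="resnet18", dataset="v2.3")]
print(len(latest_resnets))  # → 1
```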
## Code Example
### MLflow: Full Training Loop
```python
import mlflow
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# --- Generate data ---
X, y = make_classification(n_samples=2000, n_features=20,
                           n_classes=3, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
train_ds = TensorDataset(torch.tensor(X_train, dtype=torch.float32),
                         torch.tensor(y_train, dtype=torch.long))
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

# --- Model ---
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# --- Track with MLflow ---
mlflow.set_experiment("classification-experiment")

with mlflow.start_run(run_name="mlp-baseline"):
    # Log hyperparameters
    mlflow.log_params({
        "lr": 1e-3,
        "batch_size": 64,
        "epochs": 30,
        "architecture": "MLP-64-32",
        "dropout": 0.3,
        "optimizer": "Adam",
    })

    # Training loop
    for epoch in range(30):
        model.train()
        epoch_loss = 0.0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        avg_loss = epoch_loss / len(train_loader)
        mlflow.log_metric("train_loss", avg_loss, step=epoch)

        # Validation
        model.eval()
        with torch.no_grad():
            X_t = torch.tensor(X_test, dtype=torch.float32)
            preds = model(X_t).argmax(dim=1).numpy()
        acc = (preds == y_test).mean()
        mlflow.log_metric("val_accuracy", acc, step=epoch)

    # Log final metrics
    mlflow.log_metric("final_accuracy", acc)

    # Log model artifact
    torch.save(model.state_dict(), "model.pt")
    mlflow.log_artifact("model.pt")

    # Log as MLflow model (enables registry + serving)
    mlflow.pytorch.log_model(model, "model")

    print(f"Final accuracy: {acc:.3f}")
```

View the results in the browser:

```shell
mlflow ui  # opens at http://localhost:5000
```

### W&B: Same Loop (Alternative)
```python
import wandb

wandb.init(project="classification", name="mlp-baseline",
           config={"lr": 1e-3, "epochs": 30, "architecture": "MLP-64-32"})

for epoch in range(30):
    # ... training code (same as above) ...
    wandb.log({"train_loss": avg_loss, "val_accuracy": acc, "epoch": epoch})

wandb.finish()
```

## Key Tradeoffs
| Decision | Option A | Option B |
|---|---|---|
| Hosting | Self-hosted MLflow (control, privacy) | Cloud W&B (better UI, zero ops) |
| Granularity | Log every batch (detailed) | Log every epoch (less storage) |
| Artifacts | Log all checkpoints (resume from any point) | Log best only (less storage) |
| Auto-logging | mlflow.autolog() (easy, noisy) | Manual logging (precise, more code) |
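The "log best only" artifact strategy from the table takes just a few lines of bookkeeping. This is a sketch: `BestCheckpoint` is a hypothetical helper, and the actual save/log calls are shown as comments:

```python
class BestCheckpoint:
    """Track the best validation loss; save an artifact only on improvement."""
    def __init__(self):
        self.best = float("inf")

    def update(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            return True   # caller saves/logs the checkpoint now
        return False

saver = BestCheckpoint()
for epoch, val_loss in enumerate([0.9, 0.7, 0.8, 0.6]):
    if saver.update(val_loss):
        # e.g. torch.save(model.state_dict(), "best.pt")
        #      mlflow.log_artifact("best.pt")
        print(f"epoch {epoch}: new best val_loss {val_loss}")
```

Here checkpoints are written at epochs 0, 1, and 3 only — epoch 2 regressed, so nothing is stored.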
## Common Pitfalls
- Not tracking data version: you changed the dataset but didn’t record it. Now you can’t reproduce the best run. Hash your data or use DVC
- Logging too little: “I’ll remember the settings” — you won’t. Log everything, filter later
- Logging too much: saving 100 GB of checkpoints when you only need the best one. Log checkpoints on improvement only
- No naming convention: 500 runs named “test”, “test2”, “final”, “final_v2”. Use structured names like `resnet18-lr0.001-bs64`
- Ignoring failed runs: failed runs tell you what doesn’t work. Don’t delete them
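The structured-name convention above can be generated rather than typed by hand. A hypothetical helper (relies on Python preserving keyword-argument order):

```python
def run_name(model, **hparams):
    """Build a structured run name, e.g. 'resnet18-lr0.001-bs64'."""
    parts = [model] + [f"{k}{v}" for k, v in hparams.items()]
    return "-".join(parts)

print(run_name("resnet18", lr=0.001, bs=64))  # → resnet18-lr0.001-bs64
```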
## Exercises
- Set up MLflow locally. Train the same model with 5 different learning rates (1e-2, 1e-3, 1e-4, 3e-3, 3e-4). Use the MLflow UI to compare training curves and pick the best
- Add `mlflow.autolog()` to a PyTorch training script. Compare what it logs automatically versus what you’d log manually. What’s missing?
- Log a confusion matrix as an artifact (save as PNG with matplotlib, then `mlflow.log_artifact`)
- Implement a callback that logs GPU memory usage and training speed (samples/sec) per epoch
## Self-Test Questions
- What is the difference between a parameter and a metric in experiment tracking? Give examples of each
- Why should you log the git commit hash alongside your experiment? What problem does it prevent?
- How would you compare 50 runs to find the best hyperparameter combination? What UI features help?
- When would you choose MLflow over W&B, and vice versa?
- What is the risk of `mlflow.autolog()` compared to manual logging?
## Links
- MLOps Roadmap
- Hyperparameter Tuning — what you’re tracking experiments for
- ML Pipelines — automated training that logs experiments
- Model Serving — deploying the best tracked model
- Model Monitoring — tracking continues after deployment