Experiment Tracking

What

A system that records every training run — hyperparameters, metrics over time, model artifacts, code version, dataset version, and hardware used. Experiment tracking is the lab notebook of ML: without it, you cannot reproduce results, compare approaches, or explain why a model behaves the way it does.

Why It Matters

“Which combination of learning rate, batch size, and architecture gave me that 94% accuracy last Tuesday?” Without tracking, you will never know. Concrete problems it solves:

  • Reproducibility: re-run any past experiment with identical settings
  • Comparison: overlay training curves of 20 runs to pick the best one
  • Debugging: a model regressed — diff its config against the last good run
  • Collaboration: teammates can see what you tried, what worked, what didn’t
  • Auditing: regulated industries require proof of how a model was trained
  • Model selection: promote the best run’s artifact to the model registry for deployment

How It Works

The Tracking Loop

Every training script follows the same pattern:

1. Start a run (gets a unique run ID)
2. Log parameters (lr, batch_size, architecture, etc.)
3. Train the model, logging metrics at each epoch
4. Log final metrics and artifacts (model weights, plots, configs)
5. End the run

The tracking server stores this as structured data you can query, sort, and visualize.

What to Track

Category            | Examples                                           | Why
Hyperparameters     | lr, batch_size, optimizer, dropout, architecture   | Reproduce and compare runs
Metrics (per epoch) | train_loss, val_loss, val_accuracy, val_f1         | Training curves, early stopping
Final metrics       | test_accuracy, test_f1, inference_latency          | Model selection
Artifacts           | model weights, ONNX export, confusion matrix plot  | Deploy or analyze later
Data version        | dataset hash, DVC version, row count               | Detect data drift
Code version        | git commit SHA, diff                               | Exact code that produced this model
Environment         | Python version, package versions, GPU type         | Reproduce environment
Timing              | total training time, time per epoch                | Cost estimation

Key Tools

Tool                   | Hosting                   | Strengths
MLflow                 | Self-hosted or Databricks | Open-source, local-first, model registry built in
Weights & Biases (W&B) | Cloud (free tier)         | Best visualization, automatic system metrics, sweeps
Neptune                | Cloud                     | Good for teams, flexible metadata
ClearML                | Self-hosted or cloud      | Full pipeline orchestration beyond tracking
TensorBoard            | Local                     | Lightweight, built into TensorFlow/PyTorch

A reasonable default: MLflow for local or self-hosted workflows, W&B for the best UI and collaboration, and TensorBoard if you just need quick training curves.

Experiment Organization

Structure your experiments hierarchically:

  • Project: the ML problem (e.g., “fraud-detection”)
  • Experiment: a hypothesis or approach (e.g., “transformer-vs-xgboost”)
  • Run: a single training execution with specific hyperparameters

Tag runs with metadata: {"model_type": "resnet18", "dataset": "v2.3", "gpu": "A100"}. This makes filtering and comparison easy.

Code Example

MLflow: Full Training Loop

import mlflow
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
 
# --- Generate data ---
X, y = make_classification(n_samples=2000, n_features=20,
                           n_classes=3, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
 
train_ds = TensorDataset(torch.tensor(X_train, dtype=torch.float32),
                         torch.tensor(y_train, dtype=torch.long))
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
 
# --- Model ---
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
 
# --- Track with MLflow ---
mlflow.set_experiment("classification-experiment")
 
with mlflow.start_run(run_name="mlp-baseline"):
    # Log hyperparameters
    mlflow.log_params({
        "lr": 1e-3,
        "batch_size": 64,
        "epochs": 30,
        "architecture": "MLP-64-32",
        "dropout": 0.3,
        "optimizer": "Adam",
    })
 
    # Training loop
    for epoch in range(30):
        model.train()
        epoch_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
 
        avg_loss = epoch_loss / len(train_loader)
        mlflow.log_metric("train_loss", avg_loss, step=epoch)
 
        # Validation
        model.eval()
        with torch.no_grad():
            X_t = torch.tensor(X_test, dtype=torch.float32)
            preds = model(X_t).argmax(dim=1).numpy()
            acc = (preds == y_test).mean()
            mlflow.log_metric("val_accuracy", acc, step=epoch)
 
    # Log final metrics
    mlflow.log_metric("final_accuracy", acc)
 
    # Log model artifact
    torch.save(model.state_dict(), "model.pt")
    mlflow.log_artifact("model.pt")
 
    # Log as MLflow model (enables registry + serving)
    mlflow.pytorch.log_model(model, "model")
 
print(f"Final accuracy: {acc:.3f}")
# To view results, run `mlflow ui` in a terminal, then open http://localhost:5000

W&B: Same Loop (Alternative)

import wandb
 
wandb.init(project="classification", name="mlp-baseline",
           config={"lr": 1e-3, "epochs": 30, "architecture": "MLP-64-32"})
 
for epoch in range(30):
    # ... training code ...
    wandb.log({"train_loss": avg_loss, "val_accuracy": acc, "epoch": epoch})
 
wandb.finish()

Key Tradeoffs

Decision     | Option A                                    | Option B
Hosting      | Self-hosted MLflow (control, privacy)       | Cloud W&B (better UI, zero ops)
Granularity  | Log every batch (detailed)                  | Log every epoch (less storage)
Artifacts    | Log all checkpoints (resume from any point) | Log best only (less storage)
Auto-logging | mlflow.autolog() (easy, noisy)              | Manual logging (precise, more code)

Common Pitfalls

  • Not tracking data version: you changed the dataset but didn’t record it. Now you can’t reproduce the best run. Hash your data or use DVC
  • Logging too little: “I’ll remember the settings” — you won’t. Log everything, filter later
  • Logging too much: saving 100 GB of checkpoints when you only need the best one. Log checkpoints on improvement only
  • No naming convention: 500 runs named “test”, “test2”, “final”, “final_v2”. Use structured names: resnet18-lr0.001-bs64
  • Ignoring failed runs: failed runs tell you what doesn’t work. Don’t delete them

Exercises

  1. Set up MLflow locally. Train the same model with 5 different learning rates (1e-2, 1e-3, 1e-4, 3e-3, 3e-4). Use the MLflow UI to compare training curves and pick the best
  2. Add mlflow.autolog() to a PyTorch training script. Compare what it logs automatically versus what you’d log manually. What’s missing?
  3. Log a confusion matrix as an artifact (save as PNG with matplotlib, then mlflow.log_artifact)
  4. Implement a callback that logs GPU memory usage and training speed (samples/sec) per epoch

Self-Test Questions

  1. What is the difference between a parameter and a metric in experiment tracking? Give examples of each
  2. Why should you log the git commit hash alongside your experiment? What problem does it prevent?
  3. How would you compare 50 runs to find the best hyperparameter combination? What UI features help?
  4. When would you choose MLflow over W&B, and vice versa?
  5. What is the risk of mlflow.autolog() compared to manual logging?