# Experiment Tracking
## What
A system that records every training run — hyperparameters, metrics over time, model artifacts, code version, dataset version, and hardware used. Experiment tracking is the lab notebook of ML: without it, you cannot reproduce results, compare approaches, or explain why a model behaves the way it does.
## Why It Matters
“Which combination of learning rate, batch size, and architecture gave me that 94% accuracy last Tuesday?” Without tracking, you will never know. Concrete problems it solves:
- Reproducibility: re-run any past experiment with identical settings
- Comparison: overlay training curves of 20 runs to pick the best one
- Debugging: a model regressed — diff its config against the last good run
- Collaboration: teammates can see what you tried, what worked, what didn’t
- Auditing: regulated industries require proof of how a model was trained
- Model selection: promote the best run’s artifact to the model registry for deployment
## How It Works
### The Tracking Loop
Every training script follows the same pattern:
1. Start a run (gets a unique run ID)
2. Log parameters (lr, batch_size, architecture, etc.)
3. Train the model, logging metrics at each epoch
4. Log final metrics and artifacts (model weights, plots, configs)
5. End the run
The tracking server stores this as structured data you can query, sort, and visualize.
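The five steps above can be sketched as a toy, in-memory tracking client. This `Run` class is hypothetical — not any real library's API — but it shows the structured data a real tracking server stores and queries:

```python
import time
import uuid

class Run:
    """Minimal in-memory stand-in for a tracking client (illustration only)."""
    def __init__(self, name):
        self.run_id = uuid.uuid4().hex      # step 1: unique run ID
        self.name = name
        self.params = {}                    # step 2: hyperparameters
        self.metrics = []                   # step 3: (name, value, step) rows
        self.artifacts = []                 # step 4: file paths
        self.ended = False

    def log_params(self, params):
        self.params.update(params)

    def log_metric(self, name, value, step=0):
        self.metrics.append({"name": name, "value": value,
                             "step": step, "time": time.time()})

    def log_artifact(self, path):
        self.artifacts.append(path)

    def end(self):
        self.ended = True                   # step 5

run = Run("mlp-baseline")
run.log_params({"lr": 1e-3, "batch_size": 64})
for epoch in range(3):
    run.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)
run.log_artifact("model.pt")
run.end()

# Because the data is structured, queries are trivial:
best = min(run.metrics, key=lambda m: m["value"])
print(best["step"])  # → 2 (epoch with lowest loss)
```

Real trackers (MLflow, W&B) follow exactly this shape, plus persistence and a UI on top.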
### What to Track
| Category | Examples | Why |
|---|---|---|
| Hyperparameters | lr, batch_size, optimizer, dropout, architecture | Reproduce and compare runs |
| Metrics (per epoch) | train_loss, val_loss, val_accuracy, val_f1 | Training curves, early stopping |
| Final metrics | test_accuracy, test_f1, inference_latency | Model selection |
| Artifacts | model weights, ONNX export, confusion matrix plot | Deploy or analyze later |
| Data version | dataset hash, DVC version, row count | Detect data drift |
| Code version | git commit SHA, diff | Exact code that produced this model |
| Environment | Python version, package versions, GPU type | Reproduce environment |
| Timing | total training time, time per epoch | Cost estimation |
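As a concrete example of the "data version" and "code version" rows, two small stdlib helpers (illustrative names, not part of any tracking library) can compute values worth logging with every run:

```python
import hashlib
import subprocess

def dataset_fingerprint(path, chunk_size=1 << 20):
    """SHA-256 of a dataset file, shortened to a loggable fingerprint."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()[:12]

def git_commit_sha():
    """Commit SHA of the current checkout; None if not in a git repo."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
```

Logged once per run (e.g. as tags or params), these two values answer "which data and which code produced this model?"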
### Key Tools
| Tool | Hosting | Strengths |
|---|---|---|
| MLflow | Self-hosted or Databricks | Open-source, local-first, model registry built in |
| Weights & Biases (W&B) | Cloud (free tier) | Best visualization, automatic system metrics, sweeps |
| Neptune | Cloud | Good for teams, flexible metadata |
| ClearML | Self-hosted or cloud | Full pipeline orchestration beyond tracking |
| TensorBoard | Local | Lightweight, built into TensorFlow/PyTorch |
Rule of thumb: MLflow for local or self-hosted workflows, W&B for the best UI and team collaboration, TensorBoard if you just need quick training curves.
### Experiment Organization
Structure your experiments hierarchically:
- Project: the ML problem (e.g., “fraud-detection”)
- Experiment: a hypothesis or approach (e.g., “transformer-vs-xgboost”)
- Run: a single training execution with specific hyperparameters
Tag runs with metadata, e.g. `{"model_type": "resnet18", "dataset": "v2.3", "gpu": "A100"}`. This makes filtering and comparison easy.
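A sketch of why uniform tag keys pay off: with consistent tags, filtering runs becomes a one-liner. The `matches` helper here is hypothetical — real trackers expose the same idea as query strings in their UI or search API:

```python
# Uniform tag keys across runs make ad-hoc filtering trivial.
runs = [
    {"model_type": "resnet18", "dataset": "v2.3", "gpu": "A100"},
    {"model_type": "xgboost",  "dataset": "v2.3", "gpu": "CPU"},
    {"model_type": "resnet18", "dataset": "v2.2", "gpu": "A100"},
]

def matches(tags, **filters):
    """True if every filter key/value appears in the run's tags."""
    return all(tags.get(k) == v for k, v in filters.items())

latest_resnets = [r for r in runs
                  if matches(r, model_type="resnet18", dataset="v2.3")]
print(len(latest_resnets))  # → 1
```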
## Code Example
### MLflow: Full Training Loop
```python
import mlflow
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# --- Generate data ---
X, y = make_classification(n_samples=2000, n_features=20,
                           n_classes=3, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
train_ds = TensorDataset(torch.tensor(X_train, dtype=torch.float32),
                         torch.tensor(y_train, dtype=torch.long))
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

# --- Model ---
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# --- Track with MLflow ---
mlflow.set_experiment("classification-experiment")

with mlflow.start_run(run_name="mlp-baseline"):
    # Log hyperparameters
    mlflow.log_params({
        "lr": 1e-3,
        "batch_size": 64,
        "epochs": 30,
        "architecture": "MLP-64-32",
        "dropout": 0.3,
        "optimizer": "Adam",
    })

    # Training loop
    for epoch in range(30):
        model.train()
        epoch_loss = 0.0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        avg_loss = epoch_loss / len(train_loader)
        mlflow.log_metric("train_loss", avg_loss, step=epoch)

        # Validation
        model.eval()
        with torch.no_grad():
            X_t = torch.tensor(X_test, dtype=torch.float32)
            preds = model(X_t).argmax(dim=1).numpy()
        acc = (preds == y_test).mean()
        mlflow.log_metric("val_accuracy", acc, step=epoch)

    # Log final metrics
    mlflow.log_metric("final_accuracy", acc)

    # Log model artifact
    torch.save(model.state_dict(), "model.pt")
    mlflow.log_artifact("model.pt")

    # Log as MLflow model (enables registry + serving)
    mlflow.pytorch.log_model(model, "model")

    print(f"Final accuracy: {acc:.3f}")
```

View the results in the browser:

```shell
mlflow ui  # opens at http://localhost:5000
```

### W&B: Same Loop (Alternative)
```python
import wandb

wandb.init(project="classification", name="mlp-baseline",
           config={"lr": 1e-3, "epochs": 30, "architecture": "MLP-64-32"})

for epoch in range(30):
    # ... training code (same as above) ...
    wandb.log({"train_loss": avg_loss, "val_accuracy": acc, "epoch": epoch})

wandb.finish()
```

## Key Tradeoffs
| Decision | Option A | Option B |
|---|---|---|
| Hosting | Self-hosted MLflow (control, privacy) | Cloud W&B (better UI, zero ops) |
| Granularity | Log every batch (detailed) | Log every epoch (less storage) |
| Artifacts | Log all checkpoints (resume from any point) | Log best only (less storage) |
| Auto-logging | mlflow.autolog() (easy, noisy) | Manual logging (precise, more code) |
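The "log best only" artifact strategy from the table takes just a few lines of bookkeeping. This is a sketch: `BestCheckpoint` is a hypothetical helper, and the actual save/log calls are shown as comments:

```python
class BestCheckpoint:
    """Track the best validation loss; save an artifact only on improvement."""
    def __init__(self):
        self.best = float("inf")

    def update(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            return True   # caller saves/logs the checkpoint now
        return False

saver = BestCheckpoint()
for epoch, val_loss in enumerate([0.9, 0.7, 0.8, 0.6]):
    if saver.update(val_loss):
        # e.g. torch.save(model.state_dict(), "best.pt")
        #      mlflow.log_artifact("best.pt")
        print(f"epoch {epoch}: new best val_loss {val_loss}")
```

Here checkpoints are written at epochs 0, 1, and 3 only — epoch 2 regressed, so nothing is stored.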
## Common Pitfalls
- Not tracking data version: you changed the dataset but didn’t record it. Now you can’t reproduce the best run. Hash your data or use DVC
- Logging too little: “I’ll remember the settings” — you won’t. Log everything, filter later
- Logging too much: saving 100 GB of checkpoints when you only need the best one. Log checkpoints on improvement only
- No naming convention: 500 runs named “test”, “test2”, “final”, “final_v2”. Use structured names like `resnet18-lr0.001-bs64`
- Ignoring failed runs: failed runs tell you what doesn’t work. Don’t delete them
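The structured-name convention above can be generated rather than typed by hand. A hypothetical helper (relies on Python preserving keyword-argument order):

```python
def run_name(model, **hparams):
    """Build a structured run name, e.g. 'resnet18-lr0.001-bs64'."""
    parts = [model] + [f"{k}{v}" for k, v in hparams.items()]
    return "-".join(parts)

print(run_name("resnet18", lr=0.001, bs=64))  # → resnet18-lr0.001-bs64
```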
## Exercises
- Set up MLflow locally. Train the same model with 5 different learning rates (1e-2, 1e-3, 1e-4, 3e-3, 3e-4). Use the MLflow UI to compare training curves and pick the best
- Add `mlflow.autolog()` to a PyTorch training script. Compare what it logs automatically versus what you’d log manually. What’s missing?
- Log a confusion matrix as an artifact (save as PNG with matplotlib, then `mlflow.log_artifact`)
- Implement a callback that logs GPU memory usage and training speed (samples/sec) per epoch
## Self-Test Questions
- What is the difference between a parameter and a metric in experiment tracking? Give examples of each
- Why should you log the git commit hash alongside your experiment? What problem does it prevent?
- How would you compare 50 runs to find the best hyperparameter combination? What UI features help?
- When would you choose MLflow over W&B, and vice versa?
- What is the risk of `mlflow.autolog()` compared to manual logging?
## Links
- MLOps Roadmap
- Hyperparameter Tuning — what you’re tracking experiments for
- ML Pipelines — automated training that logs experiments
- Model Serving — deploying the best tracked model
- Model Monitoring — tracking continues after deployment