Multi-Object Tracking

What

Track multiple objects across video frames. The input is a video; the output is a set of trajectories — each object gets a unique ID that persists across frames.

This is harder than detection. Detection answers “what’s here right now?” Tracking answers “is this the same object I saw last frame?” — the identity problem.

The MOT paradigm: tracking-by-detection

Almost all modern trackers follow the same two-stage pattern:

For each frame:
  1. Run a detector (YOLO, Faster R-CNN) → get bounding boxes
  2. Associate new detections with existing tracks → assign IDs

The detector handles “what” and “where.” The tracker handles “who is who.”

Why not track directly?

End-to-end tracking models exist (TrackFormer, MOTR) but tracking-by-detection remains dominant because:

  • You can upgrade detector and tracker independently
  • Detectors are well-studied and highly optimized
  • Association is a well-defined combinatorial problem

Association methods

Given N existing tracks and M new detections, how do you match them?

IoU (Intersection over Union) matching

The simplest approach: compute IoU between every track’s predicted position and every new detection. High IoU = probably the same object.

import numpy as np
 
def iou(box_a, box_b):
    """Compute IoU between two boxes [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
 
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
 
    return inter / union if union > 0 else 0.0
 
def iou_matrix(tracks, detections):
    """Compute pairwise IoU between tracks and detections."""
    n_tracks = len(tracks)
    n_dets = len(detections)
    mat = np.zeros((n_tracks, n_dets))
    for i in range(n_tracks):
        for j in range(n_dets):
            mat[i, j] = iou(tracks[i], detections[j])
    return mat

Hungarian algorithm

Optimal assignment: minimize total cost (or maximize total IoU) across all track-detection pairs. This is a linear assignment problem solvable in O(n^3).

from scipy.optimize import linear_sum_assignment
 
def match_hungarian(score_matrix, threshold=0.3):
    """Match tracks to detections using the Hungarian algorithm.
    score_matrix: (n_tracks, n_dets) IoU matrix -- higher = better match.
    Returns matched pairs, unmatched track indices, unmatched detection indices.
    """
    if score_matrix.size == 0:
        return ([], list(range(score_matrix.shape[0])),
                list(range(score_matrix.shape[1])))
 
    # linear_sum_assignment minimizes cost, so negate the IoU scores
    row_idx, col_idx = linear_sum_assignment(-score_matrix)
 
    matched, unmatched_tracks, unmatched_dets = [], [], []
    for r, c in zip(row_idx, col_idx):
        if score_matrix[r, c] < threshold:
            # Assigned by the solver but below the IoU threshold:
            # treat both the track and the detection as unmatched
            unmatched_tracks.append(r)
            unmatched_dets.append(c)
        else:
            matched.append((r, c))
 
    # Tracks/detections the solver never assigned (non-square matrix)
    all_tracks = set(range(score_matrix.shape[0]))
    all_dets = set(range(score_matrix.shape[1]))
    unmatched_tracks += list(all_tracks - set(row_idx))
    unmatched_dets += list(all_dets - set(col_idx))
 
    return matched, unmatched_tracks, unmatched_dets

Appearance features

IoU alone fails when objects cross paths or get occluded. Appearance features (from a Re-ID network) let you match by what something looks like, not just where it is.
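As a sketch of how the two cues can be blended (the 50/50 weighting and the embedding source are illustrative, not taken from any particular tracker):

```python
import numpy as np

def cosine_distance(emb_a, emb_b):
    """Cosine distance between two appearance embeddings (0 = identical)."""
    a = emb_a / (np.linalg.norm(emb_a) + 1e-8)
    b = emb_b / (np.linalg.norm(emb_b) + 1e-8)
    return 1.0 - float(np.dot(a, b))

def combined_score(iou_score, emb_a, emb_b, weight=0.5):
    """Blend IoU and appearance similarity into one matching score.
    weight=0.5 is an arbitrary split, for illustration only."""
    appearance = 1.0 - cosine_distance(emb_a, emb_b)
    return weight * iou_score + (1.0 - weight) * appearance
```

In practice the embeddings would come from a Re-ID network run on each detection crop, and a matrix of these combined scores can be fed straight into Hungarian matching in place of raw IoU.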

Key trackers

SORT (Simple Online and Realtime Tracking)

The baseline. Published 2016, still relevant because it’s fast and simple.

Components:

  1. Kalman filter per track: predicts where the object will be next frame
  2. Hungarian algorithm: matches predictions to new detections by IoU
  3. Track lifecycle: create new tracks for unmatched detections, delete tracks not matched for N frames

A stripped-down version, with plain constant-velocity extrapolation standing in for the Kalman filter:

class Track:
    """Minimal track with constant-velocity extrapolation."""
    _next_id = 0
 
    def __init__(self, bbox):
        self.id = Track._next_id
        Track._next_id += 1
        self.bbox = np.array(bbox, dtype=float)  # [x1, y1, x2, y2]
        self.velocity = np.zeros(4)
        self.age = 0
        self.hits = 1
        self.time_since_update = 0
 
    def predict(self):
        """Linear motion prediction."""
        self.bbox = self.bbox + self.velocity
        self.age += 1
        self.time_since_update += 1
        return self.bbox
 
    def update(self, bbox):
        """Update track with new detection."""
        new_bbox = np.array(bbox, dtype=float)
        self.velocity = new_bbox - self.bbox
        self.bbox = new_bbox
        self.hits += 1
        self.time_since_update = 0
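For reference, the filter SORT actually uses replaces the raw extrapolation above with a predict/update cycle that also tracks uncertainty. A minimal constant-velocity Kalman filter over just the box center (a sketch -- SORT's real state is box position, scale, and aspect ratio plus their velocities, and the noise values here are arbitrary):

```python
import numpy as np

class KalmanPoint:
    """Constant-velocity Kalman filter over a box center (cx, cy).
    State: [cx, cy, vx, vy]; measurement: [cx, cy]; dt = 1 frame.
    """
    def __init__(self, cx, cy):
        self.x = np.array([cx, cy, 0.0, 0.0])     # state mean
        self.P = np.diag([1.0, 1.0, 10.0, 10.0])  # state covariance
        self.F = np.array([[1, 0, 1, 0],          # transition model
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],          # measurement model
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01                 # process noise
        self.R = np.eye(2) * 1.0                  # measurement noise

    def predict(self):
        """Advance the state one frame; returns predicted (cx, cy)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, cx, cy):
        """Correct the state with an observed center."""
        z = np.array([cx, cy])
        y = z - self.H @ self.x                   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

The payoff over raw extrapolation: when a detection is missed, repeated predict() calls coast the track along its estimated velocity while the covariance grows, so a later re-match is still plausible.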

DeepSORT

Adds a deep appearance descriptor (128-d embedding from a Re-ID CNN) to SORT. When IoU matching fails (occlusion, crossing), appearance similarity can still recover the correct ID.

Key addition: cosine distance on appearance embeddings as secondary matching criterion.

ByteTrack (2022)

Key insight: don’t throw away low-confidence detections. Many trackers only use high-confidence detections (score > 0.5). ByteTrack does two rounds:

  1. Match high-confidence detections with tracks (standard)
  2. Match remaining low-confidence detections with unmatched tracks

This recovers objects that are partially occluded or blurred (detector is uncertain, but the track knows something should be there).
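The two rounds can be sketched as a standalone function (names and thresholds here are illustrative; the real ByteTrack also runs Kalman prediction and keeps lost tracks around for re-matching):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def _iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def _match(track_boxes, det_boxes, thresh):
    """Hungarian matching on IoU; below-threshold pairs stay unmatched."""
    if not track_boxes or not det_boxes:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))
    mat = np.array([[_iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(-mat)
    matched = [(r, c) for r, c in zip(rows, cols) if mat[r, c] >= thresh]
    used_r = {r for r, _ in matched}
    used_c = {c for _, c in matched}
    un_t = [i for i in range(len(track_boxes)) if i not in used_r]
    un_d = [j for j in range(len(det_boxes)) if j not in used_c]
    return matched, un_t, un_d

def byte_associate(tracks, detections, scores,
                   high_thresh=0.5, iou_thresh=0.3):
    """Two-round BYTE-style association over plain box lists."""
    high = [i for i, s in enumerate(scores) if s >= high_thresh]
    low = [i for i, s in enumerate(scores) if s < high_thresh]

    # Round 1: high-confidence detections vs all tracks
    m1, un_tracks, un_high = _match(
        tracks, [detections[i] for i in high], iou_thresh)
    matched = [(t, high[d]) for t, d in m1]

    # Round 2: low-confidence detections vs tracks left over from round 1
    m2, un_t2, _ = _match(
        [tracks[t] for t in un_tracks],
        [detections[i] for i in low], iou_thresh)
    matched += [(un_tracks[t], low[d]) for t, d in m2]

    # Only unmatched HIGH-confidence detections may start new tracks;
    # unmatched low-confidence ones are discarded as likely background.
    return matched, [un_tracks[t] for t in un_t2], [high[d] for d in un_high]
```

Calling it with one confident detection and one occluded, low-score detection shows round 2 rescuing the second track instead of dropping it.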

Complete IoU tracker from scratch

import numpy as np
from scipy.optimize import linear_sum_assignment
 
class SimpleTracker:
    """Minimal IoU-based multi-object tracker.
    No Kalman filter, no appearance -- just IoU matching + track lifecycle.
    Relies on the helpers defined above (IoU and Hungarian matching).
    """
    def __init__(self, iou_threshold=0.3, max_lost=5):
        self.tracks = []
        self.iou_threshold = iou_threshold
        self.max_lost = max_lost
        self._next_id = 0
 
    def update(self, detections):
        """Update tracks with new detections.
        detections: list of [x1, y1, x2, y2] bounding boxes.
        Returns: list of (track_id, bbox) for active tracks.
        """
        if len(self.tracks) == 0:
            # First frame: create tracks for all detections
            for det in detections:
                self._create_track(det)
            return [(t["id"], t["bbox"]) for t in self.tracks]
 
        # Predict (simple: use last known position)
        predicted = [t["bbox"] for t in self.tracks]
 
        # Build pairwise IoU matrix (reuses iou_matrix from above)
        scores = iou_matrix(predicted, detections)
 
        # Hungarian matching: below-threshold pairs come back unmatched
        matched, unmatched_t, unmatched_d = match_hungarian(
            scores, self.iou_threshold
        )
 
        # Update matched tracks
        for t_idx, d_idx in matched:
            self.tracks[t_idx]["bbox"] = detections[d_idx]
            self.tracks[t_idx]["lost"] = 0
 
        # Increment lost counter for unmatched tracks
        for t_idx in unmatched_t:
            self.tracks[t_idx]["lost"] += 1
 
        # Create new tracks for unmatched detections
        for d_idx in unmatched_d:
            self._create_track(detections[d_idx])
 
        # Remove tracks lost too long
        self.tracks = [t for t in self.tracks if t["lost"] <= self.max_lost]
 
        return [(t["id"], t["bbox"]) for t in self.tracks]
 
    def _create_track(self, bbox):
        self.tracks.append({
            "id": self._next_id,
            "bbox": list(bbox),
            "lost": 0,
        })
        self._next_id += 1

Using Ultralytics YOLO's built-in tracking

from ultralytics import YOLO
 
model = YOLO("yolo11n.pt")
 
# Track objects in a video -- one line
results = model.track(
    source="surveillance_feed.mp4",
    tracker="bytetrack.yaml",  # or "botsort.yaml"
    show=True,                 # display results
    stream=True,               # process frame by frame (memory efficient)
)
 
for r in results:
    boxes = r.boxes
    if boxes.id is not None:
        for box, track_id in zip(boxes.xyxy, boxes.id):
            x1, y1, x2, y2 = box.tolist()
            tid = int(track_id)
            print(f"Track {tid}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")

Evaluation metrics

  • MOTA (Multi-Object Tracking Accuracy): 1 - (FP + FN + ID switches) / GT. Range: can be negative
  • MOTP (Multi-Object Tracking Precision): average IoU of matched pairs. Range: 0-1
  • IDF1: how consistently track IDs are preserved over time. Range: 0-1
  • HOTA (Higher Order Tracking Accuracy): balances detection and association quality. Range: 0-1

MOTA is the most cited, but it’s dominated by detection quality. IDF1 and HOTA better measure association quality.

# Evaluate with the motmetrics library
import motmetrics as mm
 
acc = mm.MOTAccumulator(auto_id=True)
 
# For each frame, supply ground-truth IDs, hypothesis (tracker) IDs, and a
# (n_gt, n_hyp) distance matrix, e.g. 1 - IoU. Toy single-frame example:
acc.update(
    ["gt_a", "gt_b"],      # ground-truth object IDs
    ["trk_1", "trk_2"],    # tracker-output IDs
    [[0.1, 0.9],           # gt_a is close to trk_1 ...
     [0.9, 0.2]],          # ... and gt_b is close to trk_2
)
 
mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["mota", "motp", "idf1"], name="tracker")
print(summary)

Applications

  • Drone surveillance: track vehicles and people from aerial video. Challenges: small objects, camera motion, altitude changes
  • Traffic monitoring: count vehicles, measure speed, detect incidents
  • Crowd analysis: density estimation, flow patterns, anomaly detection
  • Sports analytics: player tracking for tactics analysis
  • Military: target tracking from ISR (Intelligence, Surveillance, Reconnaissance) platforms

Self-test questions

  1. Why is tracking-by-detection the dominant paradigm over end-to-end tracking?
  2. What happens to an IoU-only tracker when two objects cross paths?
  3. How does ByteTrack improve over SORT by using low-confidence detections?
  4. When would you choose IDF1 over MOTA as your primary metric?
  5. What is the computational complexity of the Hungarian algorithm, and why does it matter at scale?

Exercises

  1. IoU tracker: Run the SimpleTracker above on synthetic data (generate 5 objects moving linearly with random noise). Visualize tracks with matplotlib.
  2. Add Kalman prediction: Replace the “use last position” prediction in SimpleTracker with a constant-velocity Kalman filter. Compare tracking quality when objects temporarily disappear.
  3. MOT evaluation: Download MOT17 dataset (small subset), run ultralytics tracker, evaluate with motmetrics. Compare ByteTrack vs BoT-SORT.