# Multi-Object Tracking

## What
Track multiple objects across video frames. The input is a video; the output is a set of trajectories — each object gets a unique ID that persists across frames.
This is harder than detection. Detection answers “what’s here right now?” Tracking answers “is this the same object I saw last frame?” — the identity problem.
## The MOT paradigm: tracking-by-detection
Almost all modern trackers follow the same two-stage pattern:
For each frame:
1. Run a detector (YOLO, Faster R-CNN) → get bounding boxes
2. Associate new detections with existing tracks → assign IDs
The detector handles “what” and “where.” The tracker handles “who is who.”
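The per-frame loop can be sketched end to end. Everything here is an illustrative stand-in: `fake_detector` plays the role of a real detector (YOLO, Faster R-CNN), and the greedy nearest-center matcher plays the role of a real association step (which would use IoU and the Hungarian algorithm, covered below):

```python
import numpy as np

def fake_detector(frame_idx):
    """Stand-in for a real detector: two boxes drifting a few pixels per frame."""
    return [
        [10 + 5 * frame_idx, 10, 50 + 5 * frame_idx, 50],
        [200, 100 + 3 * frame_idx, 240, 140 + 3 * frame_idx],
    ]

def center(box):
    return np.array([(box[0] + box[2]) / 2, (box[1] + box[3]) / 2])

tracks = {}      # id -> last known box
next_id = 0
history = []     # (frame, id, box) trajectory records

for frame_idx in range(5):
    detections = fake_detector(frame_idx)       # stage 1: detect
    assigned = set()
    for det in detections:                      # stage 2: associate
        # Greedy nearest-center match within a gating distance of 60 px
        best_id, best_dist = None, 60.0
        for tid, box in tracks.items():
            if tid in assigned:
                continue
            d = np.linalg.norm(center(det) - center(box))
            if d < best_dist:
                best_id, best_dist = tid, d
        if best_id is None:                     # unmatched detection -> new track
            best_id = next_id
            next_id += 1
        tracks[best_id] = det
        assigned.add(best_id)
        history.append((frame_idx, best_id, det))

print(sorted({tid for _, tid, _ in history}))   # [0, 1] -- two stable IDs
```

Across all five frames each object keeps the ID it was assigned in frame 0 — that persistence is the entire point of the association stage.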
### Why not track directly?
End-to-end tracking models exist (TrackFormer, MOTR) but tracking-by-detection remains dominant because:
- You can upgrade detector and tracker independently
- Detectors are well-studied and highly optimized
- Association is a well-defined combinatorial problem
## Association methods
Given N existing tracks and M new detections, how do you match them?
### IoU (Intersection over Union) matching
The simplest approach: compute IoU between every track’s predicted position and every new detection. High IoU = probably the same object.
```python
import numpy as np

def iou(box_a, box_b):
    """Compute IoU between two boxes [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_matrix(tracks, detections):
    """Compute pairwise IoU between tracks and detections."""
    n_tracks = len(tracks)
    n_dets = len(detections)
    mat = np.zeros((n_tracks, n_dets))
    for i in range(n_tracks):
        for j in range(n_dets):
            mat[i, j] = iou(tracks[i], detections[j])
    return mat
```

### Hungarian algorithm
Optimal assignment: minimize total cost (or maximize total IoU) across all track-detection pairs. This is a linear assignment problem solvable in O(n^3).
```python
from scipy.optimize import linear_sum_assignment

def match_hungarian(cost_matrix, threshold=0.3):
    """Match tracks to detections using the Hungarian algorithm.

    cost_matrix: (n_tracks, n_dets) -- higher = better match (IoU).
    Returns matched pairs, unmatched tracks, unmatched detections.
    """
    if cost_matrix.size == 0:
        return [], list(range(cost_matrix.shape[0])), list(range(cost_matrix.shape[1]))
    # linear_sum_assignment minimizes cost, so negate IoU
    row_idx, col_idx = linear_sum_assignment(-cost_matrix)
    matched, unmatched_tracks, unmatched_dets = [], [], []
    for r, c in zip(row_idx, col_idx):
        if cost_matrix[r, c] < threshold:
            # Assigned pair with too little overlap: treat both as unmatched
            unmatched_tracks.append(r)
            unmatched_dets.append(c)
        else:
            matched.append((r, c))
    all_tracks = set(range(cost_matrix.shape[0]))
    all_dets = set(range(cost_matrix.shape[1]))
    unmatched_tracks += list(all_tracks - set(row_idx))
    unmatched_dets += list(all_dets - set(col_idx))
    return matched, unmatched_tracks, unmatched_dets
```

### Appearance features
IoU alone fails when objects cross paths or get occluded. Appearance features (from a Re-ID network) let you match by what something looks like, not just where it is.
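A minimal sketch of fusing appearance with IoU. The random vectors stand in for 128-d Re-ID embeddings, and the 0.5/0.5 weighting is an illustrative choice, not a tuned value:

```python
import numpy as np

def cosine_affinity(track_embs, det_embs):
    """Pairwise cosine similarity between track and detection embeddings."""
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    return t @ d.T  # (n_tracks, n_dets), values in [-1, 1]

rng = np.random.default_rng(0)
track_embs = rng.normal(size=(2, 128))    # stand-ins for Re-ID features
# Two objects mid-crossing: detection 0 is really object 1 and vice versa
det_embs = track_embs[[1, 0]] + 0.05 * rng.normal(size=(2, 128))

app = cosine_affinity(track_embs, det_embs)
iou_sim = np.array([[0.5, 0.5],           # boxes overlap both tracks equally,
                    [0.5, 0.5]])          # so IoU alone cannot decide
combined = 0.5 * iou_sim + 0.5 * app      # fused affinity, higher = better
print(combined.argmax(axis=1))            # [1 0]: appearance resolves the swap
```

Feeding `combined` (instead of raw IoU) into the Hungarian matcher gives each track the detection that both overlaps it and looks like it.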
## Key trackers

### SORT (Simple Online and Realtime Tracking)
The baseline. Published 2016, still relevant because it’s fast and simple.
Components:
- Kalman filter per track: predicts where the object will be next frame
- Hungarian algorithm: matches predictions to new detections by IoU
- Track lifecycle: create new tracks for unmatched detections, delete tracks not matched for N frames
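The prediction step can be sketched as a constant-velocity Kalman filter over a [cx, cy, w, h] state. This is a simplification: SORT's actual filter tracks center, scale, and aspect ratio, with tuned noise covariances — the Q and R values here are illustrative:

```python
import numpy as np

class KalmanBox:
    """Constant-velocity Kalman filter over [cx, cy, w, h]."""

    def __init__(self, z):
        self.x = np.zeros(8)          # state: position (4) + velocity (4)
        self.x[:4] = z
        self.P = np.eye(8) * 10.0     # state covariance
        self.F = np.eye(8)            # transition: position += velocity
        self.F[:4, 4:] = np.eye(4)
        self.H = np.eye(4, 8)         # we observe position only
        self.Q = np.eye(8) * 0.01     # process noise (illustrative)
        self.R = np.eye(4) * 1.0      # measurement noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, z):
        y = z - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P

kf = KalmanBox(np.array([100.0, 50.0, 40.0, 80.0]))
for t in range(1, 6):               # object moves +5 px/frame in x
    kf.predict()
    kf.update(np.array([100.0 + 5 * t, 50.0, 40.0, 80.0]))
pred = kf.predict()                 # coasting: no detection this frame
print(pred[0])                      # roughly 130 -- learned velocity carries
                                    # the track through the missed detection
```

The simplified `Track` class below drops the full filter and keeps only the idea — linear extrapolation from the last observed motion.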
```python
class Track:
    """Minimal track with Kalman-style linear prediction."""

    _next_id = 0

    def __init__(self, bbox):
        self.id = Track._next_id
        Track._next_id += 1
        self.bbox = np.array(bbox, dtype=float)  # [x1, y1, x2, y2]
        self.velocity = np.zeros(4)
        self.age = 0
        self.hits = 1
        self.time_since_update = 0

    def predict(self):
        """Linear motion prediction."""
        self.bbox = self.bbox + self.velocity
        self.age += 1
        self.time_since_update += 1
        return self.bbox

    def update(self, bbox):
        """Update track with new detection."""
        new_bbox = np.array(bbox, dtype=float)
        self.velocity = new_bbox - self.bbox
        self.bbox = new_bbox
        self.hits += 1
        self.time_since_update = 0
```

### DeepSORT
Adds a deep appearance descriptor (128-d embedding from a Re-ID CNN) to SORT. When IoU matching fails (occlusion, crossing), appearance similarity can still recover the correct ID.
Key addition: cosine distance on appearance embeddings as secondary matching criterion.
### ByteTrack (2022)
Key insight: don’t throw away low-confidence detections. Many trackers only use high-confidence detections (score > 0.5). ByteTrack does two rounds:
1. Match high-confidence detections with tracks (standard)
2. Match remaining low-confidence detections with unmatched tracks
This recovers objects that are partially occluded or blurred (detector is uncertain, but the track knows something should be there).
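The two-round association can be sketched as below. This is a simplified BYTE step with no Kalman prediction or track lifecycle; the 0.5 confidence split is an illustrative default:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    u = area(a) + area(b) - inter
    return inter / u if u > 0 else 0.0

def match(tracks, dets, thresh=0.3):
    """Hungarian matching on IoU; returns matches and unmatched track indices."""
    if not tracks or not dets:
        return [], list(range(len(tracks))), list(range(len(dets)))
    m = np.array([[iou(t, d) for d in dets] for t in tracks])
    rows, cols = linear_sum_assignment(-m)
    matched = [(r, c) for r, c in zip(rows, cols) if m[r, c] >= thresh]
    mt = {r for r, _ in matched}
    md = {c for _, c in matched}
    return (matched,
            [i for i in range(len(tracks)) if i not in mt],
            [j for j in range(len(dets)) if j not in md])

def byte_associate(tracks, dets, scores, high=0.5):
    """Two-round BYTE association over track boxes and scored detections."""
    hi = [i for i, s in enumerate(scores) if s >= high]
    lo = [i for i, s in enumerate(scores) if s < high]
    # Round 1: high-confidence detections vs all tracks
    m1, left_tracks, _ = match(tracks, [dets[i] for i in hi])
    matches = [(t, hi[d]) for t, d in m1]
    # Round 2: low-confidence detections vs still-unmatched tracks
    m2, _, _ = match([tracks[t] for t in left_tracks], [dets[i] for i in lo])
    matches += [(left_tracks[t], lo[d]) for t, d in m2]
    return matches

tracks = [[0, 0, 40, 40], [100, 0, 140, 40]]
dets = [[2, 0, 42, 40], [101, 1, 141, 41]]
scores = [0.9, 0.3]     # second object half-occluded -> low detector score
print(byte_associate(tracks, dets, scores))  # [(0, 0), (1, 1)]
```

A high-confidence-only tracker would drop the 0.3-score detection and mark track 1 lost; the second round keeps it alive.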
## Complete IoU tracker from scratch
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class SimpleTracker:
    """Minimal IoU-based multi-object tracker.

    No Kalman filter, no appearance -- just IoU matching + track lifecycle.
    Relies on iou() and match_hungarian() defined earlier.
    """

    def __init__(self, iou_threshold=0.3, max_lost=5):
        self.tracks = []
        self.iou_threshold = iou_threshold
        self.max_lost = max_lost
        self._next_id = 0

    def update(self, detections):
        """Update tracks with new detections.

        detections: list of [x1, y1, x2, y2] bounding boxes.
        Returns: list of (track_id, bbox) for active tracks.
        """
        if len(self.tracks) == 0:
            # First frame: create tracks for all detections
            for det in detections:
                self._create_track(det)
            return [(t["id"], t["bbox"]) for t in self.tracks]

        # Predict (simple: use last known position)
        predicted = [t["bbox"] for t in self.tracks]

        # Build cost matrix
        n_tracks = len(predicted)
        n_dets = len(detections)
        cost = np.zeros((n_tracks, n_dets))
        for i in range(n_tracks):
            for j in range(n_dets):
                cost[i, j] = iou(predicted[i], detections[j])

        # Hungarian matching
        matched, unmatched_t, unmatched_d = match_hungarian(
            cost, self.iou_threshold
        )

        # Update matched tracks
        for t_idx, d_idx in matched:
            self.tracks[t_idx]["bbox"] = detections[d_idx]
            self.tracks[t_idx]["lost"] = 0

        # Increment lost counter for unmatched tracks
        for t_idx in unmatched_t:
            self.tracks[t_idx]["lost"] += 1

        # Create new tracks for unmatched detections
        for d_idx in unmatched_d:
            self._create_track(detections[d_idx])

        # Remove tracks lost too long
        self.tracks = [t for t in self.tracks if t["lost"] <= self.max_lost]
        return [(t["id"], t["bbox"]) for t in self.tracks]

    def _create_track(self, bbox):
        self.tracks.append({
            "id": self._next_id,
            "bbox": list(bbox),
            "lost": 0,
        })
        self._next_id += 1
```

## Using ultralytics YOLO built-in tracking
```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")

# Track objects in a video -- one line
results = model.track(
    source="surveillance_feed.mp4",
    tracker="bytetrack.yaml",  # or "botsort.yaml"
    show=True,    # display results
    stream=True,  # process frame by frame (memory efficient)
)

for r in results:
    boxes = r.boxes
    if boxes.id is not None:
        for box, track_id in zip(boxes.xyxy, boxes.id):
            x1, y1, x2, y2 = box.tolist()
            tid = int(track_id)
            print(f"Track {tid}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")
```

## Evaluation metrics
| Metric | What it measures | Range |
|---|---|---|
| MOTA | Multi-Object Tracking Accuracy: 1 - (FP + FN + ID_switches) / GT | Can be negative |
| MOTP | Multi-Object Tracking Precision: average IoU of matched pairs | 0-1 |
| IDF1 | How well track IDs are preserved over time | 0-1 |
| HOTA | Higher Order Tracking Accuracy: balances detection and association | 0-1 |
MOTA is the most widely cited metric, but it is dominated by detection quality (FP and FN vastly outnumber ID switches). IDF1 and HOTA are better measures of association quality.
```python
# Evaluate with the motmetrics library
import motmetrics as mm
import numpy as np

acc = mm.MOTAccumulator(auto_id=True)

# Per frame: ground-truth IDs/boxes vs tracker hypothesis IDs/boxes.
# motmetrics expects boxes as [x, y, width, height].
gt_boxes = np.array([[10, 10, 40, 40], [100, 100, 40, 40]])
hyp_boxes = np.array([[12, 11, 40, 40], [101, 99, 40, 40]])
dist = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
acc.update([1, 2], [1, 2], dist)  # gt ids, hypothesis ids, distance matrix

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["mota", "motp", "idf1"], name="tracker")
print(summary)
```

## Applications
- Drone surveillance: track vehicles and people from aerial video. Challenges: small objects, camera motion, altitude changes
- Traffic monitoring: count vehicles, measure speed, detect incidents
- Crowd analysis: density estimation, flow patterns, anomaly detection
- Sports analytics: player tracking for tactics analysis
- Military: target tracking from ISR (Intelligence, Surveillance, Reconnaissance) platforms
## Self-test questions
- Why is tracking-by-detection the dominant paradigm over end-to-end tracking?
- What happens to an IoU-only tracker when two objects cross paths?
- How does ByteTrack improve over SORT by using low-confidence detections?
- When would you choose IDF1 over MOTA as your primary metric?
- What is the computational complexity of the Hungarian algorithm, and why does it matter at scale?
## Exercises
- IoU tracker: Run the SimpleTracker above on synthetic data (generate 5 objects moving linearly with random noise). Visualize tracks with matplotlib.
- Add Kalman prediction: Replace the “use last position” prediction in SimpleTracker with a constant-velocity Kalman filter. Compare tracking quality when objects temporarily disappear.
- MOT evaluation: Download MOT17 dataset (small subset), run ultralytics tracker, evaluate with motmetrics. Compare ByteTrack vs BoT-SORT.
## Links
- Object Detection — the detection stage
- Image Classification — appearance features come from classifiers
- Optical Flow — alternative motion cue for tracking
- Video Understanding — tracking feeds into video understanding
- Tutorial - Object Tracking Pipeline — hands-on tutorial