Object Detection

What Is Object Detection

Object detection combines two tasks:

  1. Classification — What is in this region?
  2. Localization — Where is it?

Output: A set of bounding boxes, each with a class label and confidence score.

Image → [Detector] → [(class, confidence, x1, y1, x2, y2), ...]

This is fundamentally harder than image classification (single label for entire image) because you need to handle:

  • Variable number of objects per image
  • Multiple classes in one image
  • Exact location required, not just “something in this image”

The Evolution

| Era | Approach | Speed | Accuracy | Limitation / Notes |
|---|---|---|---|---|
| 2012-2015 | Region proposals + classifier (R-CNN family) | Very slow | Moderate | ~2000 windows per image |
| 2015 | Fast R-CNN (Girshick 2015) | Moderate | High | Region proposals still slow |
| 2015 | Faster R-CNN (Ren et al. 2015) | Moderate | Very high | Two-stage still slow |
| 2016 | SSD (Liu et al. 2016) | Fast | Good | Misses small objects |
| 2016 | YOLO v1 (Redmon et al. 2016) | Very fast | Moderate | Grid-based, misses small objects |
| 2018+ | YOLO v3-v8 | Fast | Good-High | One-stage is good enough |
| 2020 | DETR (Carion et al. 2020) | Moderate | High | Transformer-based, no anchors |
| 2023+ | YOLO v8/v11 | Very fast | High | Best for production |

Two Paradigms

Two-Stage Detectors

Region Proposal + Classification:

  1. Proposal network suggests regions that might contain objects (~300-2000)
  2. ROI pooling extracts features for each region
  3. Classifier assigns class + refines bounding box

Faster R-CNN is the canonical two-stage detector:

Image → CNN backbone → Feature Maps → 
  RPN (Region Proposal Network) → ROIs → 
  ROI Pooler → FC layers → 
  Class + Box regression

Pros: Highest accuracy, especially for small objects
Cons: Slower (~5-10 FPS for large images)

One-Stage Detectors

Direct prediction from features:

Process the image once, predict boxes and classes from feature maps at multiple scales.

YOLO (You Only Look Once) is the canonical one-stage:

Image → CNN backbone → Feature maps at different scales (P3, P4, P5) →
  For each grid cell: predict (class_probs, objectness, bounding_box)

YOLO divides the image into a grid (e.g., 13×13 for 416×416 input). Each grid cell predicts:

  • B bounding boxes (each with x, y, w, h, confidence)
  • C class probabilities
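To make the per-cell layout concrete, here is a minimal sketch of the classic YOLOv1/v2-style output tensor shape: each cell carries B boxes (x, y, w, h, confidence) plus C class probabilities. The function name is illustrative; modern anchor-free heads organize their outputs differently.

```python
def yolo_head_shape(grid, num_boxes, num_classes):
    """Per-scale output tensor shape for a classic YOLO head:
    each of grid x grid cells predicts num_boxes boxes
    (x, y, w, h, confidence) plus num_classes class probabilities."""
    return (grid, grid, num_boxes * 5 + num_classes)
```

For example, a 13×13 grid with 3 boxes per cell and 80 COCO classes yields a 13×13×95 tensor.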

Pros: Fast (~30-100+ FPS), good for real-time
Cons: Historically lower accuracy on small objects (fixed grid limitation)

Key Innovation: FPN (Feature Pyramid Network)

Multi-scale feature fusion — detect objects at different scales using feature maps from different backbone layers:

High-res (small receptive field) → small objects
Low-res (large receptive field) → large objects

Modern YOLOs use FPN + PAN (Path Aggregation Network) for better multi-scale detection.
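The top-down fusion at the heart of FPN can be sketched in a few lines. This toy version assumes channel counts are already matched (the 1×1 lateral convolutions are omitted) and uses nearest-neighbor upsampling on single-channel feature maps; a real implementation operates on multi-channel tensors with learned convolutions.

```python
import numpy as np

def fpn_top_down(c3, c4, c5):
    """Minimal FPN sketch: upsample the coarser map 2x (nearest neighbor)
    and add the lateral map, so fine levels inherit coarse semantics."""
    def up2(x):
        # Nearest-neighbor 2x upsampling via row/column repetition
        return x.repeat(2, axis=0).repeat(2, axis=1)
    p5 = c5                  # coarsest level: large objects
    p4 = c4 + up2(p5)        # fuse semantics from above
    p3 = c3 + up2(p4)        # finest level: small objects
    return p3, p4, p5
```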

Key Concepts

Intersection over Union (IoU)

Measures overlap between predicted and ground truth boxes:

IoU = Area(Overlap) / Area(Union)
    Predicted Box
    ┌─────────┐
    │   ┌───┐ │
    │   │GT │ │
    │   └───┘ │
    └─────────┘
    
IoU = overlap_area / (pred_area + gt_area - overlap_area)

IoU = 1.0 is perfect. IoU > 0.5 is usually considered a “hit.”
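The formula above translates directly into code. A minimal sketch, assuming boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

Identical boxes give 1.0; disjoint boxes give 0.0; a ground-truth box fully inside a larger prediction gives gt_area / pred_area.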

Anchor Boxes

Pre-defined bounding box shapes (width, height) that the detector regresses from. Instead of predicting absolute coordinates, the detector learns offsets from anchor templates.

Why anchors? Reduces the search space. Instead of predicting arbitrary boxes, learn small adjustments to a set of templates.

Typical anchor configuration (YOLOv3):

  • 3 scales per feature level
  • 3 aspect ratios per scale
  • 9 anchors total
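A sketch of how a YOLOv3-style head decodes its raw outputs relative to an anchor: the center offsets are squashed by a sigmoid so they stay inside the grid cell, and width/height are exponential scalings of the anchor dimensions. The function and parameter names here are illustrative.

```python
import math

def decode_yolo_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, stride):
    """Decode raw YOLOv3-style predictions into a pixel-space box.
    (cx, cy) is the grid cell index; anchor_w/h are anchor dims in pixels."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = (sig(tx) + cx) * stride      # center x: offset within the cell
    by = (sig(ty) + cy) * stride      # center y
    bw = anchor_w * math.exp(tw)      # width: scales the anchor template
    bh = anchor_h * math.exp(th)      # height
    return bx, by, bw, bh
```

With zero offsets, the decoded box sits at the cell center with exactly the anchor's width and height, which is why small learned adjustments suffice.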

Modern YOLO (v8+) uses anchor-free detection — predicts directly from points, reducing the need for anchor tuning.

Non-Maximum Suppression (NMS)

After detection, you get many overlapping boxes for the same object. NMS removes duplicates:

1. Sort all boxes by confidence (highest first)
2. For each box:
   - If IoU with any previously-kept box > threshold (e.g., 0.5), suppress it
   - Otherwise, keep it
3. Return remaining boxes
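The three steps above can be sketched as a greedy loop (a simple O(n²) version; production code uses vectorized or built-in NMS):

```python
def _iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: returns indices of boxes kept, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it doesn't overlap any higher-confidence kept box
        if all(_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```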

Mean Average Precision (mAP)

The standard metric for object detection:

  1. Precision = TP / (TP + FP) at each confidence threshold
  2. Recall = TP / (TP + FN) = TP / Total ground truth
  3. Average Precision (AP) = Area under precision-recall curve for each class
  4. mAP = Mean of AP across all classes
  5. mAP@0.5 = mAP at IoU threshold 0.5 (lenient)
  6. mAP@0.5:0.95 = mAP averaged over IoU thresholds 0.5 to 0.95 (strict)

mAP@0.5 is the most commonly reported metric. mAP@0.5:0.95 is more rigorous.
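The AP computation in steps 1-3 can be sketched as a single sweep over confidence-sorted detections, accumulating rectangles under the precision-recall curve. This is a simplified approximation for illustration; COCO's official evaluation uses 101-point interpolation and per-class matching rules.

```python
def average_precision(scores, is_tp, num_gt):
    """Simplified AP: sort detections by confidence, sweep the PR curve,
    and accumulate precision * delta-recall at each true positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt        # num_gt = total ground-truth objects
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

mAP is then the mean of this value across classes (and, for mAP@0.5:0.95, across IoU thresholds).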

YOLO Deep Dive

Modern YOLO Architecture (v8/v11)

YOLOv8 (2023) introduced:

  • Anchor-free detection head on a CSP-style backbone with a PAN neck
  • Decoupled classification and regression heads (like FCOS)
  • Binary cross-entropy for classification; CIoU + distribution focal loss for box regression
  • Mosaic augmentation during training

Model sizes (COCO pretrained):

| Model | mAP@0.5:0.95 | FPS (V100) |
|---|---|---|
| YOLOv8n (nano) | 37.4 | 420 |
| YOLOv8s (small) | 44.9 | 220 |
| YOLOv8m (medium) | 50.2 | 90 |
| YOLOv8l (large) | 52.9 | 60 |
| YOLOv8x (xlarge) | 54.1 | 20 |

Using YOLOv8 for Your Own Data

from ultralytics import YOLO
 
# Load pretrained model
model = YOLO("yolov8m.pt")
 
# Train on custom data (YOLO format: images/ + labels/)
# annotations.yaml:
# path: ./data
# train: images/train
# val: images/val
# nc: 3
# names: ['person', 'car', 'truck']
 
results = model.train(data="annotations.yaml", epochs=100, imgsz=640)
 
# Validate
metrics = model.val()
 
# Predict
results = model("test_image.jpg")
 
# Export to ONNX for deployment
model.export(format="onnx")

YOLO Format (Label Format)

One .txt file per image, same name:

<class_id> <x_center> <y_center> <width> <height>

All values normalized 0-1.
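Converting a pixel-space corner box into a YOLO label line is a common preprocessing step. A minimal sketch (the function name is illustrative):

```python
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box to a normalized
    YOLO label line: <class_id> <x_center> <y_center> <width> <height>."""
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```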

Beyond YOLO

DETR (End-to-End Object Detection with Transformers)

Carion et al. (2020) applied the transformer encoder-decoder to detection:

Image → CNN backbone → Flatten → Transformer Encoder → 
Query embeddings → Transformer Decoder → 
BBox prediction head + Class prediction head

Key innovation: No anchor boxes, no NMS (Hungarian-algorithm bipartite matching during training). Direct set prediction.

Pros: Simple pipeline, no hand-crafted components
Cons: Slower to train (requires a longer schedule), worse on small objects
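The set-matching idea can be illustrated with a tiny brute-force version: given a square cost matrix between predictions and ground truths, find the one-to-one assignment with minimum total cost. Real DETR uses the Hungarian algorithm (polynomial time) on a cost combining class score and box overlap; this permutation search is only for intuition.

```python
from itertools import permutations

def match_predictions(cost):
    """Minimum-cost one-to-one assignment of predictions to ground truths.
    cost[i][j] = matching cost of prediction i to ground truth j.
    Brute force over permutations; DETR uses the Hungarian algorithm."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best, list(best_perm)
```

Because the matching is one-to-one, each ground truth claims exactly one prediction, so duplicate boxes are penalized during training and NMS becomes unnecessary at inference.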

RT-DETR (Real-Time DETR)

2023 — Real-time version using:

  • Hybrid encoder (CNN + attention)
  • Query selection (keep most confident regions)
  • Faster convergence than original DETR

Detectron2 (Facebook AI)

PyTorch-based framework implementing Faster R-CNN, RetinaNet, Mask R-CNN, etc. Research standard, not production-optimized.

Common Pitfalls

Data Quality

  • Bounding box accuracy matters more than model choice
  • Missing annotations are deadly — model learns to predict “nothing” for poorly annotated classes
  • Class imbalance — if you have 10,000 cars and 10 bicycles, model ignores bicycles

Small Objects

  • Low resolution input loses small objects — use larger input size (1280 vs 640)
  • Multi-scale detection is essential — FPN/PAN architecture
  • Augmentation that preserves small objects (copy-paste, mosaic can help)

Deployment

  • INT8 quantization can speed up 2-3x with ~1% mAP loss
  • TensorRT is the standard for GPU inference optimization
  • ONNX for cross-platform deployment

Key Papers