Object Detection

What Is Object Detection

Object detection combines two tasks:

  1. Classification — What is in this region?
  2. Localization — Where is it?

Output: A set of bounding boxes, each with a class label and confidence score.

Image → [Detector] → [(class, confidence, x1, y1, x2, y2), ...]

This is fundamentally harder than image classification (single label for entire image) because you need to handle:

  • Variable number of objects per image
  • Multiple classes in one image
  • Exact location required, not just “something in this image”

The Evolution

| Era | Approach | Speed | Accuracy | Limitation / Notes |
|---|---|---|---|---|
| 2012-2015 | Region proposals + classifier (R-CNN family) | Very slow | Moderate | ~2000 windows per image |
| 2015 | Fast R-CNN (Girshick 2015) | Moderate | High | Region proposals still slow |
| 2015 | Faster R-CNN (Ren et al. 2015) | Moderate | Very high | Two-stage still slow |
| 2016 | SSD (Liu et al. 2016) | Fast | Good | Misses small objects |
| 2016 | YOLO v1 (Redmon et al. 2016) | Very fast | Moderate | Grid-based, misses small objects |
| 2018+ | YOLO v3-v8 | Fast | Good-High | One-stage is good enough |
| 2020 | DETR (Carion et al. 2020) | Moderate | High | Transformer-based, no anchors |
| 2023+ | YOLO v8/v11 | Very fast | High | Best for production |

Two Paradigms

Two-Stage Detectors

Region Proposal + Classification:

  1. Proposal network suggests regions that might contain objects (~300-2000)
  2. ROI pooling extracts features for each region
  3. Classifier assigns class + refines bounding box

Faster R-CNN is the canonical two-stage detector:

Image → CNN backbone → Feature Maps → 
  RPN (Region Proposal Network) → ROIs → 
  ROI Pooler → FC layers → 
  Class + Box regression

Pros: Highest accuracy, especially for small objects
Cons: Slower (~5-10 FPS for large images)

One-Stage Detectors

Direct prediction from features:

Process the image once, predict boxes and classes from feature maps at multiple scales.

YOLO (You Only Look Once) is the canonical one-stage:

Image → CNN backbone → Feature maps at different scales (P3, P4, P5) →
  For each grid cell: predict (class_probs, objectness, bounding_box)

YOLO divides the image into a grid (e.g., 13×13 for 416×416 input). Each grid cell predicts:

  • B bounding boxes (each with x, y, w, h, confidence)
  • C class probabilities
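To make the per-cell layout concrete, here is a minimal sketch of the classic YOLOv1/v2-style output tensor shape: each cell carries B boxes (x, y, w, h, confidence) plus C class probabilities. The function name is illustrative; modern anchor-free heads organize their outputs differently.

```python
def yolo_head_shape(grid, num_boxes, num_classes):
    """Per-scale output tensor shape for a classic YOLO head:
    each of grid x grid cells predicts num_boxes boxes
    (x, y, w, h, confidence) plus num_classes class probabilities."""
    return (grid, grid, num_boxes * 5 + num_classes)
```

For example, a 13×13 grid with 3 boxes per cell and 80 COCO classes yields a 13×13×95 tensor.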

Pros: Fast (~30-100+ FPS), good for real-time
Cons: Historically lower accuracy on small objects (fixed grid limitation)

Key Innovation: FPN (Feature Pyramid Network)

Multi-scale feature fusion — detect objects at different scales using feature maps from different backbone layers:

High-res (small receptive field) → small objects
Low-res (large receptive field) → large objects

Modern YOLOs use FPN + PAN (Path Aggregation Network) for better multi-scale detection.
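The top-down fusion at the heart of FPN can be sketched in a few lines. This toy version assumes channel counts are already matched (the 1×1 lateral convolutions are omitted) and uses nearest-neighbor upsampling on single-channel feature maps; a real implementation operates on multi-channel tensors with learned convolutions.

```python
import numpy as np

def fpn_top_down(c3, c4, c5):
    """Minimal FPN sketch: upsample the coarser map 2x (nearest neighbor)
    and add the lateral map, so fine levels inherit coarse semantics."""
    def up2(x):
        # Nearest-neighbor 2x upsampling via row/column repetition
        return x.repeat(2, axis=0).repeat(2, axis=1)
    p5 = c5                  # coarsest level: large objects
    p4 = c4 + up2(p5)        # fuse semantics from above
    p3 = c3 + up2(p4)        # finest level: small objects
    return p3, p4, p5
```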

Key Concepts

Intersection over Union (IoU)

Measures overlap between predicted and ground truth boxes:

IoU = Area(Overlap) / Area(Union)
    Predicted Box
    ┌─────────┐
    │   ┌───┐ │
    │   │GT │ │
    │   └───┘ │
    └─────────┘
    
IoU = overlap_area / (pred_area + gt_area - overlap_area)

IoU = 1.0 is perfect. IoU > 0.5 is usually considered a “hit.”
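The formula above translates directly into code. A minimal sketch, assuming boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

Identical boxes give 1.0; disjoint boxes give 0.0; a ground-truth box fully inside a larger prediction gives gt_area / pred_area.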

Anchor Boxes

Pre-defined bounding box shapes (width, height) that the detector regresses from. Instead of predicting absolute coordinates, the detector learns offsets from anchor templates.

Why anchors? Reduces the search space. Instead of predicting arbitrary boxes, learn small adjustments to a set of templates.

Typical anchor configuration (YOLOv3):

  • 3 scales per feature level
  • 3 aspect ratios per scale
  • 9 anchors total
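A sketch of how a YOLOv3-style head decodes its raw outputs relative to an anchor: the center offsets are squashed by a sigmoid so they stay inside the grid cell, and width/height are exponential scalings of the anchor dimensions. The function and parameter names here are illustrative.

```python
import math

def decode_yolo_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, stride):
    """Decode raw YOLOv3-style predictions into a pixel-space box.
    (cx, cy) is the grid cell index; anchor_w/h are anchor dims in pixels."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = (sig(tx) + cx) * stride      # center x: offset within the cell
    by = (sig(ty) + cy) * stride      # center y
    bw = anchor_w * math.exp(tw)      # width: scales the anchor template
    bh = anchor_h * math.exp(th)      # height
    return bx, by, bw, bh
```

With zero offsets, the decoded box sits at the cell center with exactly the anchor's width and height, which is why small learned adjustments suffice.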

Modern YOLO (v8+) uses anchor-free detection — predicts directly from points, reducing the need for anchor tuning.

Non-Maximum Suppression (NMS)

After detection, you get many overlapping boxes for the same object. NMS removes duplicates:

1. Sort all boxes by confidence (highest first)
2. For each box:
   - If IoU with any previously-kept box > threshold (e.g., 0.5), suppress it
   - Otherwise, keep it
3. Return remaining boxes
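The three steps above can be sketched as a greedy loop (a simple O(n²) version; production code uses vectorized or built-in NMS):

```python
def _iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: returns indices of boxes kept, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it doesn't overlap any higher-confidence kept box
        if all(_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```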

Mean Average Precision (mAP)

The standard metric for object detection:

  1. Precision = TP / (TP + FP) at each confidence threshold
  2. Recall = TP / (TP + FN) = TP / Total ground truth
  3. Average Precision (AP) = Area under precision-recall curve for each class
  4. mAP = Mean of AP across all classes
  5. mAP@0.5 = mAP at IoU threshold 0.5 (lenient)
  6. mAP@0.5:0.95 = mAP averaged over IoU thresholds 0.5 to 0.95 (strict)

mAP@0.5 is the most commonly reported metric. mAP@0.5:0.95 is more rigorous.
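The AP computation in steps 1-3 can be sketched as a single sweep over confidence-sorted detections, accumulating rectangles under the precision-recall curve. This is a simplified approximation for illustration; COCO's official evaluation uses 101-point interpolation and per-class matching rules.

```python
def average_precision(scores, is_tp, num_gt):
    """Simplified AP: sort detections by confidence, sweep the PR curve,
    and accumulate precision * delta-recall at each true positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt        # num_gt = total ground-truth objects
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

mAP is then the mean of this value across classes (and, for mAP@0.5:0.95, across IoU thresholds).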

YOLO Deep Dive

Modern YOLO Architecture (v8/v11)

YOLOv8 (2023) introduced:

  • Anchor-free detection head on a CSP-style backbone with a PAN neck
  • Decoupled classification and regression heads (like FCOS)
  • Binary cross-entropy for classification; CIoU + distribution focal loss for box regression
  • Mosaic augmentation during training

Model sizes (COCO pretrained):

| Model | mAP@0.5:0.95 | FPS (V100) |
|---|---|---|
| YOLOv8n (nano) | 37.4 | 420 |
| YOLOv8s (small) | 44.9 | 220 |
| YOLOv8m (medium) | 50.2 | 90 |
| YOLOv8l (large) | 52.9 | 60 |
| YOLOv8x (xlarge) | 54.1 | 20 |

Using YOLOv8 for Your Own Data

from ultralytics import YOLO
 
# Load pretrained model
model = YOLO("yolov8m.pt")
 
# Train on custom data (YOLO format: images/ + labels/)
# annotations.yaml:
# path: ./data
# train: images/train
# val: images/val
# nc: 3
# names: ['person', 'car', 'truck']
 
results = model.train(data="annotations.yaml", epochs=100, imgsz=640)
 
# Validate
metrics = model.val()
 
# Predict
results = model("test_image.jpg")
 
# Export to ONNX for deployment
model.export(format="onnx")

YOLO Format (Label Format)

One .txt file per image, same name:

<class_id> <x_center> <y_center> <width> <height>

All values normalized 0-1.
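Converting a pixel-space corner box into a YOLO label line is a common preprocessing step. A minimal sketch (the function name is illustrative):

```python
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box to a normalized
    YOLO label line: <class_id> <x_center> <y_center> <width> <height>."""
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```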

Beyond YOLO

DETR (End-to-End Object Detection with Transformers)

Carion et al. (2020) applied the transformer encoder-decoder to detection:

Image → CNN backbone → Flatten → Transformer Encoder → 
Query embeddings → Transformer Decoder → 
BBox prediction head + Class prediction head

Key innovation: No anchor boxes, no NMS (Hungarian-algorithm bipartite matching during training). Direct set prediction.

Pros: Simple pipeline, no hand-crafted components
Cons: Slower to train (requires a longer schedule), worse on small objects
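The set-matching idea can be illustrated with a tiny brute-force version: given a square cost matrix between predictions and ground truths, find the one-to-one assignment with minimum total cost. Real DETR uses the Hungarian algorithm (polynomial time) on a cost combining class score and box overlap; this permutation search is only for intuition.

```python
from itertools import permutations

def match_predictions(cost):
    """Minimum-cost one-to-one assignment of predictions to ground truths.
    cost[i][j] = matching cost of prediction i to ground truth j.
    Brute force over permutations; DETR uses the Hungarian algorithm."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best, list(best_perm)
```

Because the matching is one-to-one, each ground truth claims exactly one prediction, so duplicate boxes are penalized during training and NMS becomes unnecessary at inference.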

RT-DETR (Real-Time DETR)

2023 — Real-time version using:

  • Hybrid encoder (CNN + attention)
  • Query selection (keep most confident regions)
  • Faster convergence than original DETR

Detectron2 (Facebook AI)

PyTorch-based framework implementing Faster R-CNN, RetinaNet, Mask R-CNN, etc. Research standard, not production-optimized.

Common Pitfalls

Data Quality

  • Bounding box accuracy matters more than model choice
  • Missing annotations are deadly — model learns to predict “nothing” for poorly annotated classes
  • Class imbalance — if you have 10,000 cars and 10 bicycles, model ignores bicycles

Small Objects

  • Low resolution input loses small objects — use larger input size (1280 vs 640)
  • Multi-scale detection is essential — FPN/PAN architecture
  • Augmentation that preserves small objects (copy-paste, mosaic can help)

Deployment

  • INT8 quantization can speed up 2-3x with ~1% mAP loss
  • TensorRT is the standard for GPU inference optimization
  • ONNX for cross-platform deployment

Key Papers