Object Detection
What Is Object Detection
Object detection combines two tasks:
- Classification — What is in this region?
- Localization — Where is it?
Output: A set of bounding boxes, each with a class label and confidence score.
Image → [Detector] → [(class, confidence, x1, y1, x2, y2), ...]
This is fundamentally harder than image classification (single label for entire image) because you need to handle:
- Variable number of objects per image
- Multiple classes in one image
- Exact location required, not just “something in this image”
The Evolution
| Era | Approach | Speed | Accuracy | Notes |
|---|---|---|---|---|
| 2014 | R-CNN (Girshick et al. 2014) — region proposals + per-region CNN | Very slow | Moderate | ~2000 proposals per image, each classified separately |
| 2015 | Fast R-CNN (Girshick 2015) | Moderate | High | Shared features, but region proposals still slow |
| 2015 | Faster R-CNN (Ren et al. 2015) | Moderate | Very high | Learned proposals (RPN); two-stage still slow |
| 2016 | SSD (Liu et al. 2016) | Fast | Good | Misses small objects |
| 2016 | YOLO v1 (Redmon et al. 2016) | Very fast | Moderate | Grid-based; misses small objects |
| 2018+ | YOLO v3-v8 | Fast | Good-High | One-stage accuracy becomes competitive |
| 2020 | DETR (Carion et al. 2020) | Moderate | High | Transformer-based, no anchors; slow to train |
| 2023+ | YOLO v8/v11 | Very fast | High | Common choice for production |
Two Paradigms
Two-Stage Detectors
Region Proposal + Classification:
- Proposal network suggests regions that might contain objects (~300-2000)
- ROI pooling extracts features for each region
- Classifier assigns class + refines bounding box
Faster R-CNN is the canonical two-stage detector:
Image → CNN backbone → Feature Maps →
RPN (Region Proposal Network) → ROIs →
ROI Pooler → FC layers →
Class + Box regression
Pros: Highest accuracy, especially for small objects
Cons: Slower (~5-10 FPS for large images)
One-Stage Detectors
Direct prediction from features:
Process the image once, predict boxes and classes from feature maps at multiple scales.
YOLO (You Only Look Once) is the canonical one-stage:
Image → CNN backbone → Feature maps at different scales (P3, P4, P5) →
For each grid cell: predict (class_probs, objectness, bounding_box)
YOLO divides the image into a grid (e.g., 13×13 for 416×416 input). Each grid cell predicts:
- B bounding boxes (each with x, y, w, h, confidence)
- C class probabilities
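As a sanity check on the output size, a YOLOv1-style head emits S × S × (B·5 + C) numbers per image (the original paper used S = 7, B = 2, C = 20 on PASCAL VOC; the values below are just that configuration):

```python
def yolo_v1_output_size(S, B, C):
    # Each of the S*S grid cells predicts B boxes (x, y, w, h, confidence)
    # plus C class probabilities shared across the cell's boxes.
    return S * S * (B * 5 + C)

print(yolo_v1_output_size(7, 2, 20))  # 7*7*30 = 1470
```

Later YOLO versions predict per-scale tensors instead (e.g., 13×13, 26×26, 52×52 for a 416×416 input), but the same cell-level bookkeeping applies.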
Pros: Fast (~30-100+ FPS), good for real-time
Cons: Historically lower accuracy on small objects (fixed grid limitation)
Key Innovation: FPN (Feature Pyramid Network)
Multi-scale feature fusion — detect objects at different scales using feature maps from different backbone layers:
High-res (small receptive field) → small objects
Low-res (large receptive field) → large objects
Modern YOLOs use FPN + PAN (Path Aggregation Network) for better multi-scale detection.
Key Concepts
Intersection over Union (IoU)
Measures overlap between predicted and ground truth boxes:
IoU = Area(Overlap) / Area(Union)
Predicted Box
┌─────────┐
│ ┌───┐ │
│ │GT │ │
│ └───┘ │
└─────────┘
IoU = overlap_area / (pred_area + gt_area - overlap_area)
IoU = 1.0 is perfect. IoU > 0.5 is usually considered a “hit.”
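The formula above can be sketched as a small function (corner format `(x1, y1, x2, y2)` assumed):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```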
Anchor Boxes
Pre-defined bounding box shapes (width, height) that the detector regresses from. Instead of predicting absolute coordinates, the detector learns offsets from anchor templates.
Why anchors? Reduces the search space. Instead of predicting arbitrary boxes, learn small adjustments to a set of templates.
Typical anchor configuration (YOLOv3):
- 3 scales per feature level
- 3 aspect ratios per scale
- 9 anchors total
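For a YOLOv2/v3-style head, the predicted offsets (tx, ty, tw, th) are decoded against a grid cell (cx, cy) and an anchor template (pw, ph). A sketch of those decode equations, in grid units:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode YOLOv2/v3-style offsets into a box center and size (grid units)."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = cx + sigmoid(tx)   # center x, constrained to stay inside this grid cell
    by = cy + sigmoid(ty)   # center y
    bw = pw * math.exp(tw)  # width as a multiplicative adjustment of the anchor
    bh = ph * math.exp(th)  # height likewise
    return bx, by, bw, bh

# Zero offsets recover the anchor shape, centered in cell (5, 5)
print(decode_box(0, 0, 0, 0, cx=5, cy=5, pw=3.0, ph=2.0))  # (5.5, 5.5, 3.0, 2.0)
```

This is why anchors shrink the search space: the network only learns small corrections to reasonable templates.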
Modern YOLO (v8+) uses anchor-free detection — predicts directly from points, reducing the need for anchor tuning.
Non-Maximum Suppression (NMS)
After detection, you get many overlapping boxes for the same object. NMS removes duplicates:
1. Sort all boxes by confidence (highest first)
2. For each box:
- If IoU with any previously-kept box > threshold (e.g., 0.5), suppress it
- Otherwise, keep it
3. Return remaining boxes
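The greedy procedure above, as a minimal per-class sketch (boxes in `(x1, y1, x2, y2)` format):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: list of (x1, y1, x2, y2); returns indices of kept boxes."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    # 1. Sort by confidence, highest first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # 2. Suppress if it overlaps any already-kept box too much
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # 3. Remaining boxes

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] — box 1 overlaps box 0 heavily and is suppressed
```

In practice NMS is run per class, so overlapping boxes of different classes both survive.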
Mean Average Precision (mAP)
The standard metric for object detection:
- Precision = TP / (TP + FP) at each confidence threshold
- Recall = TP / (TP + FN) = TP / Total ground truth
- Average Precision (AP) = Area under precision-recall curve for each class
- mAP = Mean of AP across all classes
- mAP@0.5 = mAP at IoU threshold 0.5 (lenient)
- mAP@0.5:0.95 = mAP averaged over IoU thresholds 0.5 to 0.95 (strict)
mAP@0.5 is the most commonly reported metric. mAP@0.5:0.95 is more rigorous.
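A minimal single-class AP sketch, assuming each detection has already been matched against ground truth (at some IoU threshold) and labeled TP or FP:

```python
def average_precision(detections, num_gt):
    """AP for one class from (confidence, is_true_positive) pairs.

    Sweeps the confidence threshold, builds the precision-recall curve,
    makes precision monotonically non-increasing (VOC/COCO style), and
    integrates the area under the curve.
    """
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for _, is_tp in detections:
        tp += is_tp
        fp += not is_tp
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Interpolate: precision at each recall is the max precision to the right
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the stepwise precision-recall curve
    return sum((recalls[i + 1] - recalls[i]) * precisions[i + 1]
               for i in range(len(recalls) - 1))

# 2 ground truths; a FP ranked between two TPs dents precision mid-curve
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```

mAP then averages this AP over all classes; mAP@0.5:0.95 additionally averages over IoU thresholds 0.5, 0.55, …, 0.95.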
YOLO Deep Dive
Modern YOLO Architecture (v8/v11)
YOLOv8 (2023) introduced:
- Anchor-free detection head — predicts box centers directly, no anchor templates
- CSP-based backbone + PAN neck for multi-scale features
- Decoupled classification and regression heads (like FCOS)
- Binary cross-entropy for classification; CIoU + distribution focal loss (DFL) for box regression
- Mosaic augmentation during training (disabled for the final epochs)
Model sizes (COCO pretrained):
| Model | mAP@0.5:0.95 | FPS (V100) |
|---|---|---|
| YOLOv8n (nano) | 37.4 | 420 |
| YOLOv8s (small) | 44.9 | 220 |
| YOLOv8m (medium) | 50.2 | 90 |
| YOLOv8l (large) | 52.9 | 60 |
| YOLOv8x (xlarge) | 54.1 | 20 |
Using YOLOv8 for Your Own Data
from ultralytics import YOLO
# Load pretrained model
model = YOLO("yolov8m.pt")
# Train on custom data (YOLO format: images/ + labels/)
# annotations.yaml:
# path: ./data
# train: images/train
# val: images/val
# nc: 3
# names: ['person', 'car', 'truck']
results = model.train(data="annotations.yaml", epochs=100, imgsz=640)
# Validate
metrics = model.val()
# Predict
results = model("test_image.jpg")
# Export to ONNX for deployment
model.export(format="onnx")
YOLO Format (Label Format)
One .txt file per image, same name:
<class_id> <x_center> <y_center> <width> <height>
All values normalized 0-1.
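Converting one label line back to pixel coordinates is a common need (e.g., for visualization). A small sketch, assuming the corner format `(x1, y1, x2, y2)`:

```python
def yolo_to_pixels(line, img_w, img_h):
    """Convert one YOLO label line to (class_id, x1, y1, x2, y2) in pixels."""
    class_id, xc, yc, w, h = line.split()
    # Denormalize from 0-1 to pixel units
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return (int(class_id),
            xc - w / 2, yc - h / 2,   # top-left corner
            xc + w / 2, yc + h / 2)   # bottom-right corner

print(yolo_to_pixels("0 0.5 0.5 0.25 0.5", img_w=640, img_h=480))
# (0, 240.0, 120.0, 400.0, 360.0)
```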
Beyond YOLO
DETR (End-to-End Object Detection with Transformers)
Carion et al. (2020) applied the transformer encoder-decoder to detection:
Image → CNN backbone → Flatten → Transformer Encoder →
Query embeddings → Transformer Decoder →
BBox prediction head + Class prediction head
Key innovation: No anchor boxes, no NMS (Hungarian-algorithm bipartite matching during training). Direct set prediction.
Pros: Simple pipeline, no hand-crafted components Cons: Slower to train (requires longer schedule), worse on small objects
RT-DETR (Real-Time DETR)
2023 — Real-time version using:
- Hybrid encoder (CNN + attention)
- Query selection (keep most confident regions)
- Faster convergence than original DETR
Detectron2 (Facebook AI)
PyTorch-based framework implementing Faster R-CNN, RetinaNet, Mask R-CNN, etc. Research standard, not production-optimized.
Common Pitfalls
Data Quality
- Bounding box accuracy matters more than model choice
- Missing annotations are deadly — model learns to predict “nothing” for poorly annotated classes
- Class imbalance — if you have 10,000 cars and 10 bicycles, model ignores bicycles
Small Objects
- Low resolution input loses small objects — use larger input size (1280 vs 640)
- Multi-scale detection is essential — FPN/PAN architecture
- Augmentation that preserves small objects (copy-paste, mosaic can help)
Deployment
- INT8 quantization can speed up 2-3x with ~1% mAP loss
- TensorRT is the standard for GPU inference optimization
- ONNX for cross-platform deployment
Key Papers
- Girshick (2015) — “Fast R-CNN” — https://arxiv.org/abs/1504.08083
- Ren et al. (2015) — “Faster R-CNN” — https://arxiv.org/abs/1506.01497
- Redmon et al. (2016) — “You Only Look Once” — https://arxiv.org/abs/1506.02640
- Liu et al. (2016) — “SSD: Single Shot MultiBox Detector” — https://arxiv.org/abs/1512.02325
- Carion et al. (2020) — “End-to-End Object Detection with Transformers” — https://arxiv.org/abs/2005.12872
- Jocher et al. (2023) — YOLOv8 — https://github.com/ultralytics/ultralytics
Links
- Image Classification — The single-label task that object detection builds on
- Image Segmentation — Instance and semantic segmentation
- Convolutional Neural Networks — The backbone architecture
- YOLO Tutorial — Hands-on object detection with YOLOv8
- Computer Vision Roadmap — Where object detection fits in the CV curriculum