Case Study - CV Pipeline Design
Three scenarios that require computer vision engineering judgment. For each: understand the problem, think about what you would do, then read the analysis.
Scenario 1: Drone surveillance feed
Problem
You are processing real-time video from an FPV drone flying at 50-100m altitude over a road network. The task: track all vehicles in the field of view, report their positions and trajectories, and flag any that stop or reverse direction.
Constraints:
- Jetson Orin Nano onboard (8GB, ~40 TOPS INT8)
- 720p video at 30 FPS
- Latency requirement: < 200ms end-to-end
- The drone moves, so the background moves too
- Vehicles appear small (20-60 pixels wide)
- Must work day and night (IR camera available)
Pause — what would you do?
Think about: detection model, tracker choice, how to handle camera motion, day/night strategy.
Analysis
Detection model: YOLO11n or YOLO11s quantized to INT8 for Jetson. The nano model runs ~100 FPS on Jetson Orin at 640px input. For small objects at altitude, you have two options:
- Run detection at full resolution with small anchor tuning
- Run SAHI (Slicing Aided Hyper Inference): tile the image into overlapping patches, detect per patch, merge results. Slower but catches small objects.
Recommendation: start with full-res YOLO11s-INT8. If small vehicle recall is too low, add SAHI with 2x2 tiling.
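If SAHI-style tiling does become necessary, the tile geometry is simple to reason about. A minimal sketch of 2x2 overlapping tiling (the 20% overlap and the helper name `make_tiles` are illustrative assumptions, not SAHI's actual API):

```python
def make_tiles(width, height, rows=2, cols=2, overlap=0.2):
    """Compute overlapping tile boxes (x1, y1, x2, y2) covering the frame.

    Each tile is padded by `overlap` (fraction of tile size) so objects
    straddling tile borders are fully visible in at least one tile.
    """
    tile_w, tile_h = width / cols, height / rows
    pad_w, pad_h = tile_w * overlap / 2, tile_h * overlap / 2
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x1 = max(0, int(c * tile_w - pad_w))
            y1 = max(0, int(r * tile_h - pad_h))
            x2 = min(width, int((c + 1) * tile_w + pad_w))
            y2 = min(height, int((r + 1) * tile_h + pad_h))
            tiles.append((x1, y1, x2, y2))
    return tiles

# 2x2 tiling of a 1280x720 frame; per-tile detections are shifted back
# by (x1, y1) into frame coordinates and merged with NMS afterwards.
tiles = make_tiles(1280, 720)
```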
Tracker: ByteTrack. It’s fast (< 1ms per frame), handles brief occlusions via low-confidence detections, and has no deep appearance model to slow things down. At this altitude, vehicles are small and appearances are not distinctive enough for Re-ID anyway.
Camera motion compensation: The drone moves, so IoU-based tracking fails (the same vehicle shifts position between frames even when stationary). Solution:
- Estimate inter-frame homography (cv2.findHomography on matched ORB features)
- Warp previous track positions to current frame coordinates
- Then do IoU matching in the stabilized coordinate system
This is critical. Without it, every drone camera pan creates false ID switches.
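The warp step can be sketched in NumPy. The 3x3 homography `H` is assumed to come from `cv2.findHomography` on matched ORB features as described above; only the application of `H` to previous track centers is shown here:

```python
import numpy as np

def warp_points(H, points):
    """Apply a 3x3 homography to an (N, 2) array of (x, y) points."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    warped = pts @ H.T
    return warped[:, :2] / warped[:, 2:3]  # divide out the projective scale

# Warp previous-frame track centers into the current frame before IoU
# matching, so camera motion does not look like object motion.
H = np.array([[1.0, 0.0, 12.0],   # example: pure 12px right, 5px down shift
              [0.0, 1.0, 5.0],
              [0.0, 0.0, 1.0]])
prev_centers = np.array([[100.0, 200.0], [640.0, 360.0]])
stabilized = warp_points(H, prev_centers)
```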
Day/night: Train on both RGB and IR images. YOLO handles IR well if included in training data. For night, switch to IR input automatically based on ambient light sensor.
Anomaly detection (stopped/reversing vehicles): compute vehicle velocity from track trajectory (smoothed over 10 frames). Flag velocity < threshold (stopped) or velocity direction reversal. This is simple once tracking works.
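A minimal sketch of that velocity check (the 10-frame window and the speed threshold are illustrative):

```python
import numpy as np

def check_anomaly(track_xy, stop_speed=0.5):
    """Classify a track as 'stopped' or 'reversed' from recent positions.

    track_xy: sequence of (x, y) positions, one per frame (already smoothed).
    stop_speed is in pixels/frame and is an illustrative threshold.
    """
    if len(track_xy) < 10:
        return None
    recent = np.asarray(track_xy[-10:], dtype=float)
    vel = np.diff(recent, axis=0)                     # per-frame displacement
    if np.linalg.norm(vel, axis=1).mean() < stop_speed:
        return "stopped"
    # Reversal: mean heading of the first half opposes the second half
    if np.dot(vel[:4].mean(axis=0), vel[5:].mean(axis=0)) < 0:
        return "reversed"
    return None
```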
Latency budget:
| Component | Time |
|---|---|
| Frame capture | ~5ms |
| YOLO detection | ~15ms |
| Homography estimation | ~3ms |
| ByteTrack | ~1ms |
| Anomaly check | ~1ms |
| Visualization | ~5ms |
| Total | ~30ms |
Well within the 200ms requirement.
Key tradeoff
Accuracy vs latency. If you need to catch every small vehicle, you need SAHI tiling, which multiplies detection time by 3-4x. For most operational scenarios, missing a few distant vehicles is acceptable; catching vehicles in the near field (where they’re larger) matters more.
Scenario 2: Satellite change detection
Problem
Monitor a military facility for new construction or significant changes. You receive Sentinel-2 imagery every 5 days at 10m resolution. The task: automatically flag meaningful changes and generate a report.
Constraints:
- 10m resolution (a truck is ~1 pixel, a building is 5-20 pixels)
- Cloud cover frequently obscures images
- Seasonal vegetation changes create false positives
- Need to distinguish construction from temporary changes (parked vehicles, shadows)
- Alert within 24 hours of image availability
Pause — what would you do?
Think about: change detection approach, handling clouds, handling seasonal variation, what counts as a “meaningful” change.
Analysis
Approach: multi-temporal composite + NDVI + structural change
- Cloud filtering: Use the Sentinel-2 SCL (Scene Classification Layer) to mask cloudy pixels. Require < 20% cloud cover over the AOI. When a cloudy image arrives, skip it and wait for the next clear acquisition.
- Temporal reference: Don’t compare image-to-image (too noisy). Build a rolling reference composite from the last 6 clear images (median composite). This smooths out ephemeral changes (shadows, temporary vehicles) and seasonal variation.
- Change detection: Compute pixel-level differences between the new image and the reference composite for:
  - RGB bands: catches structural changes (new buildings appear as bright objects)
  - NDVI: catches vegetation removal (site clearing shows as an NDVI drop)
  - SWIR: catches soil disturbance (disturbed soil has a different SWIR reflectance)
- Thresholding and filtering:
  - Require change in multiple bands (not just one) to reduce false positives
  - Apply a minimum-area filter: a change region must exceed 500 m² (5 pixels at 10m resolution) to be reported
  - Exclude known agricultural areas where seasonal change is expected
- Classification: For flagged change regions, classify the type:
  - NDVI drop + high brightness = likely construction
  - NDVI change only = likely seasonal or agricultural
  - New dark region = possibly water or shadow
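The change-mask and minimum-area steps above can be sketched in plain NumPy. Band names, thresholds, and the hand-rolled 4-connectivity labeling are illustrative stand-ins (a real pipeline would use `scipy.ndimage.label` or a GIS stack):

```python
import numpy as np

def change_regions(ref, new, min_pixels=5, diff_thresh=0.1):
    """Flag pixels that changed in BOTH brightness and NDVI, then drop
    regions smaller than min_pixels (500 m² at 10m GSD).

    ref / new: dicts with 'red', 'nir', 'brightness' float arrays in [0, 1].
    Returns the pixel counts of surviving change regions.
    """
    def ndvi(bands):
        return (bands["nir"] - bands["red"]) / (bands["nir"] + bands["red"] + 1e-6)

    bright_change = np.abs(new["brightness"] - ref["brightness"]) > diff_thresh
    ndvi_drop = (ndvi(ref) - ndvi(new)) > diff_thresh
    mask = bright_change & ndvi_drop          # require change in multiple bands

    # Simple 4-connectivity labeling (stand-in for scipy.ndimage.label)
    sizes, seen = [], np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                stack, count = [(i, j)], 0
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    count += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if count >= min_pixels:
                    sizes.append(count)
    return sizes
```

A 3x3 construction site passes the area filter; a single changed pixel (e.g. a parked truck) is suppressed.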
What about deep learning? At 10m resolution with limited training data for this specific facility, classical methods (thresholding, compositing) are more reliable and interpretable than a CNN. Deep learning change detection (e.g., siamese networks) shines at higher resolution (<1m) where you can detect individual vehicles and equipment.
Operational workflow:
New Sentinel-2 image available
→ Cloud check → if too cloudy, skip
→ Co-register with reference
→ Compute change bands
→ Threshold → candidate change regions
→ Filter by area, type, location
→ Generate report with before/after crops
→ Human analyst reviews flagged changes
Key tradeoff
Sensitivity vs false positive rate. High sensitivity catches real changes but also flags cloud shadows, seasonal effects, and sensor artifacts. A good operational system errs toward sensitivity and relies on human review to filter false positives — missing a real change is worse than reviewing a few extra images.
Scenario 3: Indoor activity monitoring
Problem
Fixed cameras in a building lobby. The task: detect and classify human activities in real-time. Activities of interest: normal walking, running, loitering (standing in one spot > 2 min), package deposit (person leaves an object), aggressive behavior (sudden movements, fighting).
Constraints:
- 4 cameras, each 1080p at 25 FPS
- Server with RTX 4070 (12GB) processing all feeds
- Privacy considerations: faces must not be stored
- Must work 24/7 with < 1% false alarm rate for “alert” events
- Lighting varies (lobby lighting, sunlight through windows, night mode)
Pause — what would you do?
Think about: pose-based vs appearance-based, how to detect each activity type, privacy approach, multi-camera handling.
Analysis
Architecture: pose-based activity recognition
This design uses pose estimation rather than raw appearance, for several reasons:
- Privacy: skeleton data is inherently anonymized (no faces, clothing, identity)
- Lighting robustness: skeleton detection is more robust to illumination than appearance-based classification
- Generalization: actions are defined by body movement, not visual appearance
Pipeline per camera:
- Detection + tracking: YOLO11s for person detection + ByteTrack. This gives tracked person bounding boxes with consistent IDs.
- Pose estimation: RTMPose (top-down, one person per crop), at ~10ms per person. With 4 cameras and ~10 people per camera that is ~400ms/frame if run sequentially, so either batch the crops efficiently or run pose only on every 3rd frame and interpolate between.
- Activity classification: For each tracked person, maintain a buffer of the last 60 frames (2.4 s at 25 FPS) of skeleton data and feed it into a lightweight temporal classifier.
| Activity | Detection method |
|---|---|
| Walking | Forward velocity > threshold from track trajectory |
| Running | High velocity + characteristic pose features (limb extension) |
| Loitering | Track position variance < threshold for > 120 seconds |
| Package deposit | Person carries object (wider silhouette) → object disappears from person → new static object in scene |
| Aggressive behavior | High skeleton velocity variance + specific pose patterns (raised arms, sudden movements) |
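The loitering rule from the table reduces to a variance check over a sliding window of track positions. A minimal sketch with illustrative thresholds (the pixel threshold depends on camera geometry, and the usage below shortens the 120 s window for clarity):

```python
import numpy as np
from collections import deque

class LoiterDetector:
    """Flags a track as loitering when its position barely moves for a
    full window (120 s at 25 FPS would be window=3000)."""

    def __init__(self, window=3000, max_std=15.0):
        self.window = window
        self.max_std = max_std          # pixels; depends on camera geometry
        self.positions = deque(maxlen=window)

    def update(self, xy):
        self.positions.append(xy)
        if len(self.positions) < self.window:
            return False                # not enough history yet
        pts = np.asarray(self.positions)
        return bool(pts.std(axis=0).max() < self.max_std)
```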
Package deposit detection is the hardest. It requires:
- Detecting objects carried by people (challenging — they’re small and partially occluded)
- Detecting when an object becomes “abandoned” (not carried by anyone)
- Approach: background subtraction to detect new static objects, cross-reference with person tracks
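The cross-referencing step can be sketched as: a pixel that stays foreground for many consecutive frames and is not covered by any person box is a candidate abandoned object. A minimal NumPy sketch (the `persist` window is an illustrative assumption; the foreground masks would come from a background subtractor):

```python
import numpy as np

def abandoned_mask(fg_masks, person_boxes, persist=25):
    """Pixels foreground in all of the last `persist` frames, excluding
    areas covered by a tracked person -> candidate abandoned objects.

    fg_masks: list of HxW bool arrays (e.g. from background subtraction).
    person_boxes: (x1, y1, x2, y2) boxes from the person tracker.
    """
    static = np.logical_and.reduce(fg_masks[-persist:])  # stayed foreground
    for x1, y1, x2, y2 in person_boxes:
        static[y1:y2, x1:x2] = False   # it's a person, not a package
    return static
```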
Privacy implementation:
- Never store raw video to disk
- Process in real-time, store only skeleton data and event logs
- Face blurring on any saved clips (for review of flagged events)
- GDPR compliance: clear signage, data retention policy
Multi-camera considerations:
- If cameras have overlapping FOVs: re-identification across cameras (person in camera 1 = person in camera 3) using skeleton similarity + spatial reasoning
- If non-overlapping: treat independently, flag events per camera
Compute budget (per frame, 4 cameras):
| Component | Time |
|---|---|
| YOLO detection (4 cameras, batched) | ~15ms |
| ByteTrack (4 cameras) | ~4ms |
| RTMPose (~40 people, batched) | ~100ms |
| Activity classification | ~5ms |
| Total | ~124ms |
This runs at ~8 FPS on a single GPU. If you need 25 FPS: process detection at full rate, pose every 3rd frame (interpolate skeleton), classify continuously from skeleton buffer.
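The skip-and-interpolate trick amounts to linear interpolation of keypoints between the frames where pose actually ran. A minimal sketch (the 17-keypoint COCO layout is an assumption):

```python
import numpy as np

def interpolate_skeletons(kp_a, kp_b, num_between):
    """Linearly interpolate (K, 2) keypoint arrays for skipped frames.

    Pose runs on frames t and t + num_between + 1; this fills the gap so
    the activity classifier still sees a skeleton on every frame.
    """
    steps = np.linspace(0, 1, num_between + 2)[1:-1]  # exclude the endpoints
    return [kp_a + t * (kp_b - kp_a) for t in steps]

# Pose on every 3rd frame -> 2 skipped frames to fill in between.
a = np.zeros((17, 2))                 # COCO-style 17-keypoint skeleton
b = np.full((17, 2), 3.0)
mid = interpolate_skeletons(a, b, num_between=2)
```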
Key tradeoff
Pose-based is more private and generalizes better. Appearance-based (3D CNN on raw video) is more accurate for subtle activities. For this scenario, privacy requirements make pose-based the right choice, and the target activities (loitering, running, aggression) have strong skeletal signatures.
General design principles
- Start with the constraint that hurts most: latency? compute? accuracy? privacy? That dictates architecture.
- Decompose into stages: detection → tracking → classification. Each stage can be debugged and improved independently.
- Use the simplest approach that meets requirements: don’t use a 3D CNN when skeleton + rules suffice.
- Measure early: get ground truth, compute metrics, find failure modes. Most time should be spent on the 20% of cases that fail.
- Edge cases kill you: the system works 95% of the time in the demo. The remaining 5% (rain, night, crowd, camera shake) is where the engineering effort goes.
Self-test questions
- Why is camera motion compensation critical for drone-based tracking?
- When would you choose classical change detection over deep learning for satellite imagery?
- What are the advantages of pose-based activity recognition over appearance-based for surveillance?
- How would you handle the compute budget if the system can’t run at full frame rate?
- What is the difference between sensitivity and false positive rate, and which matters more for security applications?
Links
- Multi-Object Tracking — tracking theory and algorithms
- Pose Estimation — skeleton-based activity recognition
- Tutorial - Aerial Image Analysis — satellite change detection
- Tutorial - Object Tracking Pipeline — tracking implementation
- Video Understanding — temporal analysis approaches
- Object Detection — detection stage
- Optical Flow — motion estimation
- 3D Vision and Depth — depth for spatial reasoning