Case Study - CV Pipeline Design
Three scenarios that require computer vision engineering judgment. For each: understand the problem, think about what you would do, then read the analysis.
Scenario 1: Drone surveillance feed
Problem
You are processing real-time video from an FPV drone flying at 50-100m altitude over a road network. The task: track all vehicles in the field of view, report their positions and trajectories, and flag any that stop or reverse direction.
Constraints:
- Jetson Orin Nano onboard (8GB, ~40 TOPS INT8)
- 720p video at 30 FPS
- Latency requirement: < 200ms end-to-end
- The drone moves, so the background moves too
- Vehicles appear small (20-60 pixels wide)
- Must work day and night (IR camera available)
Pause — what would you do?
Think about: detection model, tracker choice, how to handle camera motion, day/night strategy.
Analysis
Detection model: YOLO11n or YOLO11s quantized to INT8 for Jetson. The nano model runs ~100 FPS on Jetson Orin at 640px input. For small objects at altitude, you have two options:
- Run detection at full resolution with small anchor tuning
- Run SAHI (Slicing Aided Hyper Inference): tile the image into overlapping patches, detect per patch, merge results. Slower but catches small objects.
Recommendation: start with full-res YOLO11s-INT8. If small vehicle recall is too low, add SAHI with 2x2 tiling.
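If SAHI-style tiling does become necessary, the tile geometry is simple to reason about. A minimal sketch of 2x2 overlapping tiling (the 20% overlap and the helper name `make_tiles` are illustrative assumptions, not SAHI's actual API):

```python
def make_tiles(width, height, rows=2, cols=2, overlap=0.2):
    """Compute overlapping tile boxes (x1, y1, x2, y2) covering the frame.

    Each tile is padded by `overlap` (fraction of tile size) so objects
    straddling tile borders are fully visible in at least one tile.
    """
    tile_w, tile_h = width / cols, height / rows
    pad_w, pad_h = tile_w * overlap / 2, tile_h * overlap / 2
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x1 = max(0, int(c * tile_w - pad_w))
            y1 = max(0, int(r * tile_h - pad_h))
            x2 = min(width, int((c + 1) * tile_w + pad_w))
            y2 = min(height, int((r + 1) * tile_h + pad_h))
            tiles.append((x1, y1, x2, y2))
    return tiles

# 2x2 tiling of a 1280x720 frame; per-tile detections are shifted back
# by (x1, y1) into frame coordinates and merged with NMS afterwards.
tiles = make_tiles(1280, 720)
```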
Tracker: ByteTrack. It’s fast (< 1ms per frame), handles brief occlusions via low-confidence detections, and has no deep appearance model to slow things down. At this altitude, vehicles are small and appearances are not distinctive enough for Re-ID anyway.
Camera motion compensation: The drone moves, so IoU-based tracking fails (the same vehicle shifts position between frames even when stationary). Solution:
- Estimate inter-frame homography (cv2.findHomography on matched ORB features)
- Warp previous track positions to current frame coordinates
- Then do IoU matching in the stabilized coordinate system
This is critical. Without it, every drone camera pan creates false ID switches.
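The warp step can be sketched in NumPy. The 3x3 homography `H` is assumed to come from `cv2.findHomography` on matched ORB features as described above; only the application of `H` to previous track centers is shown here:

```python
import numpy as np

def warp_points(H, points):
    """Apply a 3x3 homography to an (N, 2) array of (x, y) points."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    warped = pts @ H.T
    return warped[:, :2] / warped[:, 2:3]  # divide out the projective scale

# Warp previous-frame track centers into the current frame before IoU
# matching, so camera motion does not look like object motion.
H = np.array([[1.0, 0.0, 12.0],   # example: pure 12px right, 5px down shift
              [0.0, 1.0, 5.0],
              [0.0, 0.0, 1.0]])
prev_centers = np.array([[100.0, 200.0], [640.0, 360.0]])
stabilized = warp_points(H, prev_centers)
```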
Day/night: Train on both RGB and IR images. YOLO handles IR well if included in training data. For night, switch to IR input automatically based on ambient light sensor.
Anomaly detection (stopped/reversing vehicles): compute vehicle velocity from track trajectory (smoothed over 10 frames). Flag velocity < threshold (stopped) or velocity direction reversal. This is simple once tracking works.
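A minimal sketch of that velocity check (the 10-frame window and the speed threshold are illustrative):

```python
import numpy as np

def check_anomaly(track_xy, stop_speed=0.5):
    """Classify a track as 'stopped' or 'reversed' from recent positions.

    track_xy: sequence of (x, y) positions, one per frame (already smoothed).
    stop_speed is in pixels/frame and is an illustrative threshold.
    """
    if len(track_xy) < 10:
        return None
    recent = np.asarray(track_xy[-10:], dtype=float)
    vel = np.diff(recent, axis=0)                     # per-frame displacement
    if np.linalg.norm(vel, axis=1).mean() < stop_speed:
        return "stopped"
    # Reversal: mean heading of the first half opposes the second half
    if np.dot(vel[:4].mean(axis=0), vel[5:].mean(axis=0)) < 0:
        return "reversed"
    return None
```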
Latency budget:
| Component | Time |
|---|---|
| Frame capture | ~5ms |
| YOLO detection | ~15ms |
| Homography estimation | ~3ms |
| ByteTrack | ~1ms |
| Anomaly check | ~1ms |
| Visualization | ~5ms |
| Total | ~30ms |
Well within the 200ms requirement.
Key tradeoff
Accuracy vs latency. If you need to catch every small vehicle, you need SAHI tiling, which multiplies detection time by 3-4x. For most operational scenarios, missing a few distant vehicles is acceptable; catching vehicles in the near field (where they’re larger) matters more.
Scenario 2: Satellite change detection
Problem
Monitor a military facility for new construction or significant changes. You receive Sentinel-2 imagery every 5 days at 10m resolution. The task: automatically flag meaningful changes and generate a report.
Constraints:
- 10m resolution (a truck is ~1 pixel, a building is 5-20 pixels)
- Cloud cover frequently obscures images
- Seasonal vegetation changes create false positives
- Need to distinguish construction from temporary changes (parked vehicles, shadows)
- Alert within 24 hours of image availability
Pause — what would you do?
Think about: change detection approach, handling clouds, handling seasonal variation, what counts as a “meaningful” change.
Analysis
Approach: multi-temporal composite + NDVI + structural change
- Cloud filtering: Use the Sentinel-2 SCL (Scene Classification Layer) to mask cloudy pixels. Require < 20% cloud cover over the AOI. When a cloudy image arrives, skip it and wait for the next clear acquisition.
- Temporal reference: Don’t compare image-to-image (too noisy). Build a rolling reference composite from the last 6 clear images (median composite). This smooths out ephemeral changes (shadows, temporary vehicles) and seasonal variation.
- Change detection: Compute pixel-level differences between the new image and the reference composite for:
  - RGB bands: catches structural changes (new buildings appear as bright objects)
  - NDVI: catches vegetation removal (site clearing shows as an NDVI drop)
  - SWIR: catches soil disturbance (disturbed soil has a different SWIR reflectance)
- Thresholding and filtering:
  - Require change in multiple bands (not just one) to reduce false positives
  - Apply a minimum-area filter: a change region must exceed 500 m² (5 pixels at 10m resolution) to be reported
  - Exclude known agricultural areas where seasonal change is expected
- Classification: For flagged change regions, classify the type:
  - NDVI drop + high brightness = likely construction
  - NDVI change only = likely seasonal or agricultural
  - New dark region = possibly water or shadow
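The change-mask and minimum-area steps above can be sketched in plain NumPy. Band names, thresholds, and the hand-rolled 4-connectivity labeling are illustrative stand-ins (a real pipeline would use `scipy.ndimage.label` or a GIS stack):

```python
import numpy as np

def change_regions(ref, new, min_pixels=5, diff_thresh=0.1):
    """Flag pixels that changed in BOTH brightness and NDVI, then drop
    regions smaller than min_pixels (500 m² at 10m GSD).

    ref / new: dicts with 'red', 'nir', 'brightness' float arrays in [0, 1].
    Returns the pixel counts of surviving change regions.
    """
    def ndvi(bands):
        return (bands["nir"] - bands["red"]) / (bands["nir"] + bands["red"] + 1e-6)

    bright_change = np.abs(new["brightness"] - ref["brightness"]) > diff_thresh
    ndvi_drop = (ndvi(ref) - ndvi(new)) > diff_thresh
    mask = bright_change & ndvi_drop          # require change in multiple bands

    # Simple 4-connectivity labeling (stand-in for scipy.ndimage.label)
    sizes, seen = [], np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                stack, count = [(i, j)], 0
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    count += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if count >= min_pixels:
                    sizes.append(count)
    return sizes
```

A 3x3 construction site passes the area filter; a single changed pixel (e.g. a parked truck) is suppressed.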
What about deep learning? At 10m resolution with limited training data for this specific facility, classical methods (thresholding, compositing) are more reliable and interpretable than a CNN. Deep learning change detection (e.g., siamese networks) shines at higher resolution (<1m) where you can detect individual vehicles and equipment.
Operational workflow:
New Sentinel-2 image available
→ Cloud check → if too cloudy, skip
→ Co-register with reference
→ Compute change bands
→ Threshold → candidate change regions
→ Filter by area, type, location
→ Generate report with before/after crops
→ Human analyst reviews flagged changes
Key tradeoff
Sensitivity vs false positive rate. High sensitivity catches real changes but also flags cloud shadows, seasonal effects, and sensor artifacts. A good operational system errs toward sensitivity and relies on human review to filter false positives — missing a real change is worse than reviewing a few extra images.
Scenario 3: Indoor activity monitoring
Problem
Fixed cameras in a building lobby. The task: detect and classify human activities in real-time. Activities of interest: normal walking, running, loitering (standing in one spot > 2 min), package deposit (person leaves an object), aggressive behavior (sudden movements, fighting).
Constraints:
- 4 cameras, each 1080p at 25 FPS
- Server with RTX 4070 (12GB) processing all feeds
- Privacy considerations: faces must not be stored
- Must work 24/7 with < 1% false alarm rate for “alert” events
- Lighting varies (lobby lighting, sunlight through windows, night mode)
Pause — what would you do?
Think about: pose-based vs appearance-based, how to detect each activity type, privacy approach, multi-camera handling.
Analysis
Architecture: pose-based activity recognition
This design uses pose estimation rather than raw appearance, for several reasons:
- Privacy: skeleton data is inherently anonymized (no faces, clothing, identity)
- Lighting robustness: skeleton detection is more robust to illumination than appearance-based classification
- Generalization: actions are defined by body movement, not visual appearance
Pipeline per camera:
- Detection + tracking: YOLO11s for person detection + ByteTrack. This gives tracked person bounding boxes with consistent IDs.
- Pose estimation: RTMPose (top-down, one person per crop), at ~10ms per person. With 4 cameras and ~10 people per camera that is ~400ms/frame if run sequentially, so either batch the crops efficiently or run pose only on every 3rd frame and interpolate between.
- Activity classification: For each tracked person, maintain a buffer of the last 60 frames (2.4 s at 25 FPS) of skeleton data and feed it into a lightweight temporal classifier.
| Activity | Detection method |
|---|---|
| Walking | Forward velocity > threshold from track trajectory |
| Running | High velocity + characteristic pose features (limb extension) |
| Loitering | Track position variance < threshold for > 120 seconds |
| Package deposit | Person carries object (wider silhouette) → object disappears from person → new static object in scene |
| Aggressive behavior | High skeleton velocity variance + specific pose patterns (raised arms, sudden movements) |
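The loitering rule from the table reduces to a variance check over a sliding window of track positions. A minimal sketch with illustrative thresholds (the pixel threshold depends on camera geometry, and the usage below shortens the 120 s window for clarity):

```python
import numpy as np
from collections import deque

class LoiterDetector:
    """Flags a track as loitering when its position barely moves for a
    full window (120 s at 25 FPS would be window=3000)."""

    def __init__(self, window=3000, max_std=15.0):
        self.window = window
        self.max_std = max_std          # pixels; depends on camera geometry
        self.positions = deque(maxlen=window)

    def update(self, xy):
        self.positions.append(xy)
        if len(self.positions) < self.window:
            return False                # not enough history yet
        pts = np.asarray(self.positions)
        return bool(pts.std(axis=0).max() < self.max_std)
```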
Package deposit detection is the hardest. It requires:
- Detecting objects carried by people (challenging — they’re small and partially occluded)
- Detecting when an object becomes “abandoned” (not carried by anyone)
- Approach: background subtraction to detect new static objects, cross-reference with person tracks
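The cross-referencing step can be sketched as: a pixel that stays foreground for many consecutive frames and is not covered by any person box is a candidate abandoned object. A minimal NumPy sketch (the `persist` window is an illustrative assumption; the foreground masks would come from a background subtractor):

```python
import numpy as np

def abandoned_mask(fg_masks, person_boxes, persist=25):
    """Pixels foreground in all of the last `persist` frames, excluding
    areas covered by a tracked person -> candidate abandoned objects.

    fg_masks: list of HxW bool arrays (e.g. from background subtraction).
    person_boxes: (x1, y1, x2, y2) boxes from the person tracker.
    """
    static = np.logical_and.reduce(fg_masks[-persist:])  # stayed foreground
    for x1, y1, x2, y2 in person_boxes:
        static[y1:y2, x1:x2] = False   # it's a person, not a package
    return static
```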
Privacy implementation:
- Never store raw video to disk
- Process in real-time, store only skeleton data and event logs
- Face blurring on any saved clips (for review of flagged events)
- GDPR compliance: clear signage, data retention policy
Multi-camera considerations:
- If cameras have overlapping FOVs: re-identification across cameras (person in camera 1 = person in camera 3) using skeleton similarity + spatial reasoning
- If non-overlapping: treat independently, flag events per camera
Compute budget (per frame, 4 cameras):
| Component | Time |
|---|---|
| YOLO detection (4 cameras, batched) | ~15ms |
| ByteTrack (4 cameras) | ~4ms |
| RTMPose (~40 people, batched) | ~100ms |
| Activity classification | ~5ms |
| Total | ~124ms |
This runs at ~8 FPS on a single GPU. If you need 25 FPS: process detection at full rate, pose every 3rd frame (interpolate skeleton), classify continuously from skeleton buffer.
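The skip-and-interpolate trick amounts to linear interpolation of keypoints between the frames where pose actually ran. A minimal sketch (the 17-keypoint COCO layout is an assumption):

```python
import numpy as np

def interpolate_skeletons(kp_a, kp_b, num_between):
    """Linearly interpolate (K, 2) keypoint arrays for skipped frames.

    Pose runs on frames t and t + num_between + 1; this fills the gap so
    the activity classifier still sees a skeleton on every frame.
    """
    steps = np.linspace(0, 1, num_between + 2)[1:-1]  # exclude the endpoints
    return [kp_a + t * (kp_b - kp_a) for t in steps]

# Pose on every 3rd frame -> 2 skipped frames to fill in between.
a = np.zeros((17, 2))                 # COCO-style 17-keypoint skeleton
b = np.full((17, 2), 3.0)
mid = interpolate_skeletons(a, b, num_between=2)
```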
Key tradeoff
Pose-based is more private and generalizes better. Appearance-based (3D CNN on raw video) is more accurate for subtle activities. For this scenario, privacy requirements make pose-based the right choice, and the target activities (loitering, running, aggression) have strong skeletal signatures.
General design principles
- Start with the constraint that hurts most: latency? compute? accuracy? privacy? That dictates architecture.
- Decompose into stages: detection → tracking → classification. Each stage can be debugged and improved independently.
- Use the simplest approach that meets requirements: don’t use a 3D CNN when skeleton + rules suffice.
- Measure early: get ground truth, compute metrics, find failure modes. Most time should be spent on the 20% of cases that fail.
- Edge cases kill you: the system works 95% of the time in the demo. The remaining 5% (rain, night, crowd, camera shake) is where the engineering effort goes.
Self-test questions
- Why is camera motion compensation critical for drone-based tracking?
- When would you choose classical change detection over deep learning for satellite imagery?
- What are the advantages of pose-based activity recognition over appearance-based for surveillance?
- How would you handle the compute budget if the system can’t run at full frame rate?
- What is the difference between sensitivity and false positive rate, and which matters more for security applications?
Links
- Multi-Object Tracking — tracking theory and algorithms
- Pose Estimation — skeleton-based activity recognition
- Tutorial - Aerial Image Analysis — satellite change detection
- Tutorial - Object Tracking Pipeline — tracking implementation
- Video Understanding — temporal analysis approaches
- Object Detection — detection stage
- Optical Flow — motion estimation
- 3D Vision and Depth — depth for spatial reasoning