Video Understanding
What
Understand temporal dynamics in video — not just what’s visible in a single frame, but what’s happening over time. A frame tells you there’s a person near a car. A video tells you the person is breaking into it.
This requires modeling the temporal dimension: how things change, move, and interact across frames.
Levels of video understanding
| Level | Scope | Example task |
|---|---|---|
| Frame-level | Single frame | Object detection per frame |
| Clip-level | Short segment (2-16 sec) | Action recognition (running, fighting) |
| Video-level | Full video (minutes-hours) | Activity detection, summarization |
Most current models work at the clip level. Long-form video understanding remains an open problem.
Architectures
Two-stream networks (2014)
The foundational idea: video has two complementary information streams.
- Spatial stream: single RGB frame → what’s in the scene (appearance)
- Temporal stream: stacked optical flow frames → how things move (motion)
Each stream is a standard CNN. Their predictions are fused (late fusion: average/concatenate class scores).
         ┌─ Spatial CNN (RGB frame) ──────┐
Video ──>│                                │──> Fused prediction
         └─ Temporal CNN (optical flow) ──┘
Why it works: appearance and motion are complementary. A person standing still looks different from a person running (spatial). But the spatial stream alone can’t distinguish “waving” from “stretching” — the motion pattern matters.
See Optical Flow for computing the temporal stream input.
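The two-stream idea with late fusion can be sketched in a few lines. This is a toy sketch, not the original architecture: the stream depths, channel counts, and the 10-frame flow stack are arbitrary choices here.

```python
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Toy two-stream network with late fusion (averaged class scores)."""

    def __init__(self, n_classes=10, n_flow_frames=10):
        super().__init__()
        # Spatial stream: one RGB frame (3 channels)
        self.spatial = self._make_stream(in_ch=3, n_classes=n_classes)
        # Temporal stream: stacked optical flow (2 channels per frame: dx, dy)
        self.temporal = self._make_stream(in_ch=2 * n_flow_frames, n_classes=n_classes)

    @staticmethod
    def _make_stream(in_ch, n_classes):
        return nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_classes),
        )

    def forward(self, rgb, flow):
        # Late fusion: average the two streams' class scores
        return (self.spatial(rgb) + self.temporal(flow)) / 2

model = TwoStreamNet()
rgb = torch.randn(1, 3, 112, 112)    # single RGB frame
flow = torch.randn(1, 20, 112, 112)  # 10 stacked flow fields (dx, dy each)
scores = model(rgb, flow)
print(scores.shape)  # torch.Size([1, 10])
```

Late fusion keeps the two streams fully independent until the final scores, which is why each can be a standard 2D CNN.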
3D convolutions: C3D, I3D
Instead of processing frames independently and fusing later, convolve across time directly.
A 2D conv kernel is (k, k) — spatial only. A 3D conv kernel is (t, k, k) — spatiotemporal. It captures short-range temporal patterns within the convolution window.
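The shape bookkeeping can be checked directly in PyTorch (channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

# 2D conv: kernel (k, k), input (B, C, H, W) -- spatial only
conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)
x2d = torch.randn(1, 3, 112, 112)
print(conv2d(x2d).shape)  # torch.Size([1, 8, 112, 112])

# 3D conv: kernel (t, k, k), input (B, C, T, H, W) -- spatiotemporal
conv3d = nn.Conv3d(3, 8, kernel_size=(3, 3, 3), padding=1)
x3d = torch.randn(1, 3, 16, 112, 112)  # 16-frame clip
print(conv3d(x3d).shape)  # torch.Size([1, 8, 16, 112, 112])
```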
| Model | Architecture | Key idea |
|---|---|---|
| C3D (2015) | 3D VGG-like | Popularized 3D CNNs for video; fixed 16-frame clips |
| I3D (2017) | Inflated Inception | "Inflate" pretrained 2D weights to 3D. Much better than training from scratch |
| SlowFast (2019) | Two pathways | Slow path: low frame rate, spatial detail. Fast path: high frame rate, temporal detail |
| R(2+1)D | Factored 3D conv | Split 3D conv into spatial 2D + temporal 1D. Easier to optimize |
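The R(2+1)D factorization can be sketched as two stacked `Conv3d` layers: a (1, k, k) spatial conv followed by a (t, 1, 1) temporal conv, with a nonlinearity in between. This is a minimal sketch, not the paper's exact block (which also chooses the mid-channel count to match the parameter budget of a full 3D conv).

```python
import torch
import torch.nn as nn

class R2Plus1dBlock(nn.Module):
    """Factored (2+1)D conv: spatial 2D then temporal 1D, with a ReLU
    between them (the extra nonlinearity is part of why it optimizes well).
    """

    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch  # simplification; the paper derives mid_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU()
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

block = R2Plus1dBlock(3, 8)
x = torch.randn(1, 3, 16, 112, 112)
print(block(x).shape)  # torch.Size([1, 8, 16, 112, 112])
```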
Video transformers
Transformers naturally handle sequences, making them a good fit for video.
| Model | Key idea |
|---|---|
| TimeSformer (2021) | Divided attention: spatial attention + temporal attention (not full spatiotemporal) |
| ViViT (2021) | Various factorizations of spatiotemporal attention |
| VideoMAE (2022) | Masked autoencoder for video: mask 90% of spatiotemporal patches, reconstruct. Excellent self-supervised pretraining |
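Divided attention, TimeSformer-style, is mostly reshape bookkeeping: attend over the patches within each frame, then over the frames at each patch position. A rough sketch (dimensions are illustrative, not the paper's, and residual connections/MLPs are omitted):

```python
import torch
import torch.nn as nn

class DividedAttention(nn.Module):
    """Spatial attention within each frame, then temporal attention
    across frames at each spatial position."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, N, D) -- T frames, N patches per frame
        B, T, N, D = x.shape
        # Spatial: each frame's N patches attend to each other
        s = x.reshape(B * T, N, D)
        s, _ = self.spatial_attn(s, s, s)
        x = s.reshape(B, T, N, D)
        # Temporal: each patch position attends across the T frames
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t, _ = self.temporal_attn(t, t, t)
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)

attn = DividedAttention()
x = torch.randn(2, 8, 49, 64)  # 8 frames, 7x7 patches, dim 64
print(attn(x).shape)  # torch.Size([2, 8, 49, 64])
```

The payoff: full spatiotemporal attention over T·N tokens costs O((T·N)²), while divided attention costs O(T·N² + N·T²).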
# Action recognition with torchvision video model
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights)
model.eval()

preprocess = weights.transforms()

# Input: video clip as tensor (T, C, H, W) -- e.g., 16 frames
# For demo, create random input (randint's upper bound is exclusive)
clip = torch.randint(0, 256, (16, 3, 112, 112), dtype=torch.uint8)
input_tensor = preprocess(clip).unsqueeze(0)  # (1, C, T, H, W)

with torch.no_grad():
    output = model(input_tensor)

pred = output.argmax(1).item()
categories = weights.meta["categories"]
print(f"Predicted action: {categories[pred]}")

Loading real video clips
import cv2
import numpy as np
import torch

def load_video_clip(path, n_frames=16, size=(112, 112)):
    """Load a video clip as a tensor for action recognition.

    Returns: (n_frames, 3, H, W) uint8 tensor.
    """
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sample n_frames evenly spaced
    indices = np.linspace(0, total - 1, n_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.resize(frame, size)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    # (T, H, W, C) -> (T, C, H, W)
    clip = np.stack(frames)
    clip = torch.from_numpy(clip).permute(0, 3, 1, 2)
    return clip

Temporal action detection
Action recognition classifies short clips (“this is a fight”). Temporal action detection locates when actions occur in long untrimmed video (“fight from 02:14 to 02:31”).
Pipeline:
- Extract features per snippet (e.g., I3D features for 16-frame windows)
- Temporal model predicts action classes + boundaries across the full timeline
- Post-processing: threshold, merge adjacent segments, NMS
This is like object detection but in 1D (time axis) instead of 2D (spatial).
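The post-processing step can be sketched in plain NumPy: threshold per-snippet scores and merge adjacent above-threshold snippets into segments. The threshold and snippet stride are illustrative values, and real pipelines add per-class handling and 1D NMS on top.

```python
import numpy as np

def scores_to_segments(scores, threshold=0.5, snippet_sec=0.64):
    """Threshold per-snippet action scores and merge runs of
    above-threshold snippets into (start_sec, end_sec) segments.
    snippet_sec = duration of one snippet (e.g., 16 frames at 25 fps).
    """
    segments = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                      # segment opens
        elif s < threshold and start is not None:
            segments.append((start * snippet_sec, i * snippet_sec))
            start = None                   # segment closes
    if start is not None:                  # segment runs to the end
        segments.append((start * snippet_sec, len(scores) * snippet_sec))
    return segments

scores = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9, 0.3])
print(scores_to_segments(scores))  # two segments, in seconds
```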
Anomaly detection in video
Detect unusual events in surveillance video without explicit labels for anomalies (because you can’t enumerate all possible anomalies).
Approach: learn normal, detect deviations
- Train on normal video only: learn what normal looks like (autoencoder, prediction model)
- At test time: high reconstruction error or prediction error = anomaly
import torch
import torch.nn as nn

class FramePredictorSimple(nn.Module):
    """Predict next frame from previous frames.

    High prediction error = something unusual is happening.
    """

    def __init__(self, n_input_frames=4):
        super().__init__()
        # Simple conv model: input is n stacked grayscale frames
        self.encoder = nn.Sequential(
            nn.Conv2d(n_input_frames, 32, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        """x: (batch, n_input_frames, H, W) grayscale frames.

        Returns: predicted next frame (batch, 1, H, W).
        """
        return self.decoder(self.encoder(x))

# Training: minimize MSE between predicted and actual next frame on normal video
# Inference: compute MSE per frame. Spike in error = anomaly

Methods:
- Reconstruction-based: autoencoder trained on normal video. Anomalies have high reconstruction error.
- Prediction-based: predict next frame from previous. Anomalies are unpredictable.
- Feature-based: extract features per frame, model normal feature distribution, flag outliers (One-Class SVM, GMM).
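The feature-based approach can be sketched with scikit-learn's One-Class SVM. In practice the features would come from a pretrained CNN; the synthetic Gaussians here are stand-ins, and `nu=0.05` is an illustrative choice (it bounds the fraction of training points treated as outliers).

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_feats = rng.normal(0.0, 1.0, size=(200, 64))  # features from normal video
test_feats = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 64)),   # more normal frames
    rng.normal(6.0, 1.0, size=(5, 64)),   # shifted distribution = anomalous frames
])

# Fit the normal feature distribution only
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(normal_feats)

pred = ocsvm.predict(test_feats)  # +1 = normal, -1 = anomaly
print(pred)
```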
Applications
- Surveillance activity detection: identify specific activities (package delivery, loitering, intrusion) in camera feeds
- Drone video analysis: classify terrain, detect activities, assess damage from aerial video
- Anomaly detection: detect unusual events in industrial, traffic, or security cameras
- Sports analytics: track plays, classify actions, generate highlights
- Content moderation: detect violent or prohibited content in video
Defense/security specific
- UAV ISR: automated analysis of surveillance drone video. Flag events of interest, reducing operator workload
- Pattern of life analysis: video-level understanding of routines at a location. Deviations may indicate activity
- Force protection: real-time detection of threats (vehicle approach, perimeter breach) from fixed cameras
Building a simple video classifier
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameAverageClassifier(nn.Module):
    """Simplest possible video classifier:
    extract features per frame with CNN, average them, classify.
    """

    def __init__(self, n_classes):
        super().__init__()
        backbone = resnet18(weights="DEFAULT")
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # remove FC
        self.classifier = nn.Linear(512, n_classes)

    def forward(self, clip):
        """clip: (batch, T, C, H, W)"""
        B, T, C, H, W = clip.shape
        # Process all frames at once
        frames = clip.reshape(B * T, C, H, W)
        feats = self.features(frames).squeeze(-1).squeeze(-1)  # (B*T, 512)
        feats = feats.reshape(B, T, -1)  # (B, T, 512)
        # Average pool across time
        pooled = feats.mean(dim=1)  # (B, 512)
        return self.classifier(pooled)

# This is a baseline. Better: use temporal attention or 3D convolutions.

Self-test questions
- Why is a single RGB frame insufficient for many video understanding tasks? Give an example.
- What is the key difference between two-stream networks and 3D convolutions?
- How does temporal action detection differ from action recognition?
- Why is anomaly detection typically trained on normal data only?
- What is the advantage of SlowFast’s two-pathway design over a single 3D CNN?
Exercises
- Action recognition: Use torchvision’s R3D-18 model to classify 5 video clips (download from Kinetics or record your own). Report predicted class and confidence.
- Frame-level baseline: Build the FrameAverageClassifier above, train on UCF-101 (small subset), compare accuracy with a proper 3D model.
- Anomaly detection: Record 5 minutes of “normal” activity from a webcam. Train a simple frame predictor. Then introduce anomalies (sudden movement, new object) and plot prediction error over time.
Links
- Optical Flow — temporal stream for two-stream networks
- Transformers — video transformers extend ViT to spatiotemporal
- Multi-Object Tracking — tracking provides identity across video
- Pose Estimation — skeleton-based action recognition
- Image Classification — frame-level understanding baseline
- Case Study - CV Pipeline Design — choosing video analysis approaches