Video Understanding

What

Understand temporal dynamics in video — not just what’s visible in a single frame, but what’s happening over time. A frame tells you there’s a person near a car. A video tells you the person is breaking into it.

This requires modeling the temporal dimension: how things change, move, and interact across frames.

Levels of video understanding

Level        Scope                        Example task
Frame-level  Single frame                 Object detection per frame
Clip-level   Short segment (2-16 sec)     Action recognition (running, fighting)
Video-level  Full video (minutes-hours)   Activity detection, summarization

Most current models work at the clip level. Long-form video understanding remains an open problem.

Architectures

Two-stream networks (2014)

The foundational idea: video has two complementary information streams.

  1. Spatial stream: single RGB frame → what’s in the scene (appearance)
  2. Temporal stream: stacked optical flow frames → how things move (motion)

Each stream is a standard CNN. Their predictions are fused (late fusion: average/concatenate class scores).

          ┌─ Spatial CNN (RGB frame) ──────┐
Video ──>│                                  │──> Fused prediction
          └─ Temporal CNN (optical flow) ──┘

Why it works: appearance and motion are complementary. A person standing still looks different from a person running (spatial). But the spatial stream alone can’t distinguish “waving” from “stretching” — the motion pattern matters.

See Optical Flow for computing the temporal stream input.
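The late-fusion step can be sketched in a few lines. This is a minimal illustration, not a full two-stream implementation: the two `nn.Sequential` models below are tiny stand-ins for the real spatial and temporal CNNs, and the shapes (32x32 input, 5 flow fields, 10 classes) are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Stand-ins for the two stream networks; in practice each would be a full
# 2D CNN (e.g. VGG or ResNet) over RGB or stacked optical flow.
spatial_cnn = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
temporal_cnn = nn.Sequential(nn.Flatten(), nn.Linear(10 * 32 * 32, 10))

rgb_frame = torch.randn(1, 3, 32, 32)    # one RGB frame (appearance)
flow_stack = torch.randn(1, 10, 32, 32)  # 5 flow fields x 2 channels (dx, dy)

# Late fusion: average the per-stream class probabilities
p_spatial = spatial_cnn(rgb_frame).softmax(dim=1)
p_temporal = temporal_cnn(flow_stack).softmax(dim=1)
p_fused = (p_spatial + p_temporal) / 2
pred = p_fused.argmax(dim=1)
```

Averaging softmax outputs is the simplest fusion; concatenating features and learning a joint classifier is a common alternative.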

3D convolutions: C3D, I3D

Instead of processing frames independently and fusing later, convolve across time directly.

A 2D conv kernel is (k, k) — spatial only. A 3D conv kernel is (t, k, k) — spatiotemporal. It captures short-range temporal patterns within the convolution window.
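The shape difference is easy to see in PyTorch (channel counts and spatial sizes below are arbitrary):

```python
import torch
import torch.nn as nn

# 2D conv: kernel (k, k) slides over H and W of a single frame
conv2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)
frame = torch.randn(1, 3, 112, 112)        # (B, C, H, W)
print(conv2d(frame).shape)                  # (1, 16, 112, 112)

# 3D conv: kernel (t, k, k) also slides along the time axis
conv3d = nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1)
clip = torch.randn(1, 3, 16, 112, 112)     # (B, C, T, H, W)
print(conv3d(clip).shape)                   # (1, 16, 16, 112, 112)
```

With a temporal kernel size of 3, each output activation sees only 3 consecutive frames; long-range temporal context comes from stacking layers.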

Model            Architecture        Key idea
C3D (2015)       3D VGG-like         First 3D CNN for video; fixed 16-frame clips
I3D (2017)       Inflated Inception  "Inflate" pretrained 2D weights to 3D; much better than training from scratch
SlowFast (2019)  Two pathways        Slow path: low frame rate, spatial detail. Fast path: high frame rate, temporal detail
R(2+1)D (2018)   Factored 3D conv    Split 3D conv into spatial 2D + temporal 1D; easier to optimize

Video transformers

Transformers naturally handle sequences, making them a good fit for video.

Model               Key idea
TimeSformer (2021)  Divided attention: spatial attention + temporal attention (not full spatiotemporal)
ViViT (2021)        Various factorizations of spatiotemporal attention
VideoMAE (2022)     Masked autoencoder for video: mask 90% of spatiotemporal patches, reconstruct. Excellent self-supervised pretraining
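The divided-attention idea can be sketched with standard PyTorch modules. This is an illustrative approximation, not the official TimeSformer implementation; the shapes (8 frames, 49 patches, 64-dim embeddings) are assumptions for the demo, and residuals/MLPs are omitted.

```python
import torch
import torch.nn as nn

B, T, N, D = 2, 8, 49, 64   # batch, frames, patches per frame, embed dim
x = torch.randn(B, T, N, D)

temporal_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
spatial_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

# Temporal attention: each patch position attends across frames (seq len T)
xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
xt, _ = temporal_attn(xt, xt, xt)
x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)

# Spatial attention: each frame's patches attend to each other (seq len N)
xs = x.reshape(B * T, N, D)
xs, _ = spatial_attn(xs, xs, xs)
x = xs.reshape(B, T, N, D)
```

The payoff is cost: each token attends over T then N positions instead of all T*N, which is what makes attention tractable for video.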
# Action recognition with torchvision video model
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights
 
weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights)
model.eval()
 
preprocess = weights.transforms()
 
# Input: video clip as tensor (T, C, H, W) -- e.g., 16 frames
# For demo, create a random uint8 clip (values 0-255)
clip = torch.randint(0, 256, (16, 3, 112, 112), dtype=torch.uint8)
input_tensor = preprocess(clip).unsqueeze(0)  # (1, C, T, H, W)
 
with torch.no_grad():
    output = model(input_tensor)
    pred = output.argmax(1).item()
 
categories = weights.meta["categories"]
print(f"Predicted action: {categories[pred]}")

Loading real video clips

import cv2
import torch
import numpy as np
 
def load_video_clip(path, n_frames=16, size=(112, 112)):
    """Load a video clip as a tensor for action recognition.
    Returns: (n_frames, 3, H, W) uint8 tensor.
    """
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
 
    # Sample n_frames evenly spaced
    indices = np.linspace(0, total - 1, n_frames, dtype=int)
    frames = []
 
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.resize(frame, size)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame)
 
    cap.release()

    if not frames:
        raise ValueError(f"could not read any frames from {path}")

    # (T, H, W, C) -> (T, C, H, W)
    clip = np.stack(frames)
    clip = torch.from_numpy(clip).permute(0, 3, 1, 2)
    return clip

Temporal action detection

Action recognition classifies short clips (“this is a fight”). Temporal action detection locates when actions occur in long untrimmed video (“fight from 02:14 to 02:31”).

Pipeline:

  1. Extract features per snippet (e.g., I3D features for 16-frame windows)
  2. Temporal model predicts action classes + boundaries across the full timeline
  3. Post-processing: threshold, merge adjacent segments, NMS

This is like object detection but in 1D (time axis) instead of 2D (spatial).
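Step 3 of the pipeline can be sketched in 1D. This is a minimal version assuming one action class and per-snippet scores already computed; real systems add per-class handling and temporal NMS.

```python
import numpy as np

def segments_from_scores(scores, threshold=0.5, fps=1.0):
    """Turn per-snippet action scores into (start, end) segments in seconds:
    threshold, then merge runs of consecutive positive snippets.
    """
    active = scores >= threshold
    segments = []
    start = None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                      # segment opens
        elif not a and start is not None:
            segments.append((start / fps, i / fps))  # segment closes
            start = None
    if start is not None:                  # segment runs to the end
        segments.append((start / fps, len(active) / fps))
    return segments

scores = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.9, 0.2])
print(segments_from_scores(scores))  # [(2.0, 5.0), (6.0, 8.0)]
```

Merging adjacent segments separated by short gaps (e.g. a single low-score snippet) is a common extra post-processing step.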

Anomaly detection in video

Detect unusual events in surveillance video without explicit labels for anomalies (because you can’t enumerate all possible anomalies).

Approach: learn normal, detect deviations

  1. Train on normal video only: learn what normal looks like (autoencoder, prediction model)
  2. At test time: high reconstruction error or prediction error = anomaly
import torch
import torch.nn as nn
 
class FramePredictorSimple(nn.Module):
    """Predict next frame from previous frames.
    High prediction error = something unusual is happening.
    """
    def __init__(self, n_input_frames=4):
        super().__init__()
        # Simple conv model: input is n stacked grayscale frames
        self.encoder = nn.Sequential(
            nn.Conv2d(n_input_frames, 32, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )
 
    def forward(self, x):
        """x: (batch, n_input_frames, H, W) grayscale frames.
        Returns: predicted next frame (batch, 1, H, W).
        """
        return self.decoder(self.encoder(x))
 
# Training: minimize MSE between predicted and actual next frame on normal video
# Inference: compute MSE per frame. Spike in error = anomaly
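The inference loop might look like the sketch below. The `model` here is a trivial stand-in (it just predicts the mean of the context frames) so the snippet is self-contained; in practice it would be the trained FramePredictorSimple above, and the 3-sigma threshold is one common heuristic, not a fixed rule.

```python
import torch

# Stand-in predictor: in practice, a trained next-frame model
model = lambda x: x.mean(dim=1, keepdim=True)

video = torch.rand(100, 1, 64, 64)   # (T, 1, H, W) grayscale frames in [0, 1]
n_in = 4
errors = []
with torch.no_grad():
    for t in range(n_in, video.shape[0]):
        context = video[t - n_in:t].permute(1, 0, 2, 3)  # (1, n_in, H, W)
        pred = model(context)                             # (1, 1, H, W)
        err = torch.mean((pred - video[t:t + 1]) ** 2).item()
        errors.append(err)

# Flag frames whose error is far above the mean (3-sigma rule of thumb)
errors = torch.tensor(errors)
threshold = errors.mean() + 3 * errors.std()
anomalies = (errors > threshold).nonzero().flatten() + n_in
```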

Methods:

  • Reconstruction-based: autoencoder trained on normal video. Anomalies have high reconstruction error.
  • Prediction-based: predict next frame from previous. Anomalies are unpredictable.
  • Feature-based: extract features per frame, model normal feature distribution, flag outliers (One-Class SVM, GMM).
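The feature-based approach can be sketched with a simple diagonal Gaussian model of normal features (a GMM or One-Class SVM would be drop-in replacements). The random arrays below are stand-ins for per-frame CNN embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
normal_feats = rng.normal(0, 1, size=(500, 512))  # training: normal video only

# Fit the "normal" model: per-dimension mean and std
mu = normal_feats.mean(axis=0)
sigma = normal_feats.std(axis=0) + 1e-8

def anomaly_score(feat):
    """Mean squared z-score of a frame's features under the normal model."""
    z = (feat - mu) / sigma
    return float(np.mean(z ** 2))

normal_frame = rng.normal(0, 1, size=512)
odd_frame = rng.normal(3, 1, size=512)   # shifted distribution = unusual frame
print(anomaly_score(normal_frame), anomaly_score(odd_frame))
```

Frames are flagged when their score exceeds a threshold calibrated on held-out normal data.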

Applications

  • Surveillance activity detection: identify specific activities (package delivery, loitering, intrusion) in camera feeds
  • Drone video analysis: classify terrain, detect activities, assess damage from aerial video
  • Anomaly detection: detect unusual events in industrial, traffic, or security cameras
  • Sports analytics: track plays, classify actions, generate highlights
  • Content moderation: detect violent or prohibited content in video

Defense/security specific

  • UAV ISR: automated analysis of surveillance drone video. Flag events of interest, reducing operator workload
  • Pattern of life analysis: video-level understanding of routines at a location. Deviations from the established routine may indicate noteworthy activity
  • Force protection: real-time detection of threats (vehicle approach, perimeter breach) from fixed cameras

Building a simple video classifier

import torch
import torch.nn as nn
from torchvision.models import resnet18
 
class FrameAverageClassifier(nn.Module):
    """Simplest possible video classifier:
    extract features per frame with CNN, average them, classify.
    """
    def __init__(self, n_classes):
        super().__init__()
        backbone = resnet18(weights="DEFAULT")
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # remove FC
        self.classifier = nn.Linear(512, n_classes)
 
    def forward(self, clip):
        """clip: (batch, T, C, H, W)"""
        B, T, C, H, W = clip.shape
        # Process all frames at once
        frames = clip.reshape(B * T, C, H, W)
        feats = self.features(frames).squeeze(-1).squeeze(-1)  # (B*T, 512)
        feats = feats.reshape(B, T, -1)  # (B, T, 512)
        # Average pool across time
        pooled = feats.mean(dim=1)  # (B, 512)
        return self.classifier(pooled)
 
# This is a baseline. Better: use temporal attention or 3D convolutions.

Self-test questions

  1. Why is a single RGB frame insufficient for many video understanding tasks? Give an example.
  2. What is the key difference between two-stream networks and 3D convolutions?
  3. How does temporal action detection differ from action recognition?
  4. Why is anomaly detection typically trained on normal data only?
  5. What is the advantage of SlowFast’s two-pathway design over a single 3D CNN?

Exercises

  1. Action recognition: Use torchvision’s R3D-18 model to classify 5 video clips (download from Kinetics or record your own). Report predicted class and confidence.
  2. Frame-level baseline: Build the FrameAverageClassifier above, train on UCF-101 (small subset), compare accuracy with a proper 3D model.
  3. Anomaly detection: Record 5 minutes of “normal” activity from a webcam. Train a simple frame predictor. Then introduce anomalies (sudden movement, new object) and plot prediction error over time.