Pose Estimation

What

Detect human body keypoints (joints) in images or video. The output is a set of 2D or 3D coordinates for each joint — shoulders, elbows, wrists, hips, knees, ankles, etc.

This gives you the skeleton of a person without needing to understand their appearance. From the skeleton alone you can determine posture, action, gesture, and intent.

Keypoint representations

COCO 17-keypoint format

The standard. Used by most models and datasets.

Keypoints: nose, left_eye, right_eye, left_ear, right_ear,
           left_shoulder, right_shoulder, left_elbow, right_elbow,
           left_wrist, right_wrist, left_hip, right_hip,
           left_knee, right_knee, left_ankle, right_ankle

Skeleton connections:
  nose -> left_eye -> left_ear
  nose -> right_eye -> right_ear
  left_shoulder -> left_elbow -> left_wrist
  right_shoulder -> right_elbow -> right_wrist
  left_shoulder -> right_shoulder (torso top)
  left_shoulder -> left_hip
  right_shoulder -> right_hip
  left_hip -> right_hip (torso bottom)
  left_hip -> left_knee -> left_ankle
  right_hip -> right_knee -> right_ankle

Each keypoint has: (x, y, visibility). Visibility: 0 = not labeled, 1 = labeled but occluded, 2 = labeled and visible.
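The format above can be written down directly as Python data. The index order below follows the standard COCO order (so, e.g., left_shoulder is index 5); the small helper at the end shows how to decode a flat COCO annotation list:

```python
# COCO 17-keypoint format as Python data (indices follow the standard order).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Skeleton edges as (index, index) pairs, matching the connections above.
COCO_SKELETON = [
    (0, 1), (1, 3), (0, 2), (2, 4),         # head
    (5, 7), (7, 9), (6, 8), (8, 10),        # arms
    (5, 6), (5, 11), (6, 12), (11, 12),     # torso
    (11, 13), (13, 15), (12, 14), (14, 16), # legs
]

def visible_keypoints(annotation):
    """Yield (name, x, y) for keypoints labeled as visible (v == 2).
    `annotation` is a flat COCO list: [x1, y1, v1, x2, y2, v2, ...]."""
    for i, name in enumerate(COCO_KEYPOINTS):
        x, y, v = annotation[3 * i : 3 * i + 3]
        if v == 2:
            yield name, x, y
```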

Approaches

Top-down

  1. Detect people with an object detector (bounding boxes)
  2. Crop each person
  3. Run pose estimator on each crop

Pros: high accuracy per person (the model focuses on one person at a time). Cons: runtime scales with the number of people, and overall quality depends on the detector — a missed detection means a missed pose.
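The three steps can be sketched as a loop. Here `detector` and `pose_model` are placeholders for real models (e.g. a person detector plus an HRNet-style estimator):

```python
import numpy as np

def top_down_pose(image, detector, pose_model):
    """Top-down pipeline sketch. `detector` returns (x1, y1, x2, y2) person
    boxes; `pose_model` returns (17, 2) keypoints in crop coordinates.
    Both are placeholders for real models."""
    people = []
    for x1, y1, x2, y2 in detector(image):               # 1. detect people
        crop = image[y1:y2, x1:x2]                       # 2. crop each person
        kps = np.asarray(pose_model(crop), dtype=float)  # 3. pose per crop
        kps[:, 0] += x1                                  # map back to image coords
        kps[:, 1] += y1
        people.append(kps)
    return people
```

Note the coordinate shift at the end: the pose model works in crop coordinates, so each skeleton must be translated back into the full image frame.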

Bottom-up

  1. Detect all keypoints in the image at once (all people)
  2. Group keypoints into individual skeletons

Pros: runtime independent of number of people. Cons: harder to get right, especially in crowds.

How heatmap detection works

Most pose models predict one heatmap per keypoint type — a probability map, usually at a fraction of the input resolution, where high values indicate likely keypoint locations.

Input image (256x192) → CNN backbone → Feature maps
→ Deconv/upsampling → 17 heatmaps (64x48 each)
→ Find peak in each heatmap → keypoint coordinates

The heatmap is typically a 2D Gaussian centered on the ground-truth keypoint location during training. At inference, argmax of each heatmap gives the keypoint position.
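A minimal sketch of both sides, using the 256x192 input and 64x48 heatmap sizes from the diagram above:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Training target: a 2D Gaussian centered on the ground-truth keypoint."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_heatmap(heatmap, input_size, heatmap_size):
    """Inference: argmax of the heatmap, scaled back to input-image coords."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    sx = input_size[0] / heatmap_size[0]  # e.g. 192 / 48 = 4
    sy = input_size[1] / heatmap_size[1]  # e.g. 256 / 64 = 4
    return float(x) * sx, float(y) * sy

hm = gaussian_heatmap(64, 48, cx=20, cy=30)
print(decode_heatmap(hm, input_size=(192, 256), heatmap_size=(48, 64)))  # (80.0, 120.0)
```

Real implementations usually refine the raw argmax to sub-pixel precision, e.g. by shifting a quarter pixel toward the strongest neighboring cell.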

Key models

Model            Type       Notes
OpenPose (2017)  Bottom-up  First real-time multi-person system. Uses Part Affinity Fields to group keypoints
HRNet (2019)     Top-down   Maintains high-resolution features throughout. Very accurate
MediaPipe Pose   Top-down   Google; runs on mobile/browser. 33 keypoints including hands/face
ViTPose (2022)   Top-down   Vision Transformer backbone. State of the art on COCO
RTMPose (2023)   Top-down   Real-time with competitive accuracy. MMPose framework

Quick demo: MediaPipe

The fastest way to get pose estimation running. Works on CPU.

import cv2
import mediapipe as mp
 
mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils
 
# Process a single image
pose = mp_pose.Pose(
    static_image_mode=True,
    model_complexity=2,       # 0=lite, 1=full, 2=heavy
    min_detection_confidence=0.5,
)
 
image = cv2.imread("person.jpg")
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pose.process(rgb)
 
if results.pose_landmarks:
    # Draw skeleton on image
    mp_draw.draw_landmarks(
        image, results.pose_landmarks, mp_pose.POSE_CONNECTIONS
    )
 
    # Access individual keypoints
    landmarks = results.pose_landmarks.landmark
    left_shoulder = landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER]
    print(f"Left shoulder: x={left_shoulder.x:.3f}, y={left_shoulder.y:.3f}, "
          f"visibility={left_shoulder.visibility:.3f}")
    # x,y are normalized [0,1] relative to image dimensions
 
cv2.imwrite("pose_output.jpg", image)
pose.close()

Process video with MediaPipe

import cv2
import mediapipe as mp
 
mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils
 
cap = cv2.VideoCapture("video.mp4")
pose = mp_pose.Pose(
    static_image_mode=False,  # video mode: use temporal smoothing
    model_complexity=1,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)
 
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
 
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = pose.process(rgb)
 
    if results.pose_landmarks:
        mp_draw.draw_landmarks(
            frame, results.pose_landmarks, mp_pose.POSE_CONNECTIONS
        )
 
    cv2.imshow("Pose", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
 
cap.release()
cv2.destroyAllWindows()
pose.close()

torchvision KeypointRCNN

For more control and batch processing.

import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn
from torchvision.transforms import functional as F
from PIL import Image

model = keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("person.jpg").convert("RGB")  # model expects 3 channels
tensor = F.to_tensor(image)
 
with torch.no_grad():
    predictions = model([tensor])[0]
 
# predictions contains:
#   boxes: [N, 4] bounding boxes
#   scores: [N] per-person confidence scores
#   keypoints: [N, 17, 3] -- x, y, visibility for each of 17 COCO keypoints
#   keypoints_scores: [N, 17] per-keypoint confidence

for i in range(len(predictions["scores"])):
    if predictions["scores"][i] > 0.9:
        kps = predictions["keypoints"][i]  # (17, 3)
        print(f"Person {i}: {kps.shape[0]} keypoints detected")
        # kps[:, 0] = x coords, kps[:, 1] = y coords, kps[:, 2] = visibility

Action classification from pose sequences

Once you have keypoints, you can classify what a person is doing without any appearance information. This is powerful for privacy-preserving surveillance.

import numpy as np
 
def extract_pose_features(keypoints_sequence):
    """Convert a sequence of pose keypoints into features for classification.
    keypoints_sequence: list of (17, 2) arrays -- one per frame.
    Returns: feature vector.
    """
    features = []
    for kps in keypoints_sequence:
        # Normalize: center on hip midpoint, scale by torso length
        hip_center = (kps[11] + kps[12]) / 2  # midpoint of left_hip and right_hip
        torso_len = np.linalg.norm(kps[5] - kps[11])  # left_shoulder to left_hip
        if torso_len < 1e-6:
            torso_len = 1.0
        normalized = (kps - hip_center) / torso_len
        features.append(normalized.flatten())
 
    features = np.array(features)  # (n_frames, 34)
 
    # Add velocity features (frame-to-frame differences)
    if len(features) > 1:
        velocities = np.diff(features, axis=0)
        return np.concatenate([
            features.mean(axis=0),   # average pose
            features.std(axis=0),    # pose variation
            velocities.mean(axis=0), # average movement
            velocities.std(axis=0),  # movement variation
        ])
    return np.concatenate([features.mean(axis=0), features.std(axis=0)])
 
# Simple classifier: standing vs walking vs running
# In practice: train a small MLP or LSTM on these features
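A self-contained toy version of that suggestion with scikit-learn, using random stand-in sequences and a simplified mean+std feature so the snippet runs on its own (in practice you would feed real per-frame keypoints through extract_pose_features above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def simple_features(seq):
    """seq: list of (17, 2) keypoint arrays -> fixed-size feature vector."""
    flat = np.array([kps.flatten() for kps in seq])  # (n_frames, 34)
    return np.concatenate([flat.mean(axis=0), flat.std(axis=0)])

# Stand-in dataset: "standing" sequences barely move, "walking" ones jitter more.
X, y = [], []
for label, noise in [(0, 0.01), (1, 0.3)]:
    for _ in range(10):
        base = rng.normal(size=(17, 2))
        seq = [base + rng.normal(scale=noise, size=(17, 2)) for _ in range(30)]
        X.append(simple_features(seq))
        y.append(label)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(np.array(X), np.array(y))
print(clf.score(np.array(X), np.array(y)))  # training accuracy
```

The same shape of pipeline works for the real exercise: replace the random sequences with MediaPipe keypoints extracted from your own videos.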

Temporal extension: pose tracking

Combining pose estimation with tracking to follow specific people’s movements over time. This is where Multi-Object Tracking and pose estimation intersect.

Approaches:

  • Top-down: track person bounding boxes, estimate pose per tracked person
  • Bottom-up: detect all keypoints, group into people, then track people by skeleton similarity
  • Joint: end-to-end methods, evaluated on benchmarks like PoseTrack, that do detection, pose, and tracking together
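The "track people by skeleton similarity" idea can be sketched with a simple greedy matcher. Mean keypoint distance stands in for a proper OKS similarity here, and the max_dist threshold and greedy strategy are illustrative choices, not a standard:

```python
import numpy as np

def skeleton_distance(a, b):
    """Mean Euclidean distance between two (17, 2) skeletons, in pixels."""
    return np.linalg.norm(a - b, axis=1).mean()

def match_skeletons(prev_tracks, detections, max_dist=30.0):
    """Greedily assign detected skeletons to existing track IDs.
    prev_tracks: {track_id: (17, 2) array}; detections: list of (17, 2) arrays.
    Returns {track_id: detection_index}; unmatched detections would start new tracks."""
    assignments = {}
    used = set()
    for tid, prev in prev_tracks.items():
        candidates = [
            (skeleton_distance(prev, det), i)
            for i, det in enumerate(detections) if i not in used
        ]
        if candidates:
            d, i = min(candidates)
            if d < max_dist:
                assignments[tid] = i
                used.add(i)
    return assignments
```

A production tracker would use Hungarian assignment instead of the greedy loop and keep unmatched tracks alive for a few frames to survive occlusions.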

Applications

  • Action recognition: classify what people are doing from skeleton alone (privacy-preserving)
  • Gesture detection: hand/arm signals for human-machine interaction or military hand signals
  • Behavior analysis: detect suspicious behavior in surveillance (loitering, fighting, package drop)
  • Exercise form: auto-coach for fitness (joint angles for squat depth, etc.)
  • Drone control: gesture-based control of FPV drones
  • Ergonomics: workplace safety monitoring
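For the exercise-form bullet, joint angles fall directly out of the keypoints — e.g. the knee angle from the hip, knee, and ankle points (what angle counts as "deep enough" is a domain choice, not computed here):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by segments b->a and b->c."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Toy coordinates: a right-angle knee (hip above the knee, ankle out to the side).
hip, knee, ankle = (0, 0), (0, 1), (1, 1)
print(round(joint_angle(hip, knee, ankle)))  # 90
```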

Self-test questions

  1. What is the difference between top-down and bottom-up pose estimation, and when would you prefer each?
  2. Why do pose models output heatmaps rather than directly regressing keypoint coordinates?
  3. How can you classify human actions using only skeleton data, with no appearance features?
  4. What is the COCO keypoint format, and what does the visibility flag encode?
  5. How does temporal smoothing improve pose estimation in video compared to per-frame processing?

Exercises

  1. MediaPipe demo: Detect pose in 3 different images (standing person, sitting, action shot). Draw skeletons and print all keypoint coordinates.
  2. Action classifier: Collect 30-second videos of yourself doing 3 actions (standing, walking, waving). Extract pose per frame with MediaPipe, build feature vectors, train a simple sklearn classifier (SVM or RandomForest).
  3. Pose tracking: Process a video with multiple people. Track each person’s skeleton over time by combining bounding box tracking with per-person pose estimation.