Pose Estimation
What
Detect human body keypoints (joints) in images or video. The output is a set of 2D or 3D coordinates for each joint — shoulders, elbows, wrists, hips, knees, ankles, etc.
This gives you the skeleton of a person without needing to understand their appearance. From the skeleton alone you can determine posture, action, gesture, and intent.
Keypoint representations
COCO 17-keypoint format
The standard. Used by most models and datasets.
Keypoints: nose, left_eye, right_eye, left_ear, right_ear,
left_shoulder, right_shoulder, left_elbow, right_elbow,
left_wrist, right_wrist, left_hip, right_hip,
left_knee, right_knee, left_ankle, right_ankle
Skeleton connections:
nose -> left_eye -> left_ear
nose -> right_eye -> right_ear
left_shoulder -> left_elbow -> left_wrist
right_shoulder -> right_elbow -> right_wrist
left_shoulder -> right_shoulder (torso top)
left_shoulder -> left_hip
right_shoulder -> right_hip
left_hip -> right_hip (torso bottom)
left_hip -> left_knee -> left_ankle
right_hip -> right_knee -> right_ankle
Each keypoint has: (x, y, visibility). Visibility: 0 = not labeled, 1 = labeled but occluded, 2 = labeled and visible.
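In COCO annotation files these 17 keypoints are stored per person as one flat list of 51 numbers, `[x1, y1, v1, x2, y2, v2, ...]`. A minimal decoder sketch (the helper name and example values are illustrative, not from any library):

```python
# COCO 17-keypoint order, as listed above
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def decode_keypoints(flat):
    """flat: list of 51 numbers (17 keypoints x [x, y, visibility]).
    Returns a list of (name, x, y, visibility) tuples."""
    out = []
    for i, name in enumerate(COCO_KEYPOINTS):
        x, y, v = flat[3 * i], flat[3 * i + 1], flat[3 * i + 2]
        out.append((name, x, y, int(v)))
    return out

# Example: nose labeled and visible (v=2), all other joints unlabeled (v=0)
flat = [120.0, 80.0, 2] + [0, 0, 0] * 16
decoded = decode_keypoints(flat)
print(decoded[0])  # ('nose', 120.0, 80.0, 2)
print(sum(1 for k in decoded if k[3] == 2))  # 1 visible keypoint
```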
Approaches
Top-down
- Detect people with an object detector (bounding boxes)
- Crop each person
- Run pose estimator on each crop
Pros: high accuracy per person (the model focuses on one person at a time). Cons: runtime scales with the number of people, and accuracy depends on detector quality.
Bottom-up
- Detect all keypoints in the image at once (all people)
- Group keypoints into individual skeletons
Pros: runtime is largely independent of the number of people. Cons: the grouping step is harder to get right, especially in crowded scenes.
How heatmap detection works
Most pose models predict a heatmap for each keypoint type — a probability map the same size as the feature map where high values indicate keypoint locations.
Input image (256x192) → CNN backbone → Feature maps
→ Deconv/upsampling → 17 heatmaps (64x48 each)
→ Find peak in each heatmap → keypoint coordinates
The heatmap is typically a 2D Gaussian centered on the ground-truth keypoint location during training. At inference, argmax of each heatmap gives the keypoint position.
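Both directions can be sketched in a few lines, assuming the 256x192 input and 64x48 heatmaps from the diagram above (function names are illustrative):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Training target: 2D Gaussian centered on the ground-truth
    keypoint (cx, cy) in heatmap coordinates."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_heatmap(hm, input_w=192, input_h=256):
    """Inference: argmax of the heatmap, scaled back to input resolution."""
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    h, w = hm.shape
    return x * input_w / w, y * input_h / h

hm = gaussian_heatmap(64, 48, cx=20, cy=30)  # one of the 17 heatmaps
x, y = decode_heatmap(hm)
print(x, y)  # 80.0 120.0 -- 4x upscale from heatmap to input coordinates
```

Note the quantization this implies: the argmax only resolves to heatmap-cell precision, which is why many implementations refine the peak with a sub-pixel offset.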
Key models
| Model | Type | Notes |
|---|---|---|
| OpenPose (2017) | Bottom-up | First real-time multi-person. Uses Part Affinity Fields to group keypoints |
| HRNet (2019) | Top-down | Maintains high-resolution features throughout. Very accurate |
| MediaPipe Pose | Top-down | Google, runs on mobile/browser. 33 keypoints including hands/face |
| ViTPose (2022) | Top-down | Vision Transformer backbone. State of the art on COCO |
| RTMPose (2023) | Top-down | Real-time, competitive accuracy. MMPose framework |
Quick demo: MediaPipe
The fastest way to get pose estimation running. Works on CPU.
```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils

# Process a single image
pose = mp_pose.Pose(
    static_image_mode=True,
    model_complexity=2,  # 0=lite, 1=full, 2=heavy
    min_detection_confidence=0.5,
)

image = cv2.imread("person.jpg")
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pose.process(rgb)

if results.pose_landmarks:
    # Draw skeleton on image
    mp_draw.draw_landmarks(
        image, results.pose_landmarks, mp_pose.POSE_CONNECTIONS
    )
    # Access individual keypoints
    landmarks = results.pose_landmarks.landmark
    left_shoulder = landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER]
    print(f"Left shoulder: x={left_shoulder.x:.3f}, y={left_shoulder.y:.3f}, "
          f"visibility={left_shoulder.visibility:.3f}")
    # x, y are normalized to [0, 1] relative to image dimensions

cv2.imwrite("pose_output.jpg", image)
pose.close()
```

Process video with MediaPipe
```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture("video.mp4")
pose = mp_pose.Pose(
    static_image_mode=False,  # video mode: use temporal smoothing
    model_complexity=1,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = pose.process(rgb)
    if results.pose_landmarks:
        mp_draw.draw_landmarks(
            frame, results.pose_landmarks, mp_pose.POSE_CONNECTIONS
        )
    cv2.imshow("Pose", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
pose.close()
```

torchvision KeypointRCNN
For more control and batch processing.
```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn
from torchvision.transforms import functional as F
from PIL import Image

model = keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("person.jpg").convert("RGB")
tensor = F.to_tensor(image)

with torch.no_grad():
    predictions = model([tensor])[0]

# predictions contains:
#   boxes:     [N, 4]     bounding boxes
#   scores:    [N]        confidence scores
#   keypoints: [N, 17, 3] x, y, visibility for each of 17 COCO keypoints
for i in range(len(predictions["scores"])):
    if predictions["scores"][i] > 0.9:
        kps = predictions["keypoints"][i]  # (17, 3)
        print(f"Person {i}: {kps.shape[0]} keypoints detected")
        # kps[:, 0] = x coords, kps[:, 1] = y coords, kps[:, 2] = score
```

Action classification from pose sequences
Once you have keypoints, you can classify what a person is doing without any appearance information. This is powerful for privacy-preserving surveillance.
```python
import numpy as np

def extract_pose_features(keypoints_sequence):
    """Convert a sequence of pose keypoints into features for classification.

    keypoints_sequence: list of (17, 2) arrays -- one per frame.
    Returns: feature vector.
    """
    features = []
    for kps in keypoints_sequence:
        # Normalize: center on hip midpoint, scale by torso length
        hip_center = (kps[11] + kps[12]) / 2          # left_hip + right_hip
        torso_len = np.linalg.norm(kps[5] - kps[11])  # left shoulder to left hip
        if torso_len < 1e-6:
            torso_len = 1.0
        normalized = (kps - hip_center) / torso_len
        features.append(normalized.flatten())
    features = np.array(features)  # (n_frames, 34)

    # Add velocity features (frame-to-frame differences)
    if len(features) > 1:
        velocities = np.diff(features, axis=0)
        return np.concatenate([
            features.mean(axis=0),    # average pose
            features.std(axis=0),     # pose variation
            velocities.mean(axis=0),  # average movement
            velocities.std(axis=0),   # movement variation
        ])
    return np.concatenate([features.mean(axis=0), features.std(axis=0)])

# Simple classifier: standing vs walking vs running
# In practice: train a small MLP or LSTM on these features
```

Temporal extension: pose tracking
Combining pose estimation with tracking lets you follow specific people’s movements over time. This is where Multi-Object Tracking and pose estimation intersect.
Approaches:
- Top-down: track person bounding boxes, estimate pose per tracked person
- Bottom-up: detect all keypoints, group into people, then track people by skeleton similarity
- Joint: models like PoseTrack that do detection, pose, and tracking together
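To make the frame-to-frame association concrete, here is a deliberately simplified sketch that matches skeletons across frames by mean joint distance. Real trackers add IoU or OKS matching, motion models, and track lifecycle management; all names and the 50-pixel threshold here are illustrative assumptions:

```python
import numpy as np

def mean_joint_distance(kps_a, kps_b):
    """Average Euclidean distance between corresponding keypoints.
    kps_a, kps_b: (17, 2) arrays of pixel coordinates."""
    return float(np.linalg.norm(kps_a - kps_b, axis=1).mean())

def match_skeletons(prev_tracks, detections, max_dist=50.0):
    """Greedy association: each detection takes the track ID of the
    nearest previous skeleton within max_dist pixels.
    prev_tracks: dict {track_id: (17, 2) array} from the last frame.
    detections:  list of (17, 2) arrays for the current frame.
    Returns {track_id: keypoints}; unmatched detections get new IDs."""
    next_id = max(prev_tracks, default=-1) + 1
    assigned, used = {}, set()
    for det in detections:
        best_id, best_d = None, max_dist
        for tid, kps in prev_tracks.items():
            if tid in used:
                continue
            d = mean_joint_distance(det, kps)
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is None:  # no track close enough: start a new one
            best_id = next_id
            next_id += 1
        used.add(best_id)
        assigned[best_id] = det
    return assigned

# Two tracks from the last frame; both people moved slightly
prev = {0: np.zeros((17, 2)), 1: np.full((17, 2), 100.0)}
dets = [np.full((17, 2), 102.0), np.full((17, 2), 2.0)]
print(sorted(match_skeletons(prev, dets)))  # [0, 1] -- IDs preserved
```

Greedy matching is order-dependent; production trackers solve the assignment globally (e.g. Hungarian algorithm), as covered in the Multi-Object Tracking notes.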
Applications
- Action recognition: classify what people are doing from skeleton alone (privacy-preserving)
- Gesture detection: hand/arm signals for human-machine interaction or military hand signals
- Behavior analysis: detect suspicious behavior in surveillance (loitering, fighting, package drop)
- Exercise form: auto-coach for fitness (joint angles for squat depth, etc.)
- Drone control: gesture-based control of FPV drones
- Ergonomics: workplace safety monitoring
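Several of these applications reduce to joint-angle geometry: squat depth, for example, is the angle at the knee between the hip and ankle keypoints. A small helper sketch (illustrative, not from any library):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by points a-b-c,
    e.g. hip-knee-ankle for squat depth or shoulder-elbow-wrist."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip guards against floating-point values just outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Knee bent at a right angle: hip above knee, ankle out to the side
print(round(joint_angle((0, 0), (0, 1), (1, 1)), 1))  # 90.0
```

The same helper works on (x, y) pairs from any of the models above, as long as the keypoints are in a consistent coordinate system.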
Self-test questions
- What is the difference between top-down and bottom-up pose estimation, and when would you prefer each?
- Why do pose models output heatmaps rather than directly regressing keypoint coordinates?
- How can you classify human actions using only skeleton data, with no appearance features?
- What is the COCO keypoint format, and what does the visibility flag encode?
- How does temporal smoothing improve pose estimation in video compared to per-frame processing?
Exercises
- MediaPipe demo: Detect pose in 3 different images (standing person, sitting, action shot). Draw skeletons and print all keypoint coordinates.
- Action classifier: Collect 30-second videos of yourself doing 3 actions (standing, walking, waving). Extract pose per frame with MediaPipe, build feature vectors, train a simple sklearn classifier (SVM or RandomForest).
- Pose tracking: Process a video with multiple people. Track each person’s skeleton over time by combining bounding box tracking with per-person pose estimation.
Links
- Multi-Object Tracking — tracking people to apply per-person pose
- Convolutional Neural Networks — backbone architectures for pose models
- Video Understanding — pose sequences feed into temporal models
- Tutorial - Object Tracking Pipeline — tracking fundamentals