Tutorial - Visual SLAM Concepts
Goal
After this tutorial, you will understand how Simultaneous Localization and Mapping (SLAM) works: estimating camera pose while building a map of the environment. You will implement a simple two-frame visual odometry pipeline.
Prerequisites: Optical Flow, 3D Vision and Depth, basic linear algebra.
Time: 90-120 minutes.
What is SLAM?
You drop a drone into an unknown building. It needs to:
- Localize: know where it is (position + orientation)
- Map: build a representation of the environment
These two problems are coupled — you need a map to localize, and you need to know your location to build a map. SLAM solves both simultaneously.
┌─────────────────────────────────────────┐
│ Camera frame t │
│ ├─ Detect features │
│ ├─ Match with previous frame │
│ ├─ Estimate motion (R, t) │
│ ├─ Triangulate new 3D points │
│ ├─ Update map │
│ └─ Optimize (bundle adjustment) │
│ │
│ Loop closure: "I've been here before!" │
│ → correct accumulated drift │
└─────────────────────────────────────────┘
Step 1: Feature detection and matching
The foundation of visual SLAM: find distinctive points in images and match them across frames.
ORB features (Oriented FAST and Rotated BRIEF)
ORB is fast, free (no patent), and good enough for real-time SLAM.
```python
import cv2
import numpy as np

def detect_and_match(img1, img2, n_features=1000):
    """Detect ORB features in two images and match them.
    Returns: matched keypoint coordinates in both images.
    """
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, desc1 = orb.detectAndCompute(img1, None)
    kp2, desc2 = orb.detectAndCompute(img2, None)
    if desc1 is None or desc2 is None:
        return np.array([]), np.array([])
    # BFMatcher with Hamming distance (ORB uses binary descriptors)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=False)
    matches = bf.knnMatch(desc1, desc2, k=2)
    # Lowe's ratio test: keep matches where the best match is much better
    # than the second-best. knnMatch can return fewer than k neighbors,
    # so guard against short match lists.
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    # Extract matched point coordinates
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    return pts1, pts2

# Demo
img1 = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_002.jpg", cv2.IMREAD_GRAYSCALE)
pts1, pts2 = detect_and_match(img1, img2)
print(f"Found {len(pts1)} good matches")
```

What just happened: ORB detects corners (FAST keypoint detector) and describes the local patch around each corner (BRIEF descriptor). Matching finds corresponding points between frames. Lowe's ratio test filters out ambiguous matches by requiring the best match to be significantly better than the second-best.
Visualize matches
```python
def draw_matches(img1, img2, pts1, pts2, n_show=50):
    """Draw lines between matched points on a side-by-side canvas."""
    h1, w1 = img1.shape[:2]
    h2, w2 = img2.shape[:2]
    canvas = np.zeros((max(h1, h2), w1 + w2, 3), dtype=np.uint8)
    canvas[:h1, :w1] = cv2.cvtColor(img1, cv2.COLOR_GRAY2BGR) if img1.ndim == 2 else img1
    canvas[:h2, w1:] = cv2.cvtColor(img2, cv2.COLOR_GRAY2BGR) if img2.ndim == 2 else img2
    for i in range(min(n_show, len(pts1))):
        pt1 = tuple(pts1[i].astype(int))
        pt2 = (int(pts2[i][0]) + w1, int(pts2[i][1]))
        color = tuple(np.random.randint(50, 255, 3).tolist())
        cv2.line(canvas, pt1, pt2, color, 1)
        cv2.circle(canvas, pt1, 3, color, -1)
        cv2.circle(canvas, pt2, 3, color, -1)
    return canvas
```

Step 2: Essential matrix and camera pose
The essential matrix encodes the geometric relationship (rotation + translation) between two camera views.
```python
def estimate_pose(pts1, pts2, camera_matrix):
    """Estimate camera rotation and translation from matched points.
    camera_matrix: 3x3 intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    Returns: R (3x3 rotation), t (3x1 unit translation), inlier mask.
    """
    # Find the essential matrix with RANSAC (robust to outlier matches)
    E, mask = cv2.findEssentialMat(
        pts1, pts2, camera_matrix,
        method=cv2.RANSAC,
        prob=0.999,
        threshold=1.0,
    )
    # Decompose E into R and t.
    # There are 4 possible decompositions; recoverPose picks the one
    # where the points lie in front of both cameras.
    _, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, camera_matrix, mask=mask)
    n_inliers = int(mask.sum()) if mask is not None else 0
    print(f"Essential matrix: {n_inliers} inliers out of {len(pts1)} matches")
    return R, t, mask

# Camera intrinsics (example values). Calibrate your camera for accurate
# results, or approximate: fx = fy = focal length in pixels, cx = W/2, cy = H/2.
fx, fy = 718.856, 718.856  # focal length in pixels
cx, cy = 607.193, 185.216  # principal point
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)

R, t, mask = estimate_pose(pts1, pts2, K)
print(f"Rotation:\n{R}")
print(f"Translation (direction only): {t.flatten()}")
```

What just happened: The essential matrix captures the epipolar geometry between two views. RANSAC makes the estimation robust to mismatches. recoverPose gives us the camera rotation and the translation direction (not the scale — monocular SLAM can't recover absolute scale).
Step 3: Triangulation — getting 3D points
With the camera poses and matched 2D points, triangulate to get 3D point positions.
```python
def triangulate_points(pts1, pts2, K, R, t):
    """Triangulate 3D points from two views.
    Returns: (N, 3) array of 3D point positions.
    """
    # Projection matrices
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at origin
    P2 = K @ np.hstack([R, t])                         # second camera
    # Triangulate
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # (4, N) homogeneous
    pts3d = (pts4d[:3] / pts4d[3:]).T                      # (N, 3) Euclidean
    # Filter: keep points in front of the first camera and at a plausible depth
    valid = (pts3d[:, 2] > 0) & (pts3d[:, 2] < 100)
    pts3d_valid = pts3d[valid]
    print(f"Triangulated {len(pts3d_valid)} valid 3D points (out of {len(pts3d)})")
    return pts3d_valid

pts3d = triangulate_points(pts1, pts2, K, R, t)
```

Visualize the 3D points
```python
import matplotlib.pyplot as plt

def plot_reconstruction(pts3d, R=None, t=None):
    """Plot 3D points and camera positions."""
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(pts3d[:, 0], pts3d[:, 1], pts3d[:, 2],
               s=1, c="blue", alpha=0.5)
    # Draw camera positions (camera 2's position in world frame is -R^T t)
    ax.scatter(0, 0, 0, s=100, c="red", marker="^", label="Camera 1")
    if R is not None and t is not None:
        cam2_pos = -R.T @ t
        ax.scatter(*cam2_pos.flatten(), s=100, c="green", marker="^", label="Camera 2")
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_zlabel("Z")
    ax.legend()
    plt.title(f"{len(pts3d)} reconstructed 3D points")
    plt.savefig("reconstruction.png", dpi=150)
    plt.show()
```

Step 4: Simple two-frame visual odometry pipeline
Putting it all together: estimate camera motion between consecutive frames.
```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

class SimpleVisualOdometry:
    """Minimal visual odometry: estimate the camera trajectory from video.
    Monocular -- no absolute scale.
    """

    def __init__(self, camera_matrix, n_features=2000):
        self.K = camera_matrix
        self.n_features = n_features
        self.orb = cv2.ORB_create(nfeatures=n_features)
        self.bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=False)
        # Accumulated pose
        self.R_total = np.eye(3)
        self.t_total = np.zeros((3, 1))
        self.trajectory = [(0.0, 0.0, 0.0)]
        self.prev_kp = None
        self.prev_desc = None

    def _store(self, kp, desc):
        self.prev_kp = kp
        self.prev_desc = desc

    def process_frame(self, frame):
        """Process a new frame. Returns the current position (x, y, z)."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if frame.ndim == 3 else frame
        kp, desc = self.orb.detectAndCompute(gray, None)
        if self.prev_desc is None or desc is None:
            self._store(kp, desc)
            return self.trajectory[-1]
        # Match features (guard against knnMatch returning short lists)
        matches = self.bf.knnMatch(self.prev_desc, desc, k=2)
        good = [p[0] for p in matches
                if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
        if len(good) < 8:
            # Not enough matches -- skip this frame
            self._store(kp, desc)
            return self.trajectory[-1]
        pts1 = np.float32([self.prev_kp[m.queryIdx].pt for m in good])
        pts2 = np.float32([kp[m.trainIdx].pt for m in good])
        # Estimate motion
        E, mask = cv2.findEssentialMat(pts1, pts2, self.K,
                                       method=cv2.RANSAC, prob=0.999, threshold=1.0)
        if E is None:
            self._store(kp, desc)
            return self.trajectory[-1]
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, self.K, mask=mask)
        # Accumulate pose (compose transformations)
        self.t_total = self.t_total + self.R_total @ t
        self.R_total = R @ self.R_total
        pos = self.t_total.flatten()
        self.trajectory.append((pos[0], pos[1], pos[2]))
        self._store(kp, desc)
        return self.trajectory[-1]

    def plot_trajectory(self):
        """Plot the estimated camera trajectory (top-down view)."""
        traj = np.array(self.trajectory)
        plt.figure(figsize=(8, 8))
        plt.plot(traj[:, 0], traj[:, 2], "b-", linewidth=1)
        plt.plot(traj[0, 0], traj[0, 2], "go", markersize=10, label="Start")
        plt.plot(traj[-1, 0], traj[-1, 2], "ro", markersize=10, label="End")
        plt.xlabel("X")
        plt.ylabel("Z")
        plt.title("Visual Odometry Trajectory (top-down)")
        plt.legend()
        plt.axis("equal")
        plt.grid(True)
        plt.savefig("vo_trajectory.png", dpi=150)
        plt.show()

# Usage
# K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
# vo = SimpleVisualOdometry(K)
# cap = cv2.VideoCapture("drone_flight.mp4")
# while True:
#     ret, frame = cap.read()
#     if not ret:
#         break
#     pos = vo.process_frame(frame)
#     print(f"Position: x={pos[0]:.2f}, y={pos[1]:.2f}, z={pos[2]:.2f}")
# vo.plot_trajectory()
```

What just happened: This is the simplest visual odometry. For each new frame, we match features with the previous frame, estimate the relative motion (R, t), and accumulate the pose. The trajectory drifts over time because small errors compound — this is why real SLAM systems need bundle adjustment and loop closure.
Step 5: Bundle adjustment (concept)
Bundle adjustment jointly optimizes all camera poses and 3D points to minimize the total reprojection error across all frames.
reprojection_error = sum over all points i, cameras j:
    || observed_2d[i, j] - project(point_3d[i], camera_pose[j]) ||^2

This is a large nonlinear least-squares problem, typically solved with Levenberg-Marquardt. Libraries: g2o, GTSAM, Ceres Solver (C++), or SciPy for small problems.
Bundle adjustment is what makes SLAM accurate. Without it, errors accumulate and the trajectory drifts.
Step 6: Loop closure
When the camera revisits a previously seen location, loop closure detects this and corrects the accumulated drift.
Frame 500: "These features look like frame 12!"
→ Add constraint: pose_500 ≈ pose_12
→ Re-optimize all poses with this new constraint
→ Drift corrected
Detection methods:
- Bag of visual words: represent each frame as a histogram of visual features, compare histograms
- DBoW2: hierarchical vocabulary tree (used in ORB-SLAM)
- NetVLAD: deep learning place recognition
Full SLAM systems
| System | Type | Key features |
|---|---|---|
| ORB-SLAM3 | Feature-based | Monocular/stereo/RGB-D, visual-inertial, map reuse |
| RTAB-Map | Appearance-based | RGB-D SLAM, large-scale, ROS integration |
| LSD-SLAM | Direct | Uses pixel intensities, not features. Semi-dense maps |
| DROID-SLAM (2021) | Deep learning | End-to-end differentiable, state-of-the-art accuracy |
For practical use with drones: ORB-SLAM3 (if you have stereo or IMU) or RTAB-Map (if you have RGB-D).
Applications
- Drone autonomous navigation: navigate GPS-denied environments (indoors, tunnels, under canopy)
- Indoor mapping: build 3D floor plans from camera + IMU
- Augmented reality: precisely overlay virtual objects on the real world
- Search and rescue: map building interiors for first responders
- Military ISR: mapping from UAV video without relying on GPS (GPS can be jammed)
Self-test questions
- Why does monocular visual odometry drift over time? What causes the errors?
- Why can’t monocular SLAM recover absolute scale? What additional sensor could help?
- What is the purpose of RANSAC in essential matrix estimation?
- Explain the chicken-and-egg problem of SLAM: why are localization and mapping coupled?
- How does loop closure fix drift?
Try this next
- Multi-frame VO: Extend the SimpleVisualOdometry to use the KITTI odometry dataset. Compare your trajectory with ground truth GPS. Measure the drift rate.
- ORB-SLAM3: Install and run ORB-SLAM3 on KITTI or EuRoC dataset. Inspect the output map and trajectory.
- IMU integration: Read about visual-inertial odometry (VIO). The IMU provides scale and handles fast motion where features blur.
Links
- Optical Flow — pixel-level motion estimation, related to VO
- 3D Vision and Depth — depth estimation and 3D reconstruction
- Multi-Object Tracking — tracking objects within the SLAM map
- Case Study - CV Pipeline Design — choosing navigation approaches