Tutorial - Visual SLAM Concepts

Goal

After this tutorial, you will understand how Simultaneous Localization and Mapping (SLAM) works: estimating camera pose while building a map of the environment. You will implement a simple two-frame visual odometry pipeline.

Prerequisites: Optical Flow, 3D Vision and Depth, basic linear algebra.

Time: 90-120 minutes.

What is SLAM?

You drop a drone into an unknown building. It needs to:

  1. Localize: know where it is (position + orientation)
  2. Map: build a representation of the environment

These two problems are coupled — you need a map to localize, and you need to know your location to build a map. SLAM solves both simultaneously.

┌─────────────────────────────────────────┐
│  Camera frame t                          │
│  ├─ Detect features                      │
│  ├─ Match with previous frame            │
│  ├─ Estimate motion (R, t)              │
│  ├─ Triangulate new 3D points           │
│  ├─ Update map                           │
│  └─ Optimize (bundle adjustment)        │
│                                          │
│  Loop closure: "I've been here before!" │
│  → correct accumulated drift             │
└─────────────────────────────────────────┘

Step 1: Feature detection and matching

The foundation of visual SLAM: find distinctive points in images and match them across frames.

ORB features (Oriented FAST and Rotated BRIEF)

ORB is fast, free (no patent), and good enough for real-time SLAM.

import cv2
import numpy as np
 
def detect_and_match(img1, img2, n_features=1000):
    """Detect ORB features in two images, match them.
    Returns: matched keypoints in both images.
    """
    orb = cv2.ORB_create(nfeatures=n_features)
 
    kp1, desc1 = orb.detectAndCompute(img1, None)
    kp2, desc2 = orb.detectAndCompute(img2, None)
 
    if desc1 is None or desc2 is None:
        return np.array([]), np.array([])
 
    # BFMatcher with Hamming distance (ORB uses binary descriptors)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=False)
    matches = bf.knnMatch(desc1, desc2, k=2)
 
    # Lowe's ratio test: keep matches where the best match is clearly
    # better than the second-best. knnMatch can return fewer than 2
    # neighbors per query, so guard the unpacking.
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
 
    # Extract matched point coordinates
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
 
    return pts1, pts2
 
# Demo
img1 = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_002.jpg", cv2.IMREAD_GRAYSCALE)
 
pts1, pts2 = detect_and_match(img1, img2)
print(f"Found {len(pts1)} good matches")

What just happened: ORB detects corners (FAST keypoint detector) and describes the local patch around each corner (BRIEF descriptor). Matching finds corresponding points between frames. Lowe’s ratio test filters out ambiguous matches by requiring the best match to be significantly better than the second-best.

Visualize matches

def draw_matches(img1, img2, pts1, pts2, n_show=50):
    """Draw lines between matched points."""
    h1, w1 = img1.shape[:2]
    h2, w2 = img2.shape[:2]
 
    canvas = np.zeros((max(h1, h2), w1 + w2, 3), dtype=np.uint8)
    canvas[:h1, :w1] = cv2.cvtColor(img1, cv2.COLOR_GRAY2BGR) if img1.ndim == 2 else img1
    canvas[:h2, w1:] = cv2.cvtColor(img2, cv2.COLOR_GRAY2BGR) if img2.ndim == 2 else img2
 
    for i in range(min(n_show, len(pts1))):
        pt1 = tuple(pts1[i].astype(int))
        pt2 = (int(pts2[i][0]) + w1, int(pts2[i][1]))
        color = tuple(np.random.randint(50, 255, 3).tolist())
        cv2.line(canvas, pt1, pt2, color, 1)
        cv2.circle(canvas, pt1, 3, color, -1)
        cv2.circle(canvas, pt2, 3, color, -1)
 
    return canvas

Step 2: Essential matrix and camera pose

The essential matrix encodes the geometric relationship (rotation + translation) between two camera views.

def estimate_pose(pts1, pts2, camera_matrix):
    """Estimate camera rotation and translation from matched points.
    camera_matrix: 3x3 intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    Returns: R (3x3 rotation), t (3x1 translation), inlier mask.
    """
    # Find essential matrix using RANSAC (robust to outliers)
    E, mask = cv2.findEssentialMat(
        pts1, pts2, camera_matrix,
        method=cv2.RANSAC,
        prob=0.999,
        threshold=1.0,
    )
 
    # Decompose E into R and t
    # There are 4 possible decompositions; recoverPose picks the one
    # where points are in front of both cameras
    _, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, camera_matrix, mask=mask)
 
    n_inliers = int(mask.sum()) if mask is not None else 0
    print(f"Essential matrix: {n_inliers} inliers out of {len(pts1)} matches")
 
    return R, t, mask
 
# Camera intrinsics -- calibrate your camera for accurate results,
# or use approximate values: fx=fy=focal_length_pixels, cx=W/2, cy=H/2.
# The example values below are from the KITTI odometry dataset.
fx, fy = 718.856, 718.856  # focal length in pixels
cx, cy = 607.193, 185.216  # principal point
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)
 
R, t, mask = estimate_pose(pts1, pts2, K)
print(f"Rotation:\n{R}")
print(f"Translation (direction only): {t.flatten()}")

What just happened: The essential matrix captures the epipolar geometry between two views. RANSAC makes the estimation robust to mismatches. recoverPose gives us the camera rotation and translation direction (not scale — monocular SLAM can’t recover absolute scale).
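The epipolar constraint is the property that defines the essential matrix: for normalized image coordinates, x2ᵀ E x1 = 0 for every true correspondence, where E = [t]× R. A numpy-only sanity check on synthetic data (every value below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scene: random 3D points in front of camera 1
pts3d = rng.uniform([-2, -2, 4], [2, 2, 10], size=(100, 3))

# Relative pose of camera 2: small rotation about Y plus a sideways translation
theta = 0.1
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0,             1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.0])

# Normalized image coordinates (K = I): project, then divide by depth
x1 = pts3d / pts3d[:, 2:]           # camera 1 sits at the origin
p2 = pts3d @ R.T + t                # same points in camera-2 coordinates
x2 = p2 / p2[:, 2:]

# Essential matrix: E = [t]_x R (cross-product matrix of t, times R)
tx = np.array([[0, -t[2], t[1]],
               [t[2], 0, -t[0]],
               [-t[1], t[0], 0]])
E = tx @ R

# Epipolar constraint: x2^T E x1 should vanish for every correspondence
residuals = np.einsum("ni,ij,nj->n", x2, E, x1)
print(f"max |x2^T E x1| = {np.abs(residuals).max():.2e}")  # numerically zero
```

With real matches the residuals are never exactly zero; RANSAC's `threshold` parameter bounds how far a match may deviate and still count as an inlier.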

Step 3: Triangulation — getting 3D points

With the camera poses and matched 2D points, triangulate to get 3D point positions.

def triangulate_points(pts1, pts2, K, R, t):
    """Triangulate 3D points from two views.
    Returns: (N, 3) array of 3D point positions.
    """
    # Projection matrices
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])    # first camera at origin
    P2 = K @ np.hstack([R, t])                             # second camera
 
    # Triangulate
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # (4, N) homogeneous
    pts3d = (pts4d[:3] / pts4d[3:]).T  # (N, 3) Euclidean
 
    # Filter: keep points in front of both cameras and within reasonable depth
    depth2 = (pts3d @ R.T + t.ravel())[:, 2]  # depth in the second camera's frame
    valid = (pts3d[:, 2] > 0) & (depth2 > 0) & (pts3d[:, 2] < 100)
    pts3d_valid = pts3d[valid]
 
    print(f"Triangulated {len(pts3d_valid)} valid 3D points (out of {len(pts3d)})")
    return pts3d_valid
 
pts3d = triangulate_points(pts1, pts2, K, R, t)
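Under the hood, cv2.triangulatePoints solves a small homogeneous linear system per point (the direct linear transform, DLT). A numpy-only sketch of that idea, checked against a synthetic point (the cameras and point here are made-up values):

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """DLT triangulation of a single point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: 2D image points.
    """
    # Each view contributes two linear constraints on the homogeneous 3D point
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector for the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Synthetic check with K = I: camera 1 at the origin, camera 2 one unit right
R = np.eye(3)
t = np.array([[-1.0], [0.0], [0.0]])      # X2 = R X1 + t, so center = -R.T @ t
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([R, t])

X_true = np.array([0.2, -0.1, 4.0])
x1h = P1 @ np.append(X_true, 1.0)
x2h = P2 @ np.append(X_true, 1.0)
x1, x2 = x1h[:2] / x1h[2], x2h[:2] / x2h[2]

print(triangulate_dlt(P1, P2, x1, x2))    # recovers [0.2, -0.1, 4.0]
```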

Visualize the 3D points

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
 
def plot_reconstruction(pts3d, R=None, t=None):
    """Plot 3D points and camera positions."""
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection="3d")
 
    ax.scatter(pts3d[:, 0], pts3d[:, 1], pts3d[:, 2],
               s=1, c="blue", alpha=0.5)
 
    # Draw camera positions
    ax.scatter(0, 0, 0, s=100, c="red", marker="^", label="Camera 1")
    if R is not None and t is not None:
        cam2_pos = -R.T @ t
        ax.scatter(*cam2_pos.flatten(), s=100, c="green", marker="^", label="Camera 2")
 
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_zlabel("Z")
    ax.legend()
    plt.title(f"{len(pts3d)} reconstructed 3D points")
    plt.savefig("reconstruction.png", dpi=150)
    plt.show()

Step 4: Simple two-frame visual odometry pipeline

Putting it all together: estimate camera motion between consecutive frames.

import cv2
import numpy as np
import matplotlib.pyplot as plt
 
class SimpleVisualOdometry:
    """Minimal visual odometry: estimate camera trajectory from video.
    Monocular -- no absolute scale.
    """
    def __init__(self, camera_matrix, n_features=2000):
        self.K = camera_matrix
        self.n_features = n_features
        self.orb = cv2.ORB_create(nfeatures=n_features)
        self.bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=False)
 
        # Accumulated pose
        self.R_total = np.eye(3)
        self.t_total = np.zeros((3, 1))
        self.trajectory = [(0.0, 0.0, 0.0)]
        self.prev_gray = None
        self.prev_kp = None
        self.prev_desc = None
 
    def process_frame(self, frame):
        """Process a new frame. Returns current position (x, y, z)."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if frame.ndim == 3 else frame
 
        kp, desc = self.orb.detectAndCompute(gray, None)
 
        if desc is None or self.prev_desc is None:
            self.prev_gray = gray
            self.prev_kp = kp
            self.prev_desc = desc
            return self.trajectory[-1]
 
        # Match features
        matches = self.bf.knnMatch(self.prev_desc, desc, k=2)
        good = [p[0] for p in matches
                if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
 
        if len(good) < 8:
            # Not enough matches -- skip frame
            self.prev_gray = gray
            self.prev_kp = kp
            self.prev_desc = desc
            return self.trajectory[-1]
 
        pts1 = np.float32([self.prev_kp[m.queryIdx].pt for m in good])
        pts2 = np.float32([kp[m.trainIdx].pt for m in good])
 
        # Estimate motion
        E, mask = cv2.findEssentialMat(pts1, pts2, self.K,
                                        method=cv2.RANSAC, prob=0.999, threshold=1.0)
        if E is None:
            self.prev_gray = gray
            self.prev_kp = kp
            self.prev_desc = desc
            return self.trajectory[-1]
 
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, self.K, mask=mask)
 
        # Accumulate pose. Compose on the right: the relative motion (R, t)
        # is expressed in the previous camera's frame.
        self.t_total = self.t_total + self.R_total @ t
        self.R_total = self.R_total @ R
 
        pos = self.t_total.flatten()
        self.trajectory.append((pos[0], pos[1], pos[2]))
 
        self.prev_gray = gray
        self.prev_kp = kp
        self.prev_desc = desc
 
        return self.trajectory[-1]
 
    def plot_trajectory(self):
        """Plot the estimated camera trajectory (top-down view)."""
        traj = np.array(self.trajectory)
        plt.figure(figsize=(8, 8))
        plt.plot(traj[:, 0], traj[:, 2], "b-", linewidth=1)
        plt.plot(traj[0, 0], traj[0, 2], "go", markersize=10, label="Start")
        plt.plot(traj[-1, 0], traj[-1, 2], "ro", markersize=10, label="End")
        plt.xlabel("X")
        plt.ylabel("Z")
        plt.title("Visual Odometry Trajectory (top-down)")
        plt.legend()
        plt.axis("equal")
        plt.grid(True)
        plt.savefig("vo_trajectory.png", dpi=150)
        plt.show()
 
# Usage
# K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
# vo = SimpleVisualOdometry(K)
# cap = cv2.VideoCapture("drone_flight.mp4")
# while True:
#     ret, frame = cap.read()
#     if not ret: break
#     pos = vo.process_frame(frame)
#     print(f"Position: x={pos[0]:.2f}, y={pos[1]:.2f}, z={pos[2]:.2f}")
# vo.plot_trajectory()

What just happened: This is the simplest visual odometry. For each new frame, we match features with the previous frame, estimate the relative motion (R, t), and accumulate the pose. The trajectory drifts over time because small errors compound — this is why real SLAM systems need bundle adjustment and loop closure.
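How fast does drift grow? A numpy-only thought experiment (the bias value is hypothetical): the camera drives straight ahead one unit per frame, but every per-frame rotation estimate carries a tiny systematic heading error. The heading error accumulates linearly, so the position error grows much faster than linearly:

```python
import numpy as np

def rot_y(a):
    """Rotation by angle a (radians) about the Y axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

n_frames = 500
bias = 1e-3  # hypothetical per-frame heading error, about 0.057 degrees

# Accumulate pose the same way the VO class does, but with a known error
R_total = np.eye(3)
t_total = np.zeros(3)
step = np.array([0.0, 0.0, 1.0])  # true motion: one unit forward per frame
for _ in range(n_frames):
    R_total = R_total @ rot_y(bias)
    t_total = t_total + R_total @ step

true_end = np.array([0.0, 0.0, float(n_frames)])
drift = np.linalg.norm(t_total - true_end)
print(f"after {n_frames} frames: {drift:.1f} units of drift "
      f"({100 * drift / n_frames:.1f}% of distance traveled)")
```

A heading error of a twentieth of a degree per frame is enough to pull the estimate roughly a quarter of the traveled distance off course; this compounding is exactly what bundle adjustment and loop closure exist to fight.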

Step 5: Bundle adjustment (concept)

Bundle adjustment jointly optimizes all camera poses and 3D points to minimize the total reprojection error across all frames.

reprojection_error = sum over all points i, cameras j:
    || observed_2d[i,j] - project(point_3d[i], camera_pose[j]) ||^2

This is a large nonlinear least-squares problem, typically solved with the Levenberg-Marquardt algorithm. Libraries: g2o, GTSAM, and Ceres Solver (C++), or SciPy for small problems.

Bundle adjustment is what makes SLAM accurate. Without it, errors accumulate and the trajectory drifts.
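To make the objective concrete, here is a deliberately tiny bundle adjustment in SciPy: one unknown 3D point seen by three cameras with known poses, solved with Levenberg-Marquardt. (Real BA optimizes poses and thousands of points jointly and exploits the sparse structure of the Jacobian; every value below is synthetic.)

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)

# Known camera positions along the X axis (identity rotations, K = I)
cam_ts = [np.array([x, 0.0, 0.0]) for x in (0.0, 1.0, 2.0)]
true_point = np.array([0.5, 0.3, 5.0])

def project(point, cam_t):
    """Pinhole projection for a camera at cam_t looking down +Z."""
    p = point - cam_t
    return p[:2] / p[2]

# Noisy 2D observations of the point in each camera
obs = [project(true_point, t) + rng.normal(0, 1e-3, 2) for t in cam_ts]

def residuals(point):
    """Stacked reprojection errors over all cameras -- the BA objective."""
    return np.concatenate([project(point, t) - o for t, o in zip(cam_ts, obs)])

# Minimize total squared reprojection error with Levenberg-Marquardt
result = least_squares(residuals, x0=np.array([0.0, 0.0, 2.0]), method="lm")
print("estimated point:", result.x)  # close to [0.5, 0.3, 5.0]
```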

Step 6: Loop closure

When the camera revisits a previously seen location, loop closure detects this and corrects the accumulated drift.

Frame 500: "These features look like frame 12!"
→ Add constraint: pose_500 ≈ pose_12
→ Re-optimize all poses with this new constraint
→ Drift corrected

Detection methods:

  • Bag of visual words: represent each frame as a histogram of visual features, compare histograms
  • DBoW2: hierarchical vocabulary tree (used in ORB-SLAM)
  • NetVLAD: deep learning place recognition
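The bag-of-visual-words idea reduces to three steps: assign each descriptor to its nearest vocabulary word, build a normalized word histogram per frame, and compare histograms. A numpy-only toy with a made-up 8-word vocabulary (real systems learn the vocabulary with k-means over millions of training descriptors):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical vocabulary: 8 "visual words" in a 32-D descriptor space
vocab = rng.normal(size=(8, 32))

def bow_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest word; return a normalized histogram."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()

def similarity(h1, h2):
    """Cosine similarity between two frame histograms."""
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))

# Frames A and B see the same place (descriptors near words 0-3);
# frame C sees somewhere else (descriptors near words 4-7)
frame_a = vocab[rng.integers(0, 4, 200)] + rng.normal(0, 0.1, (200, 32))
frame_b = vocab[rng.integers(0, 4, 200)] + rng.normal(0, 0.1, (200, 32))
frame_c = vocab[rng.integers(4, 8, 200)] + rng.normal(0, 0.1, (200, 32))

h_a, h_b, h_c = (bow_histogram(f, vocab) for f in (frame_a, frame_b, frame_c))
print(f"same place:      {similarity(h_a, h_b):.3f}")  # high, near 1
print(f"different place: {similarity(h_a, h_c):.3f}")  # near 0
```

A loop-closure candidate is declared when the similarity against an earlier frame exceeds a threshold; the candidate is then verified geometrically (e.g. with the essential-matrix RANSAC from Step 2) before the pose graph is corrected.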

Full SLAM systems

System             Type              Key features
ORB-SLAM3          Feature-based     Monocular/stereo/RGB-D, visual-inertial, map reuse
RTAB-Map           Appearance-based  RGB-D SLAM, large-scale, ROS integration
LSD-SLAM           Direct            Uses pixel intensities, not features; semi-dense maps
DROID-SLAM (2021)  Deep learning     End-to-end differentiable, state-of-the-art accuracy

For practical use with drones: ORB-SLAM3 (if you have stereo or IMU) or RTAB-Map (if you have RGB-D).

Applications

  • Drone autonomous navigation: navigate GPS-denied environments (indoors, tunnels, under canopy)
  • Indoor mapping: build 3D floor plans from camera + IMU
  • Augmented reality: precisely overlay virtual objects on real world
  • Search and rescue: map building interiors for first responders
  • Military ISR: mapping from UAV video without relying on GPS (GPS can be jammed)

Self-test questions

  1. Why does monocular visual odometry drift over time? What causes the errors?
  2. Why can’t monocular SLAM recover absolute scale? What additional sensor could help?
  3. What is the purpose of RANSAC in essential matrix estimation?
  4. Explain the chicken-and-egg problem of SLAM: why are localization and mapping coupled?
  5. How does loop closure fix drift?

Try this next

  1. Multi-frame VO: Extend the SimpleVisualOdometry to use the KITTI odometry dataset. Compare your trajectory with ground truth GPS. Measure the drift rate.
  2. ORB-SLAM3: Install and run ORB-SLAM3 on KITTI or EuRoC dataset. Inspect the output map and trajectory.
  3. IMU integration: Read about visual-inertial odometry (VIO). The IMU provides scale and handles fast motion where features blur.