3D Vision and Depth
What
Recover 3D structure from 2D images. The world is 3D; cameras capture 2D projections. Depth estimation, stereo vision, and 3D reconstruction reverse that projection to get back the 3D geometry.
This is fundamental for anything that needs to understand physical space: drone obstacle avoidance, autonomous navigation, 3D mapping, augmented reality.
Depth representations
| Representation | Format | Source |
|---|---|---|
| Depth map | 2D image where pixel value = distance | Depth sensor, stereo, monocular estimation |
| Point cloud | Set of 3D points (x, y, z) [+ color] | LiDAR, depth camera, SfM |
| Mesh | Triangulated surface from points | 3D reconstruction |
| Voxel grid | 3D pixel grid (occupied/empty) | Volumetric reconstruction |
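The voxel-grid row above can be made concrete with a few lines of NumPy: quantize point coordinates to integer cell indices and mark cells occupied. The points and cell size below are made-up values for illustration:

```python
import numpy as np

# Hypothetical point cloud: a few points in meters
points = np.array([
    [0.12, 0.03, 1.41],
    [0.14, 0.02, 1.43],  # falls in the same cell as the first point
    [2.50, 0.90, 0.10],
])

voxel_size = 0.25  # meters per cell

# Quantize each coordinate to an integer cell index
indices = np.floor(points / voxel_size).astype(np.int64)

# A voxel is "occupied" if at least one point falls inside it
occupied = np.unique(indices, axis=0)
print(len(occupied))  # 2 occupied voxels for the 3 points above
```

The same floor-and-deduplicate idea is what `voxel_down_sample` in Open3D uses internally (keeping one representative point per cell instead of a binary flag).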
Stereo vision
Two cameras, known baseline (distance between them). Triangulate depth from the disparity between left and right views.
How it works
- Rectify images: align them so corresponding points are on the same horizontal line
- Match pixels: for each pixel in the left image, find the corresponding pixel in the right image
- Compute disparity: d = x_left - x_right (horizontal offset)
- Convert to depth: depth = (focal_length * baseline) / disparity
Closer objects have larger disparity (they shift more between views).
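Plugging numbers into the depth formula makes the inverse relationship concrete. The focal length and baseline below are KITTI-like values chosen purely for illustration:

```python
# depth = (focal_length * baseline) / disparity
focal_length = 721.5   # pixels
baseline = 0.54        # meters between the two cameras
disparity = 64.0       # pixels

depth = focal_length * baseline / disparity
print(f"{depth:.2f} m")  # ~6.09 m: a 64 px shift puts the point ~6 m away

# Halving the disparity doubles the distance: depth is inversely
# proportional to disparity, so precision degrades with range.
far_depth = focal_length * baseline / (disparity / 2)
```

This inverse relationship is why stereo depth is accurate up close but noisy at long range: a one-pixel disparity error matters far more when the disparity itself is small.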
```python
import cv2
import numpy as np

# Load stereo pair (assumed already rectified)
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-Global Block Matching (SGBM) -- the strongest classical stereo matcher in OpenCV
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,  # must be divisible by 16
    blockSize=5,
    P1=8 * 1 * 5**2,   # smoothness penalty (1 channel: grayscale input)
    P2=32 * 1 * 5**2,
    disp12MaxDiff=1,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=32,
)

# SGBM returns fixed-point disparities scaled by 16
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# Visualize
disp_normalized = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("disparity.png", disp_normalized.astype(np.uint8))

# Convert to depth (if you know the camera parameters)
# focal_length = 721.5  # pixels
# baseline = 0.54       # meters
# depth = focal_length * baseline / (disparity + 1e-6)
```

Monocular depth estimation
Estimate depth from a single image. This is inherently ambiguous (a small nearby object looks the same as a large far object), but deep learning models learn strong priors from massive training data.
MiDaS and Depth Anything
MiDaS (Intel) and Depth Anything (ByteDance/TikTok) are the leading zero-shot monocular depth models. They produce relative depth (correct only up to an unknown scale and shift) rather than absolute metric depth.
```python
import torch
import cv2
import numpy as np
from torchvision.transforms import Compose, Resize, Normalize, ToTensor

# Using Depth Anything via torch hub
model = torch.hub.load("LiheYoung/Depth-Anything", "depth_anything_vitl14", pretrained=True)
model.eval()

image = cv2.imread("scene.jpg")
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Preprocess: convert to tensor first, then resize + normalize
# (Resize expects a tensor or PIL image, not a raw numpy array)
transform = Compose([
    ToTensor(),
    Resize((518, 518)),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = transform(image_rgb).unsqueeze(0)

with torch.no_grad():
    depth = model(input_tensor)  # relative depth map

# Resize depth back to the original image size
depth_np = depth.squeeze().cpu().numpy()
depth_resized = cv2.resize(depth_np, (image.shape[1], image.shape[0]))

# Normalize for visualization
depth_vis = cv2.normalize(depth_resized, None, 0, 255, cv2.NORM_MINMAX)
depth_colored = cv2.applyColorMap(depth_vis.astype(np.uint8), cv2.COLORMAP_INFERNO)
cv2.imwrite("depth_output.png", depth_colored)
```

MiDaS via transformers
```python
import torch
from transformers import DPTForDepthEstimation, DPTImageProcessor
from PIL import Image

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
model.eval()

image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

depth = outputs.predicted_depth  # (1, H, W) relative depth
# Higher values = closer (MiDaS convention -- inverse depth)
```

Relative vs metric depth
| Type | What you get | Models |
|---|---|---|
| Relative | Depth ordering (closer/farther) | MiDaS, Depth Anything |
| Metric | Actual distances in meters | ZoeDepth, Metric3D, UniDepth |
Relative depth is enough for most vision tasks (segmentation, occlusion reasoning). Metric depth is needed for robotics and navigation.
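When a few true distances are available (e.g. from a rangefinder or sparse LiDAR), a relative prediction can be fitted to metric depth with a least-squares scale and shift in inverse-depth space, which is also how MiDaS-style models are evaluated. The predictions and ground-truth values below are synthetic:

```python
import numpy as np

# Synthetic example: the model outputs relative inverse depth, and we
# know the true metric depth at a few sparse pixels.
pred_rel = np.array([0.80, 0.40, 0.20, 0.10])  # relative inverse depth
gt_metric = np.array([2.0, 4.0, 8.0, 16.0])    # meters

gt_inv = 1.0 / gt_metric  # align in inverse-depth space (MiDaS convention)

# Solve for scale s and shift t minimizing ||s * pred_rel + t - gt_inv||^2
A = np.stack([pred_rel, np.ones_like(pred_rel)], axis=1)
(s, t), *_ = np.linalg.lstsq(A, gt_inv, rcond=None)

metric_depth = 1.0 / (s * pred_rel + t)  # aligned prediction, in meters
```

With more than two reference points the fit also averages out prediction noise; in this noiseless example the aligned depths match the ground truth exactly.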
Point clouds
A point cloud is a set of 3D points, each with (x, y, z) and optionally (r, g, b) color. This is the raw output of LiDAR sensors and the intermediate representation in most 3D reconstruction pipelines.
Creating a point cloud from depth + RGB
```python
import numpy as np
import open3d as o3d

def depth_to_pointcloud(rgb, depth, fx, fy, cx, cy):
    """Convert RGB image + depth map to a colored point cloud.

    fx, fy: focal lengths in pixels.
    cx, cy: principal point (usually near the image center).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Back-project pixels to 3D
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    # Stack into an (N, 3) point array
    valid = z > 0  # ignore invalid depth
    points = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    colors = rgb[valid].astype(np.float64) / 255.0  # normalize to [0, 1]

    # Create Open3D point cloud
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.colors = o3d.utility.Vector3dVector(colors)
    return pcd

# Example: create from depth camera output
# rgb = cv2.imread("color.png")[:, :, ::-1]              # BGR to RGB
# depth = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)  # uint16, millimeters
# depth_m = depth.astype(np.float64) / 1000.0            # convert to meters
# pcd = depth_to_pointcloud(rgb, depth_m, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
# o3d.visualization.draw_geometries([pcd])
# o3d.io.write_point_cloud("output.ply", pcd)
```

Point cloud processing with Open3D
```python
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.ply")

# Downsample (fewer points, faster processing)
pcd_down = pcd.voxel_down_sample(voxel_size=0.05)

# Remove outliers
pcd_clean, _ = pcd_down.remove_statistical_outlier(
    nb_neighbors=20, std_ratio=2.0
)

# Estimate normals (needed for mesh reconstruction)
pcd_clean.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30)
)

# Plane segmentation (e.g., find the ground plane)
plane_model, inliers = pcd_clean.segment_plane(
    distance_threshold=0.02, ransac_n=3, num_iterations=1000
)
a, b, c, d = plane_model
print(f"Ground plane: {a:.3f}x + {b:.3f}y + {c:.3f}z + {d:.3f} = 0")
```

Structure from Motion (SfM)
Reconstruct 3D structure from multiple 2D photos taken from different viewpoints. Also recovers camera poses.
Pipeline:
- Detect features (SIFT, SuperPoint) in all images
- Match features across image pairs
- Estimate camera poses (essential/fundamental matrix)
- Triangulate 3D points from matched features
- Bundle adjustment: jointly optimize camera poses and 3D points to minimize reprojection error
Tools: COLMAP (gold standard), OpenSfM, Meshroom.
SfM is the foundation of Tutorial - Visual SLAM Concepts.
LiDAR basics
LiDAR (Light Detection And Ranging) directly measures distances by sending laser pulses and timing the return.
- Spinning LiDAR (Velodyne, Ouster): 360-degree scans, used in autonomous vehicles
- Solid-state LiDAR (Livox): no moving parts, cheaper, limited FOV
- Flash LiDAR: illuminates entire scene at once, like a depth camera
Output: point cloud with (x, y, z, intensity, timestamp, ring).
LiDAR gives metric depth directly — no estimation required. But it’s expensive, heavy, and produces sparse point clouds compared to cameras.
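Turning raw LiDAR returns (range plus beam angles) into the (x, y, z) point cloud is plain spherical-to-Cartesian trigonometry. The four beams below are made-up values; a real spinning LiDAR reports thousands of returns per revolution with fixed per-ring elevation angles:

```python
import numpy as np

# One return per beam: range (m), azimuth (rad), elevation (rad)
ranges = np.array([10.0, 10.0, 5.0, 20.0])
azimuth = np.deg2rad([0.0, 90.0, 180.0, 270.0])
elevation = np.deg2rad([-2.0, 0.0, 2.0, 0.0])  # per-ring beam angles

# Spherical -> Cartesian (x forward, y left, z up convention)
x = ranges * np.cos(elevation) * np.cos(azimuth)
y = ranges * np.cos(elevation) * np.sin(azimuth)
z = ranges * np.sin(elevation)

points = np.stack([x, y, z], axis=-1)  # (N, 3) point cloud
```

Since the distance of each point from the sensor equals the measured range exactly, this is metric depth by construction, in contrast to the estimated depth of the camera-based methods above.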
Applications
- Drone obstacle avoidance: estimate depth to prevent collisions. Monocular depth works for small drones where stereo/LiDAR is too heavy
- 3D mapping: reconstruct environments from drone or handheld cameras (SfM → point cloud → mesh)
- Autonomous navigation: understand 3D scene to plan safe paths
- Terrain analysis: process drone/satellite imagery for elevation mapping
- Damage assessment: compare 3D reconstructions before and after events
Self-test questions
- Why is monocular depth estimation fundamentally ambiguous, and how do neural networks handle this?
- What is the relationship between disparity and depth in stereo vision?
- What is the difference between relative and metric depth estimation?
- Why is bundle adjustment necessary in SfM, and what does it optimize?
- When would you choose LiDAR over camera-based depth estimation?
Exercises
- MiDaS depth: Generate depth maps for 5 different images (indoor, outdoor, aerial) using MiDaS or Depth Anything. Qualitatively assess where the model succeeds and fails.
- Point cloud from depth: Given a depth map and RGB image, create a colored point cloud using the `depth_to_pointcloud` function above. Visualize in Open3D.
- SfM exploration: Take 20+ photos of a small object from different angles. Run COLMAP to reconstruct it in 3D. Inspect the output point cloud.
Links
- Image Segmentation — depth helps disambiguate overlapping objects
- Convolutional Neural Networks — backbone for depth estimation networks
- Optical Flow — motion and depth are related through ego-motion
- Tutorial - Visual SLAM Concepts — real-time 3D mapping
- Tutorial - Aerial Image Analysis — aerial depth and mapping