3D Vision and Depth

What

Recover 3D structure from 2D images. The world is 3D; cameras capture 2D projections. Depth estimation, stereo vision, and 3D reconstruction reverse that projection to get back the 3D geometry.

This is fundamental for anything that needs to understand physical space: drone obstacle avoidance, autonomous navigation, 3D mapping, augmented reality.

Depth representations

Representation  Format                                  Source
Depth map       2D image where pixel value = distance   Depth sensor, stereo, monocular estimation
Point cloud     Set of 3D points (x, y, z) [+ color]    LiDAR, depth camera, SfM
Mesh            Triangulated surface from points        3D reconstruction
Voxel grid      3D pixel grid (occupied/empty)          Volumetric reconstruction

Stereo vision

Two cameras, known baseline (distance between them). Triangulate depth from the disparity between left and right views.

How it works

  1. Rectify images: align them so corresponding points are on the same horizontal line
  2. Match pixels: for each pixel in the left image, find the corresponding pixel in the right image
  3. Compute disparity: d = x_left - x_right (horizontal offset)
  4. Convert to depth: depth = (focal_length * baseline) / disparity

Closer objects have larger disparity (they shift more between views).
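A quick numeric check of step 4, using illustrative KITTI-style intrinsics (the exact focal length and baseline depend on your rig):

```python
# Depth from disparity: depth = focal_length * baseline / disparity.
# Illustrative KITTI-style numbers: focal length ~721.5 px, baseline 0.54 m.
focal_length = 721.5  # pixels
baseline = 0.54       # meters

def disparity_to_depth(disparity_px):
    """Convert a disparity (in pixels) to a depth (in meters)."""
    return focal_length * baseline / disparity_px

# Larger disparity -> closer object:
near = disparity_to_depth(64.0)  # ~6.1 m
far = disparity_to_depth(8.0)    # ~48.7 m
```

Note the inverse relationship: depth resolution degrades quadratically with distance, which is why stereo rigs struggle with far-away objects.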

import cv2
import numpy as np
 
# Load stereo pair
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
 
# Semi-Global Block Matching (SGBM) -- best OpenCV stereo method
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,    # must be divisible by 16
    blockSize=5,
    P1=8 * 1 * 5**2,      # smoothness penalty: 8 * channels * blockSize^2 (grayscale -> 1 channel)
    P2=32 * 1 * 5**2,     # larger penalty for bigger disparity jumps
    disp12MaxDiff=1,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=32,
)
 
disparity = stereo.compute(left, right).astype(np.float32) / 16.0
 
# Visualize
disp_normalized = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("disparity.png", disp_normalized.astype(np.uint8))
 
# Convert to depth (if you know camera parameters)
# focal_length = 721.5  # pixels
# baseline = 0.54       # meters
# depth = focal_length * baseline / (disparity + 1e-6)

Monocular depth estimation

Estimate depth from a single image. This is inherently ambiguous (a small nearby object looks the same as a large far object), but deep learning models learn strong priors from massive training data.

MiDaS and Depth Anything

MiDaS (Intel) and Depth Anything (TikTok/ByteDance) are the leading zero-shot monocular depth models. They produce relative depth (a per-pixel depth ordering with no absolute scale) rather than metric depth.

import torch
import cv2
import numpy as np
 
# Using Depth Anything (v1) via torch hub
model = torch.hub.load("LiheYoung/Depth-Anything", "depth_anything_vitl14", pretrained=True)
model.eval()
 
image = cv2.imread("scene.jpg")
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
 
# Preprocess (resize + normalize)
from torchvision.transforms import Compose, Resize, Normalize, ToTensor
transform = Compose([
    ToTensor(),          # HWC uint8 numpy -> CHW float tensor in [0, 1]
    Resize((518, 518)),  # Resize operates on tensors/PIL images, so it must come after ToTensor
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
 
input_tensor = transform(image_rgb).unsqueeze(0)
 
with torch.no_grad():
    depth = model(input_tensor)  # relative depth map
 
# Resize depth to original image size
depth_np = depth.squeeze().cpu().numpy()
depth_resized = cv2.resize(depth_np, (image.shape[1], image.shape[0]))
 
# Normalize for visualization
depth_vis = cv2.normalize(depth_resized, None, 0, 255, cv2.NORM_MINMAX)
depth_colored = cv2.applyColorMap(depth_vis.astype(np.uint8), cv2.COLORMAP_INFERNO)
cv2.imwrite("depth_output.png", depth_colored)

MiDaS via transformers

import torch
from transformers import DPTForDepthEstimation, DPTImageProcessor
from PIL import Image
 
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
model.eval()
 
image = Image.open("scene.jpg")
inputs = processor(images=image, return_tensors="pt")
 
with torch.no_grad():
    outputs = model(**inputs)
    depth = outputs.predicted_depth  # (1, H, W) relative depth
 
# Higher values = closer (MiDaS convention -- inverse depth)

Relative vs metric depth

Type      What you get                     Models
Relative  Depth ordering (closer/farther)  MiDaS, Depth Anything
Metric    Actual distances in meters       ZoeDepth, Metric3D, UniDepth

Relative depth is enough for most vision tasks (segmentation, occlusion reasoning). Metric depth is needed for robotics and navigation.
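If you have a few sparse metric measurements (e.g., from LiDAR or a rangefinder), you can convert a relative map to metric by fitting a per-image scale and shift. A minimal least-squares sketch, assuming the relative values and the metric samples are already in pixel correspondence and in the same (direct or inverse) depth parameterization:

```python
import numpy as np

def align_scale_shift(relative, metric):
    """Fit metric ~= s * relative + t by least squares.

    relative: 1D array of relative depth values at sparse pixels.
    metric:   1D array of metric ground-truth depths at the same pixels.
    Returns (s, t), which can then be applied to the whole relative map.
    """
    A = np.stack([relative, np.ones_like(relative)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, metric, rcond=None)
    return s, t

# Toy example: relative depths off by scale 2 and shift 1
rel = np.array([0.5, 1.0, 2.0, 3.0])
met = 2.0 * rel + 1.0
s, t = align_scale_shift(rel, met)  # s ~ 2.0, t ~ 1.0
```

This is the same scale-and-shift alignment used when evaluating models like MiDaS against metric benchmarks.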

Point clouds

A point cloud is a set of 3D points, each with (x, y, z) and optionally (r, g, b) color. This is the raw output of LiDAR sensors and the intermediate representation in most 3D reconstruction pipelines.

Creating a point cloud from depth + RGB

import numpy as np
import open3d as o3d
 
def depth_to_pointcloud(rgb, depth, fx, fy, cx, cy):
    """Convert RGB image + depth map to colored point cloud.
    fx, fy: focal lengths in pixels.
    cx, cy: principal point (usually image center).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
 
    # Back-project pixels to 3D
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
 
    # Stack into (N, 3) point array
    valid = z > 0  # ignore invalid depth
    points = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    colors = rgb[valid].astype(np.float64) / 255.0  # normalize to [0,1]
 
    # Create Open3D point cloud
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.colors = o3d.utility.Vector3dVector(colors)
 
    return pcd
 
# Example: create from depth camera output
# rgb = cv2.imread("color.png")[:, :, ::-1]  # BGR to RGB
# depth = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)  # uint16, millimeters
# depth_m = depth.astype(np.float64) / 1000.0  # convert to meters
# pcd = depth_to_pointcloud(rgb, depth_m, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
# o3d.visualization.draw_geometries([pcd])
# o3d.io.write_point_cloud("output.ply", pcd)

Point cloud processing with Open3D

import open3d as o3d
 
pcd = o3d.io.read_point_cloud("scene.ply")
 
# Downsample (reduce number of points, faster processing)
pcd_down = pcd.voxel_down_sample(voxel_size=0.05)
 
# Remove outliers
pcd_clean, _ = pcd_down.remove_statistical_outlier(
    nb_neighbors=20, std_ratio=2.0
)
 
# Estimate normals (needed for mesh reconstruction)
pcd_clean.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30)
)
 
# Plane segmentation (e.g., find ground plane)
plane_model, inliers = pcd_clean.segment_plane(
    distance_threshold=0.02, ransac_n=3, num_iterations=1000
)
a, b, c, d = plane_model
print(f"Ground plane: {a:.3f}x + {b:.3f}y + {c:.3f}z + {d:.3f} = 0")
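The returned plane model is directly useful, e.g., height-above-ground for obstacle filtering. A numpy-only sketch of the signed point-to-plane distance, using an assumed z = 0 ground plane for illustration:

```python
import numpy as np

def point_plane_distance(points, plane):
    """Signed distance of (N, 3) points to the plane ax + by + cz + d = 0."""
    a, b, c, d = plane
    normal = np.array([a, b, c])
    return (points @ normal + d) / np.linalg.norm(normal)

# Assumed ground plane z = 0 -> coefficients (0, 0, 1, 0)
pts = np.array([[0.0, 0.0, 0.5],
                [1.0, 2.0, 0.0],
                [0.0, 0.0, -0.3]])
heights = point_plane_distance(pts, (0.0, 0.0, 1.0, 0.0))  # [0.5, 0.0, -0.3]
```

Within Open3D itself, `pcd_clean.select_by_index(inliers)` and `pcd_clean.select_by_index(inliers, invert=True)` split the cloud into the ground plane and everything else.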

Structure from Motion (SfM)

Reconstruct 3D structure from multiple 2D photos taken from different viewpoints. Also recovers camera poses.

Pipeline:

  1. Detect features (SIFT, SuperPoint) in all images
  2. Match features across image pairs
  3. Estimate camera poses (essential/fundamental matrix)
  4. Triangulate 3D points from matched features
  5. Bundle adjustment: jointly optimize camera poses and 3D points to minimize reprojection error

Tools: COLMAP (gold standard), OpenSfM, Meshroom.

SfM is the foundation of visual SLAM, covered in the Visual SLAM Concepts tutorial.

LiDAR basics

LiDAR (Light Detection And Ranging) directly measures distances by sending laser pulses and timing the return.

  • Spinning LiDAR (Velodyne, Ouster): 360-degree scans, used in autonomous vehicles
  • Solid-state LiDAR (Livox): no moving parts, cheaper, limited FOV
  • Flash LiDAR: illuminates entire scene at once, like a depth camera

Output: point cloud with (x, y, z, intensity, timestamp, ring).

LiDAR gives metric depth directly — no estimation required. But it’s expensive, heavy, and produces sparse point clouds compared to cameras.
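Raw LiDAR returns are ranges plus beam angles; converting them to Cartesian points is a spherical-to-Cartesian transform. A sketch under an assumed convention (x forward, y left, z up; azimuth about z, elevation from the xy-plane — check your sensor's datasheet):

```python
import numpy as np

def lidar_to_xyz(ranges, azimuth, elevation):
    """Convert ranges (m) and beam angles (rad) to (N, 3) Cartesian points.

    Assumed frame: x forward, y left, z up; azimuth rotates about z,
    elevation is measured up from the xy-plane.
    """
    x = ranges * np.cos(elevation) * np.cos(azimuth)
    y = ranges * np.cos(elevation) * np.sin(azimuth)
    z = ranges * np.sin(elevation)
    return np.stack([x, y, z], axis=-1)

# One return straight ahead at 10 m, one 90 degrees to the left at 5 m
pts = lidar_to_xyz(np.array([10.0, 5.0]),
                   np.array([0.0, np.pi / 2]),
                   np.array([0.0, 0.0]))
# pts ~ [[10, 0, 0], [0, 5, 0]]
```

Driver stacks (e.g., ROS) usually do this conversion for you, but it matters when parsing raw packets or simulating a sensor.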

Applications

  • Drone obstacle avoidance: estimate depth to prevent collisions. Monocular depth works for small drones where stereo/LiDAR is too heavy
  • 3D mapping: reconstruct environments from drone or handheld cameras (SfM → point cloud → mesh)
  • Autonomous navigation: understand 3D scene to plan safe paths
  • Terrain analysis: process drone/satellite imagery for elevation mapping
  • Damage assessment: compare 3D reconstructions before and after events

Self-test questions

  1. Why is monocular depth estimation fundamentally ambiguous, and how do neural networks handle this?
  2. What is the relationship between disparity and depth in stereo vision?
  3. What is the difference between relative and metric depth estimation?
  4. Why is bundle adjustment necessary in SfM, and what does it optimize?
  5. When would you choose LiDAR over camera-based depth estimation?

Exercises

  1. MiDaS depth: Generate depth maps for 5 different images (indoor, outdoor, aerial) using MiDaS or Depth Anything. Qualitatively assess where the model succeeds and fails.
  2. Point cloud from depth: Given a depth map and RGB image, create a colored point cloud using the depth_to_pointcloud function above. Visualize in Open3D.
  3. SfM exploration: Take 20+ photos of a small object from different angles. Run COLMAP to reconstruct it in 3D. Inspect the output point cloud.