Tutorial - Sim-to-Real Transfer

Goal

By the end of this tutorial, you will understand why sim-to-real transfer is hard and how to bridge the gap. You will train a policy in simulation, observe it fail when the simulation changes, and apply domain randomization to make it robust.

Prerequisites: Actor-Critic and PPO, Model-Based RL, basic Gymnasium.

Time: 60-90 minutes.

The reality gap

RL agents train in simulation (fast, safe, cheap). Real-world deployment requires the policy to work on real hardware (slow, dangerous, expensive). The gap between simulated and real dynamics is the reality gap.

Simulation                    Real world
  Perfect physics              Friction varies
  No sensor noise              Noisy sensors
  Instant actions              Action delay/latency
  Exact state                  Partial observability
  Infinite resets              Can't reset (easily)

A policy that works perfectly in simulation can fail completely on a real drone because:

  • Motor dynamics are slightly different
  • Aerodynamic effects not modeled (turbulence, ground effect)
  • Sensor readings are noisy and delayed
  • Weight distribution doesn’t match the simulated model
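One of these effects, actuation latency, is easy to demonstrate without any RL machinery. The toy below (a sketch, not part of the tutorial's CartPole code) runs the same hand-tuned PD controller on a 1-D point mass, once with instant actions and once with each action applied a few steps late. The gains and delay length are illustrative values chosen so the effect is visible.

```python
import numpy as np

def simulate(gain, delay_steps, steps=300, dt=0.05):
    """Roll out a PD controller on a 1-D point mass, optionally applying
    each action `delay_steps` later to model actuation latency."""
    x, v = 1.0, 0.0
    queue = [0.0] * delay_steps  # pending actions not yet applied
    for _ in range(steps):
        force = -gain * x - gain * v   # PD feedback toward x = 0
        queue.append(force)
        applied = queue.pop(0)         # the force that actually reaches the motor
        v += applied * dt
        x += v * dt
    return abs(x) + abs(v)             # final distance from rest at the target

no_delay = simulate(gain=8.0, delay_steps=0)   # "simulation": instant actions
with_delay = simulate(gain=8.0, delay_steps=6) # "reality": ~300 ms of latency
print(f"final error without delay: {no_delay:.3g}")
print(f"final error with delay:    {with_delay:.3g}")
```

The controller is stable with instant actions but degrades badly once its feedback arrives late, which is exactly what happens to a policy trained with zero-latency assumptions and deployed on hardware with processing delays.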

Step 1: Train agent in standard environment

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Categorical
import matplotlib.pyplot as plt
 
# Train on standard CartPole
# (We'll use CartPole as our "simulation" and modified CartPole as "reality")
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
 
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0, n_steps=1024)
model.learn(total_timesteps=50_000)
 
# Evaluate in training environment (simulation)
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=50)
print(f"Performance in training env: {mean_reward:.1f} +/- {std_reward:.1f}")
# Should be ~500 (perfect score)

Step 2: Create modified environment (the “real world”)

Modify the physics to simulate a different-than-expected reality.

import gymnasium as gym
from gymnasium.envs.classic_control.cartpole import CartPoleEnv
 
class ModifiedCartPole(CartPoleEnv):
    """CartPole with modified physics to simulate reality gap.
    Changes: different gravity, mass, friction, observation noise.
    """
    def __init__(self, gravity=12.0, masscart=1.5, masspole=0.2,
                 obs_noise_std=0.05, force_noise_std=2.0, **kwargs):
        super().__init__(**kwargs)
        self.gravity = gravity            # default: 9.8
        self.masscart = masscart          # default: 1.0
        self.masspole = masspole          # default: 0.1
        self.total_mass = masscart + masspole
        self.polemass_length = masspole * self.length
        self.obs_noise_std = obs_noise_std
        self.force_noise_std = force_noise_std
 
    def step(self, action):
        # Perturb the applied force to simulate actuator noise
        # (force_mag defaults to 10.0 in CartPoleEnv)
        if self.force_noise_std > 0:
            self.force_mag = 10.0 + np.random.normal(0, self.force_noise_std)

        obs, reward, terminated, truncated, info = super().step(action)

        # Add observation noise
        if self.obs_noise_std > 0:
            obs = obs + np.random.normal(0, self.obs_noise_std, obs.shape)

        return obs, reward, terminated, truncated, info
 
# Register modified environment
gym.register(
    id="CartPole-Real-v0",
    entry_point=lambda: ModifiedCartPole(
        gravity=12.0, masscart=1.5, masspole=0.2,
        obs_noise_std=0.05, force_noise_std=2.0,
    ),
)

Step 3: Test policy in modified environment (the gap)

# Test the simulation-trained policy in "reality"
real_env = ModifiedCartPole(
    gravity=12.0, masscart=1.5, masspole=0.2,
    obs_noise_std=0.05,
)
 
# Manual evaluation (stable-baselines3 evaluate_policy works too)
def evaluate_in_env(model, env, n_episodes=50):
    rewards = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        total_reward = 0
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
        rewards.append(total_reward)
    return np.mean(rewards), np.std(rewards)
 
sim_perf, sim_std = evaluate_in_env(model, env)
real_perf, real_std = evaluate_in_env(model, real_env)
 
print(f"Simulation performance: {sim_perf:.1f} +/- {sim_std:.1f}")
print(f"'Real world' performance: {real_perf:.1f} +/- {real_std:.1f}")
print(f"Reality gap: {sim_perf - real_perf:.1f} reward drop")
# Expect significant performance drop

What just happened: The policy was trained in a world with gravity=9.8, mass=1.0, no noise. When deployed to a world with gravity=12, heavier cart, and noisy observations, it fails. This is the reality gap. The policy learned to control a specific system, not a class of systems.

Step 4: Domain randomization

Train with randomized physics. Each episode gets different parameters. The policy must learn to be robust across the full range.

class DomainRandomizedCartPole(CartPoleEnv):
    """CartPole with randomized physics each reset."""
 
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.obs_noise_std = 0.0
 
    def reset(self, **kwargs):
        # Randomize physics at each episode start
        self.gravity = np.random.uniform(8.0, 14.0)     # 9.8 +/- wide range
        self.masscart = np.random.uniform(0.5, 2.0)      # 1.0 +/- range
        self.masspole = np.random.uniform(0.05, 0.3)     # 0.1 +/- range
        self.total_mass = self.masscart + self.masspole
        self.polemass_length = self.masspole * self.length
        self.obs_noise_std = np.random.uniform(0.0, 0.1)
 
        return super().reset(**kwargs)
 
    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)
        if self.obs_noise_std > 0:
            obs = obs + np.random.normal(0, self.obs_noise_std, obs.shape)
        return obs, reward, terminated, truncated, info
 
# Train with domain randomization
dr_env = DomainRandomizedCartPole()
dr_model = PPO("MlpPolicy", dr_env, verbose=0, n_steps=1024)
dr_model.learn(total_timesteps=100_000)  # more timesteps needed (harder task)
 
# Evaluate
dr_sim_perf, _ = evaluate_in_env(dr_model, env)
dr_real_perf, _ = evaluate_in_env(dr_model, real_env)
 
print(f"\nDomain-randomized policy:")
print(f"  Simulation: {dr_sim_perf:.1f}")
print(f"  'Real world': {dr_real_perf:.1f}")
print(f"\nOriginal policy:")
print(f"  Simulation: {sim_perf:.1f}")
print(f"  'Real world': {real_perf:.1f}")

What just happened: Domain randomization forces the policy to handle a wide range of dynamics. It may perform slightly worse in the nominal simulation (because it can’t specialize) but much better in the modified “real” environment. This is the core of sim-to-real transfer.

Step 5: Test robustness across many variations

def robustness_test(model, n_variations=20, n_episodes_per=10):
    """Test policy across many physics variations."""
    results = []
 
    for i in range(n_variations):
        test_env = ModifiedCartPole(
            gravity=np.random.uniform(7.0, 15.0),
            masscart=np.random.uniform(0.3, 2.5),
            masspole=np.random.uniform(0.03, 0.4),
            obs_noise_std=np.random.uniform(0.0, 0.15),
        )
        mean_r, _ = evaluate_in_env(model, test_env, n_episodes_per)
        results.append(mean_r)
 
    return results
 
standard_results = robustness_test(model)
dr_results = robustness_test(dr_model)
 
plt.figure(figsize=(10, 5))
plt.boxplot([standard_results, dr_results],
            labels=["Standard training", "Domain randomized"])
plt.ylabel("Mean episode reward")
plt.title("Robustness across physics variations")
plt.grid(True, axis="y")
plt.savefig("sim2real_robustness.png", dpi=150)
plt.show()
 
print(f"Standard: mean={np.mean(standard_results):.1f}, "
      f"min={np.min(standard_results):.1f}")
print(f"Domain-randomized: mean={np.mean(dr_results):.1f}, "
      f"min={np.min(dr_results):.1f}")

Step 6: System identification

Instead of randomizing everything, try to make the simulator match reality as closely as possible.

def system_identification(real_env, sim_env_class, n_samples=100):
    """Estimate real-world parameters by observing real transitions.
    Collect data from real environment, fit simulator parameters.
 
    In practice: record sensor data from real hardware,
    optimize sim parameters to minimize prediction error.
    """
    # Collect real data
    real_data = []
    env = real_env
    obs, _ = env.reset()
    for _ in range(n_samples):
        action = env.action_space.sample()
        next_obs, _, terminated, truncated, _ = env.step(action)
        real_data.append((obs, action, next_obs))
        if terminated or truncated:
            obs, _ = env.reset()
        else:
            obs = next_obs
 
    # Grid search over sim parameters (in practice: Bayesian optimization)
    best_error = float("inf")
    best_params = {}
 
    for gravity in np.linspace(8, 14, 10):
        for mass in np.linspace(0.5, 2.5, 10):
            sim_env = sim_env_class(gravity=gravity, masscart=mass)
 
            # Compute prediction error
            error = 0
            for obs, action, real_next in real_data:
                sim_env.state = obs
                sim_next, _, _, _, _ = sim_env.step(action)
                error += np.sum((sim_next - real_next) ** 2)
 
            if error < best_error:
                best_error = error
                best_params = {"gravity": gravity, "masscart": mass}
 
    print(f"Best sim parameters: {best_params} (error: {best_error:.4f})")
    return best_params

Step 7: Progressive transfer

Train on increasingly realistic simulations:

def progressive_transfer(base_model, stages, steps_per_stage=30_000):
    """Fine-tune through progressively more realistic simulators."""
    model = base_model
 
    for i, params in enumerate(stages):
        print(f"\nStage {i+1}: {params}")
        stage_env = ModifiedCartPole(**params)
 
        # Continue training from previous model
        model.set_env(stage_env)
        model.learn(total_timesteps=steps_per_stage, reset_num_timesteps=False)
 
        # Evaluate
        perf, _ = evaluate_in_env(model, stage_env, n_episodes=20)
        print(f"  Performance: {perf:.1f}")
 
    return model
 
# Define progressive stages toward reality
stages = [
    {"gravity": 10.0, "masscart": 1.1, "obs_noise_std": 0.01},   # slight change
    {"gravity": 11.0, "masscart": 1.3, "obs_noise_std": 0.03},   # more change
    {"gravity": 12.0, "masscart": 1.5, "obs_noise_std": 0.05},   # target "reality"
]
 
# progressive_model = progressive_transfer(dr_model, stages)

Real-world examples

OpenAI Rubik’s Cube (2019)

Trained a dexterous robot hand to solve a Rubik’s cube. Key techniques:

  • Massive domain randomization (cube size, friction, mass, visual appearance)
  • Automatic domain randomization (ADR): gradually increase randomization range based on policy performance
  • Trained entirely in simulation, transferred zero-shot to real robot
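The range-update logic behind ADR can be sketched in a few lines. This is an illustrative simplification (the thresholds and widening step are invented here; OpenAI's ADR adjusts each boundary separately based on performance measured at that boundary):

```python
def adr_update(low, high, success_rate, widen=0.5, threshold=0.8):
    """One ADR-style update: widen the randomization range when the
    policy succeeds often enough, otherwise leave it unchanged."""
    if success_rate >= threshold:
        low = low - widen
        high = high + widen
    return low, high

# Example: a gravity range starts narrow and grows while the policy copes
low, high = 9.3, 10.3
for success in [0.95, 0.9, 0.6, 0.85]:
    low, high = adr_update(low, high, success)
    print(f"gravity range: [{low:.1f}, {high:.1f}]")
```

The curriculum effect comes for free: the policy always trains near the edge of what it can handle, and the range only grows when it is ready.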

Drone navigation

The standard sim-to-real pipeline for autonomous drones:

  1. Train in simulator (AirSim, Gazebo, Isaac Gym)
  2. Domain randomize: wind, turbulence, payload mass, motor response, sensor noise
  3. System identification: measure real drone parameters (motor curves, IMU bias)
  4. Fine-tune in high-fidelity simulation tuned to real drone
  5. Deploy with safety constraints (attitude limits, emergency stop)
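Step 2 of that pipeline often amounts to a table of randomization ranges sampled once per episode. A minimal sketch, where the parameter names and bounds are hypothetical and not taken from any real drone simulator:

```python
import numpy as np

# Illustrative per-episode randomization ranges (hypothetical values)
DRONE_RANDOMIZATION = {
    "wind_speed_ms":   (0.0, 8.0),
    "payload_mass_kg": (0.0, 0.3),
    "motor_lag_ms":    (5.0, 40.0),
    "imu_noise_std":   (0.0, 0.05),
}

def sample_episode_params(rng=None):
    """Draw one set of physics parameters for the next training episode."""
    if rng is None:
        rng = np.random.default_rng()
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in DRONE_RANDOMIZATION.items()}

params = sample_episode_params()
print(params)
```

The simulator would be reconfigured with these values at each reset, exactly as `DomainRandomizedCartPole` does for gravity and mass.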

This is the main bottleneck for autonomous FPV drones. The sim-to-real gap is significant because:

  • Aerodynamics are chaotic (turbulence near obstacles)
  • Motor dynamics are nonlinear
  • Visual appearance differs (lighting, texture)
  • Latency varies (real hardware has processing delays)

Boston Dynamics

Their robots train fundamental locomotion skills in simulation with domain randomization, then fine-tune on hardware with conservative safety margins.

Summary of approaches

Approach                Idea                                     Effort                  Robustness
Domain randomization    Randomize everything during training     Low                     Good if range covers reality
System identification   Tune simulator to match real system      High (needs real data)  Best for known system
Progressive transfer    Gradual adaptation toward reality        Medium                  Good, controlled
Fine-tuning on real     Train briefly on real hardware           Needs real access       Best, but expensive
Meta-learning           Learn to adapt quickly to new dynamics   High                    Good for new environments

In practice, combine them: system identification to get a reasonable sim, domain randomization for robustness, progressive transfer if you have real access.
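One simple way to combine the first two: center the randomization ranges on the system-identified values instead of on the textbook defaults. The 20% margin below is an illustrative choice meant to cover residual model error, and the example values are hypothetical:

```python
def ranges_around(identified, rel_margin=0.2):
    """Build domain-randomization ranges centered on system-identified
    parameter values, with a relative margin for residual model error."""
    return {name: (value * (1 - rel_margin), value * (1 + rel_margin))
            for name, value in identified.items()}

# e.g. with hypothetical values returned by a Step 6-style grid search
identified = {"gravity": 11.3, "masscart": 1.6}
print(ranges_around(identified))
```

Training with these narrower, better-centered ranges keeps the robustness benefit of randomization without forcing the policy to handle physics far from the real system.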

Self-test questions

  1. Why does a policy trained in simulation typically fail when deployed to real hardware?
  2. What is domain randomization, and why does it improve transfer?
  3. When would system identification be preferred over domain randomization?
  4. Why is progressive transfer useful when the target domain is known?
  5. What makes drone sim-to-real particularly challenging compared to manipulation?

Exercises

  1. Quantify the gap: Train on standard CartPole, test on 50 random variations. Plot reward vs gravity deviation. At what gravity deviation does the policy fail?
  2. Adaptive randomization: Start with narrow randomization ranges. If the agent succeeds > 80% of the time, widen the ranges. Implement this adaptive schedule and compare with fixed randomization.
  3. Visual domain randomization: If you have access to a simulator with visual output (e.g., Gymnasium with render), randomize visual properties (background color, object texture). Test if a vision-based policy transfers across visual variations.