Tutorial - Sim-to-Real Transfer
Goal
After this tutorial, you will understand why sim-to-real transfer is hard and how to bridge the gap. You will train a policy in simulation, observe it fail when the simulation changes, and apply domain randomization to make it robust.
Prerequisites: Actor-Critic and PPO, Model-Based RL, basic Gymnasium.
Time: 60-90 minutes.
The reality gap
RL agents train in simulation (fast, safe, cheap). Real-world deployment requires the policy to work on real hardware (slow, dangerous, expensive). The gap between simulated and real dynamics is the reality gap.
| Simulation | Real world |
|---|---|
| Perfect physics | Friction varies |
| No sensor noise | Noisy sensors |
| Instant actions | Action delay/latency |
| Exact state | Partial observability |
| Infinite resets | Can't reset (easily) |
A policy that works perfectly in simulation can fail completely on a real drone because:
- Motor dynamics are slightly different
- Aerodynamic effects not modeled (turbulence, ground effect)
- Sensor readings are noisy and delayed
- Weight distribution doesn’t match the simulated model
Step 1: Train agent in standard environment
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Categorical
import matplotlib.pyplot as plt
# Train on standard CartPole
# (We'll use CartPole as our "simulation" and modified CartPole as "reality")
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0, n_steps=1024)
model.learn(total_timesteps=50_000)
# Evaluate in training environment (simulation)
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=50)
print(f"Performance in training env: {mean_reward:.1f} +/- {std_reward:.1f}")
# Should be ~500 (perfect score)
Step 2: Create modified environment (the “real world”)
Modify the physics to simulate a different-than-expected reality.
import gymnasium as gym
from gymnasium.envs.classic_control.cartpole import CartPoleEnv
class ModifiedCartPole(CartPoleEnv):
"""CartPole with modified physics to simulate reality gap.
Changes: different gravity, mass, friction, observation noise.
"""
def __init__(self, gravity=12.0, masscart=1.5, masspole=0.2,
obs_noise_std=0.05, force_noise_std=2.0, **kwargs):
super().__init__(**kwargs)
self.gravity = gravity # default: 9.8
self.masscart = masscart # default: 1.0
self.masspole = masspole # default: 0.1
self.total_mass = masscart + masspole
self.polemass_length = masspole * self.length
self.obs_noise_std = obs_noise_std
self.force_noise_std = force_noise_std
    def step(self, action):
        # Perturb the applied force to simulate actuator noise
        # (10.0 is the default CartPole force_mag)
        if self.force_noise_std > 0:
            self.force_mag = 10.0 + np.random.normal(0, self.force_noise_std)
        obs, reward, terminated, truncated, info = super().step(action)
        # Add observation noise
        if self.obs_noise_std > 0:
            obs = obs + np.random.normal(0, self.obs_noise_std, obs.shape)
        return obs, reward, terminated, truncated, info
# Register modified environment
gym.register(
id="CartPole-Real-v0",
entry_point=lambda: ModifiedCartPole(
gravity=12.0, masscart=1.5, masspole=0.2,
obs_noise_std=0.05, force_noise_std=2.0,
),
)
Step 3: Test policy in modified environment (the gap)
# Test the simulation-trained policy in "reality"
real_env = ModifiedCartPole(
gravity=12.0, masscart=1.5, masspole=0.2,
obs_noise_std=0.05,
)
# Manual evaluation (stable-baselines3 evaluate_policy works too)
def evaluate_in_env(model, env, n_episodes=50):
rewards = []
for _ in range(n_episodes):
obs, _ = env.reset()
total_reward = 0
done = False
while not done:
action, _ = model.predict(obs, deterministic=True)
obs, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
total_reward += reward
rewards.append(total_reward)
return np.mean(rewards), np.std(rewards)
sim_perf, sim_std = evaluate_in_env(model, env)
real_perf, real_std = evaluate_in_env(model, real_env)
print(f"Simulation performance: {sim_perf:.1f} +/- {sim_std:.1f}")
print(f"'Real world' performance: {real_perf:.1f} +/- {real_std:.1f}")
print(f"Reality gap: {sim_perf - real_perf:.1f} reward drop")
# Expect a significant performance drop
What just happened: The policy was trained in a world with gravity=9.8, mass=1.0, and no noise. When deployed to a world with gravity=12.0, a heavier cart, and noisy observations, it fails. This is the reality gap. The policy learned to control a specific system, not a class of systems.
Step 4: Domain randomization
Train with randomized physics. Each episode gets different parameters. The policy must learn to be robust across the full range.
class DomainRandomizedCartPole(CartPoleEnv):
"""CartPole with randomized physics each reset."""
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.obs_noise_std = 0.0
def reset(self, **kwargs):
# Randomize physics at each episode start
self.gravity = np.random.uniform(8.0, 14.0) # 9.8 +/- wide range
self.masscart = np.random.uniform(0.5, 2.0) # 1.0 +/- range
self.masspole = np.random.uniform(0.05, 0.3) # 0.1 +/- range
self.total_mass = self.masscart + self.masspole
self.polemass_length = self.masspole * self.length
self.obs_noise_std = np.random.uniform(0.0, 0.1)
return super().reset(**kwargs)
def step(self, action):
obs, reward, terminated, truncated, info = super().step(action)
if self.obs_noise_std > 0:
obs = obs + np.random.normal(0, self.obs_noise_std, obs.shape)
return obs, reward, terminated, truncated, info
# Train with domain randomization
dr_env = DomainRandomizedCartPole()
dr_model = PPO("MlpPolicy", dr_env, verbose=0, n_steps=1024)
dr_model.learn(total_timesteps=100_000) # more timesteps needed (harder task)
# Evaluate
dr_sim_perf, _ = evaluate_in_env(dr_model, env)
dr_real_perf, _ = evaluate_in_env(dr_model, real_env)
print(f"\nDomain-randomized policy:")
print(f" Simulation: {dr_sim_perf:.1f}")
print(f" 'Real world': {dr_real_perf:.1f}")
print(f"\nOriginal policy:")
print(f" Simulation: {sim_perf:.1f}")
print(f" 'Real world': {real_perf:.1f}")
What just happened: Domain randomization forces the policy to handle a wide range of dynamics. It may perform slightly worse in the nominal simulation (because it can’t specialize) but much better in the modified “real” environment. This is the core of sim-to-real transfer.
Step 5: Test robustness across many variations
def robustness_test(model, n_variations=20, n_episodes_per=10):
"""Test policy across many physics variations."""
results = []
for i in range(n_variations):
test_env = ModifiedCartPole(
gravity=np.random.uniform(7.0, 15.0),
masscart=np.random.uniform(0.3, 2.5),
masspole=np.random.uniform(0.03, 0.4),
obs_noise_std=np.random.uniform(0.0, 0.15),
)
mean_r, _ = evaluate_in_env(model, test_env, n_episodes_per)
results.append(mean_r)
return results
standard_results = robustness_test(model)
dr_results = robustness_test(dr_model)
plt.figure(figsize=(10, 5))
plt.boxplot([standard_results, dr_results],
labels=["Standard training", "Domain randomized"])
plt.ylabel("Mean episode reward")
plt.title("Robustness across physics variations")
plt.grid(True, axis="y")
plt.savefig("sim2real_robustness.png", dpi=150)
plt.show()
print(f"Standard: mean={np.mean(standard_results):.1f}, "
f"min={np.min(standard_results):.1f}")
print(f"Domain-randomized: mean={np.mean(dr_results):.1f}, "
f"min={np.min(dr_results):.1f}")
Step 6: System identification
Instead of randomizing everything, try to make the simulator match reality as closely as possible.
def system_identification(real_env, sim_env_class, n_samples=100):
"""Estimate real-world parameters by observing real transitions.
Collect data from real environment, fit simulator parameters.
In practice: record sensor data from real hardware,
optimize sim parameters to minimize prediction error.
"""
# Collect real data
real_data = []
env = real_env
obs, _ = env.reset()
    for _ in range(n_samples):
        action = env.action_space.sample()
        # Gymnasium returns separate terminated/truncated flags
        next_obs, _, terminated, truncated, _ = env.step(action)
        real_data.append((obs, action, next_obs))
        if terminated or truncated:
            obs, _ = env.reset()
        else:
            obs = next_obs
# Grid search over sim parameters (in practice: Bayesian optimization)
best_error = float("inf")
best_params = {}
    for gravity in np.linspace(8, 14, 10):
        for mass in np.linspace(0.5, 2.5, 10):
            # Noise-free simulator: compare raw dynamics predictions
            sim_env = sim_env_class(gravity=gravity, masscart=mass,
                                    obs_noise_std=0.0, force_noise_std=0.0)
            sim_env.reset()  # initialize internal state before stepping
            # Compute one-step prediction error on the real transitions
            error = 0
            for obs, action, real_next in real_data:
                sim_env.state = np.array(obs, dtype=np.float64)
                sim_env.steps_beyond_terminated = None  # allow stepping from any state
                sim_next, _, _, _, _ = sim_env.step(action)
                error += np.sum((sim_next - real_next) ** 2)
            if error < best_error:
                best_error = error
                best_params = {"gravity": gravity, "masscart": mass}
print(f"Best sim parameters: {best_params} (error: {best_error:.4f})")
    return best_params
Step 7: Progressive transfer
Train on increasingly realistic simulations:
def progressive_transfer(base_model, stages, steps_per_stage=30_000):
"""Fine-tune through progressively more realistic simulators."""
model = base_model
for i, params in enumerate(stages):
print(f"\nStage {i+1}: {params}")
stage_env = ModifiedCartPole(**params)
# Continue training from previous model
model.set_env(stage_env)
model.learn(total_timesteps=steps_per_stage, reset_num_timesteps=False)
# Evaluate
perf, _ = evaluate_in_env(model, stage_env, n_episodes=20)
print(f" Performance: {perf:.1f}")
return model
# Define progressive stages toward reality
stages = [
{"gravity": 10.0, "masscart": 1.1, "obs_noise_std": 0.01}, # slight change
{"gravity": 11.0, "masscart": 1.3, "obs_noise_std": 0.03}, # more change
{"gravity": 12.0, "masscart": 1.5, "obs_noise_std": 0.05}, # target "reality"
]
# progressive_model = progressive_transfer(dr_model, stages)
Real-world examples
OpenAI Rubik’s Cube (2019)
Trained a dexterous robot hand to solve a Rubik’s cube. Key techniques:
- Massive domain randomization (cube size, friction, mass, visual appearance)
- Automatic domain randomization (ADR): gradually increase randomization range based on policy performance
- Trained entirely in simulation, transferred zero-shot to real robot
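The ADR idea can be sketched in a few lines: track a randomization range per parameter and widen it whenever the policy copes with the current boundary. The class name, thresholds, and step sizes below are illustrative choices, not the values from the paper:

```python
import numpy as np

class ADRScheduler:
    """Minimal sketch of automatic domain randomization for one
    physics parameter. Thresholds and step sizes are illustrative."""
    def __init__(self, center, step=0.5, max_width=5.0,
                 expand_threshold=0.8, shrink_threshold=0.4):
        self.center = center
        self.width = 0.0  # current half-range around the nominal value
        self.step = step
        self.max_width = max_width
        self.expand_threshold = expand_threshold
        self.shrink_threshold = shrink_threshold

    def sample(self):
        # Draw an environment parameter from the current range
        return np.random.uniform(self.center - self.width,
                                 self.center + self.width)

    def update(self, success_rate):
        # Widen the range when the policy copes, narrow it when it fails
        if success_rate > self.expand_threshold:
            self.width = min(self.width + self.step, self.max_width)
        elif success_rate < self.shrink_threshold:
            self.width = max(self.width - self.step, 0.0)

gravity_adr = ADRScheduler(center=9.8)
for success in [0.9, 0.9, 0.3, 0.95]:  # stand-in evaluation results
    gravity_adr.update(success)
print(gravity_adr.width)  # 1.0
```

In a training loop you would call `gravity_adr.sample()` inside the environment's `reset()` and `gravity_adr.update()` after each evaluation round, so the curriculum tracks what the policy can currently handle.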
Drone navigation
The standard sim-to-real pipeline for autonomous drones:
- Train in simulator (AirSim, Gazebo, Isaac Gym)
- Domain randomize: wind, turbulence, payload mass, motor response, sensor noise
- System identification: measure real drone parameters (motor curves, IMU bias)
- Fine-tune in high-fidelity simulation tuned to real drone
- Deploy with safety constraints (attitude limits, emergency stop)
This is THE bottleneck for autonomous FPV drones. The sim-to-real gap is significant because:
- Aerodynamics are chaotic (turbulence near obstacles)
- Motor dynamics are nonlinear
- Visual appearance differs (lighting, texture)
- Latency varies (real hardware has processing delays)
Boston Dynamics
Their robots train fundamental locomotion skills in simulation with domain randomization, then fine-tune on hardware with conservative safety margins.
Summary of approaches
| Approach | Idea | Effort | Robustness |
|---|---|---|---|
| Domain randomization | Randomize everything during training | Low | Good if range covers reality |
| System identification | Tune simulator to match real system | High (needs real data) | Best for known system |
| Progressive transfer | Gradual adaptation toward reality | Medium | Good, controlled |
| Fine-tuning on real | Train briefly on real hardware | Needs real access | Best, but expensive |
| Meta-learning | Learn to adapt quickly to new dynamics | High | Good for new environments |
In practice, combine them: system identification to get a reasonable sim, domain randomization for robustness, progressive transfer if you have real access.
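As a sketch of that combination, you can center the randomization ranges on the identified parameters rather than on wide default intervals; the `rel_range` margin below is an assumed value, chosen to absorb identification error:

```python
import numpy as np

def randomized_params(identified, rel_range=0.2):
    """Sample physics parameters within +/- rel_range of the values
    found by system identification (rel_range=0.2 is an assumed margin)."""
    return {
        name: np.random.uniform(value * (1 - rel_range),
                                value * (1 + rel_range))
        for name, value in identified.items()
    }

# e.g. build each training episode's env around the identified parameters:
# env = ModifiedCartPole(**randomized_params(best_params))
```

Calling `randomized_params` in the environment factory at every reset gives domain randomization that is narrow enough to train quickly but still centered on the best estimate of reality.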
Self-test questions
- Why does a policy trained in simulation typically fail when deployed to real hardware?
- What is domain randomization, and why does it improve transfer?
- When would system identification be preferred over domain randomization?
- Why is progressive transfer useful when the target domain is known?
- What makes drone sim-to-real particularly challenging compared to manipulation?
Exercises
- Quantify the gap: Train on standard CartPole, test on 50 random variations. Plot reward vs gravity deviation. At what gravity deviation does the policy fail?
- Adaptive randomization: Start with narrow randomization ranges. If the agent succeeds > 80% of the time, widen the ranges. Implement this adaptive schedule and compare with fixed randomization.
- Visual domain randomization: If you have access to a simulator with visual output (e.g., Gymnasium with render), randomize visual properties (background color, object texture). Test if a vision-based policy transfers across visual variations.
Links
- Model-Based RL — simulation as world model
- Actor-Critic and PPO — PPO for policy training
- Reward Design and Curriculum — domain randomization as curriculum
- Case Study - RL System Design — drone navigation design
- RL Fundamentals — MDP assumptions that sim-to-real violates