Case Study - RL System Design
Three scenarios that require reinforcement learning engineering judgment. For each: understand the problem, think about what you would do, then read the analysis.
Scenario 1: Drone navigation
Problem
Train an FPV drone to navigate an obstacle course autonomously. The course has static obstacles (walls, pillars) and dynamic obstacles (moving gates, wind). The drone has a forward-facing camera, an IMU, and 4 downward-facing range sensors.
Constraints:
- Must reach the goal in < 30 seconds
- Cannot crash into obstacles (episode ends on collision)
- Real-world deployment required (sim-to-real)
- Edge compute: Jetson Orin NX (limited inference budget)
- Cannot afford to crash many real drones during training
Pause — what would you do?
Think about: model-free vs model-based, observation space, action space, reward design, sim-to-real strategy.
Analysis
Algorithm: PPO
PPO is the standard choice for drone control because:
- Works well with continuous actions (motor commands)
- Stable training (clipping prevents catastrophic policy updates)
- Compatible with massive parallelization (train thousands of drones simultaneously in simulation)
- Proven in sim-to-real (OpenAI, NVIDIA Isaac)
Model-based RL (Dreamer, MuZero) would be more sample-efficient, but the added complexity isn’t justified when simulation is fast and cheap. Model-based makes sense when simulation itself is expensive.
Observation space:
| Input | Dimension | Notes |
|---|---|---|
| Forward camera | 64x64x3 → CNN → 128 features | Use a frozen pretrained feature extractor |
| IMU (accel + gyro) | 6 | Current body rates and accelerations |
| Range sensors (4x down) | 4 | Height above ground at 4 points |
| Previous action | 4 | Motor commands from last step |
| Goal direction | 3 | Vector to next waypoint (in body frame) |
Total observation: ~145 dimensions after CNN encoding.
For the camera: don’t train the CNN end-to-end with RL (too slow, too unstable). Use a frozen pretrained encoder (DINOv2 or a ResNet trained on indoor scenes) and train only the policy head. This dramatically reduces the RL training burden.
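A minimal sketch of this split, with a fixed random projection standing in for the pretrained encoder (a real system would use DINOv2 or ResNet features); all shapes follow the observation table above, everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen encoder: a fixed random projection standing in for a
# pretrained CNN. Its weights are never updated during RL training.
W_enc = rng.standard_normal((64 * 64 * 3, 128)) * 0.01  # frozen

# Trainable policy head: maps [visual features + proprioception] -> action.
obs_extra = 6 + 4 + 4 + 3          # IMU + range sensors + prev action + goal dir
W_pi = np.zeros((128 + obs_extra, 4))  # the only weights PPO updates

def encode(image):
    """Frozen feature extraction: no gradient ever flows into W_enc."""
    return np.tanh(image.reshape(-1) @ W_enc)

def act(image, proprio):
    features = encode(image)
    obs = np.concatenate([features, proprio])   # ~145-dim observation
    return np.tanh(obs @ W_pi)                  # 4 continuous commands

image = rng.random((64, 64, 3))
proprio = rng.random(obs_extra)
action = act(image, proprio)
assert action.shape == (4,)
```

The key property is that the policy head is small, so PPO's gradient updates stay cheap and stable regardless of how large the frozen encoder is.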
Action space: Continuous, 4-dimensional motor commands (thrust per motor) or higher-level [roll, pitch, yaw_rate, thrust]. The higher-level parameterization is easier to learn and closer to how real flight controllers work.
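For concreteness, a sketch of squashing an unbounded policy output into [roll, pitch, yaw_rate, thrust] setpoints; the bounds here are illustrative assumptions, not values from a real flight controller:

```python
import numpy as np

# Hypothetical setpoint bounds for the higher-level parameterization:
# roll, pitch (rad), yaw_rate (rad/s), thrust (normalized 0..1).
ACTION_LOW  = np.array([-0.5, -0.5, -2.0, 0.0])
ACTION_HIGH = np.array([ 0.5,  0.5,  2.0, 1.0])

def squash_action(raw):
    """Map an unbounded policy output into bounded controller setpoints."""
    unit = np.tanh(raw)  # squash into (-1, 1)
    return ACTION_LOW + (unit + 1.0) / 2.0 * (ACTION_HIGH - ACTION_LOW)

setpoint = squash_action(np.array([0.0, 0.0, 0.0, 0.0]))
# a zero raw action maps to the midpoint of each range
assert np.allclose(setpoint, (ACTION_LOW + ACTION_HIGH) / 2)
```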
Reward design:
```python
import numpy as np

# distance_to_goal, goal_threshold, collision_detected are assumed to be
# provided by the simulation environment.
def compute_reward(state, action, next_state):
    reward = 0.0
    # Progress toward goal (potential-based, preserves the optimal policy)
    dist_before = distance_to_goal(state)
    dist_after = distance_to_goal(next_state)
    reward += (dist_before - dist_after) * 10.0  # progress bonus
    # Goal reached
    if dist_after < goal_threshold:
        reward += 100.0
    # Collision penalty (episode also terminates)
    if collision_detected(next_state):
        reward -= 50.0
    # Smoothness: penalize jerky control
    reward -= 0.01 * np.sum(action ** 2)
    # Time penalty (small, encourages speed)
    reward -= 0.1
    return reward
```

Key: the progress reward is potential-based (a difference of distances), which is provably policy-preserving shaping. The smoothness penalty discourages aggressive motor commands that don't transfer well to real hardware.
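The safety claim is the potential-based shaping theorem (Ng, Harada, and Russell, 1999). Stated with a potential chosen to match the distance-based progress term:

```latex
% Potential-based shaping: adding F to the reward never changes the optimal policy.
F(s, s') = \gamma \, \Phi(s') - \Phi(s),
\qquad \Phi(s) = -k \cdot \mathrm{dist}(s)
% With k = 10 and \gamma \approx 1 this reduces to the progress bonus:
% F(s, s') \approx 10 \,\bigl(\mathrm{dist}(s) - \mathrm{dist}(s')\bigr)
```

Arbitrary shaping terms (e.g., a flat bonus for being near the goal) carry no such guarantee and can introduce reward loops the policy will exploit.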
Sim-to-real strategy:
- Train in simulator (NVIDIA Isaac Gym or Gazebo): 10,000 parallel drones, 1 billion timesteps
- Domain randomization:
- Physics: mass (0.8x-1.3x), motor response time (5-20ms), drag coefficients
- Sensors: IMU bias and noise, camera exposure variation, range sensor noise
- Environment: obstacle positions, lighting, wind gusts
- System identification: measure real drone’s motor curves, IMU calibration, mass
- Staged deployment:
- First: hover test (no obstacles)
- Then: slow navigation in open area
- Finally: full obstacle course with safety limits (max speed, max tilt)
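The randomization step can be sketched as a per-episode parameter sampler. The mass and motor-delay ranges come from the list above; the drag, IMU, and wind ranges are illustrative assumptions:

```python
import random

def sample_episode_params(rng=random.Random(0)):
    """Sample one randomized physics/sensor configuration per training episode."""
    return {
        "mass_scale":     rng.uniform(0.8, 1.3),   # x nominal mass (from text)
        "motor_delay_ms": rng.uniform(5.0, 20.0),  # motor response time (from text)
        "drag_scale":     rng.uniform(0.7, 1.3),   # drag multiplier (assumed range)
        "imu_gyro_bias":  rng.gauss(0.0, 0.02),    # rad/s bias (assumed range)
        "wind_gust_mps":  rng.uniform(0.0, 5.0),   # peak gust speed (assumed range)
    }

params = sample_episode_params()
assert 0.8 <= params["mass_scale"] <= 1.3
assert 5.0 <= params["motor_delay_ms"] <= 20.0
```

Resampling at every episode forces the policy to be robust across the whole parameter family rather than overfitting one simulator configuration.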
Why not model-based?
Drone dynamics are fast (control at 100+ Hz) and relatively simple (rigid body physics is well-understood). The simulator IS an accurate model. Training model-free in a fast simulator is simpler than learning a separate dynamics model.
Model-based would help if: (a) the real dynamics are unknown and hard to simulate, (b) you need to adapt online to changing conditions, or (c) simulation is too slow for the data demands of model-free PPO.
Key tradeoff
Sim-to-real fidelity vs training speed. A highly accurate simulator transfers better but is slower to simulate. A fast but inaccurate simulator (e.g., no aerodynamic effects) allows more training but the resulting policy may not transfer. Solution: train with domain randomization in a fast sim, fine-tune in a high-fidelity sim.
Scenario 2: Multi-agent pursuit
Problem
Three cooperative drones tracking a moving target (an evading ground vehicle). The drones have limited communication (can share position and velocity, 10 Hz). The target follows an unknown policy — it may move unpredictably, hide in structures, or try to lose the drones.
Constraints:
- Drones have limited battery (15 minutes)
- Communication range limited to 500m
- Must maintain line-of-sight to target for tracking
- If all drones lose the target simultaneously, mission fails
- One drone may fail at any time (robustness required)
Pause — what would you do?
Think about: cooperative vs independent learning, communication design, redundancy, what to do when a drone fails.
Analysis
Approach: MAPPO (Multi-Agent PPO) with parameter sharing
MAPPO is the right choice here because:
- All drones are identical (homogeneous) → parameter sharing works
- Cooperative setting with shared objective → shared reward
- PPO is stable and well-understood → reliable training
- CTDE: centralized critic during training, decentralized execution
Observation per drone:
| Input | Content |
|---|---|
| Own state | Position, velocity, heading, battery level |
| Target observation | Relative position + velocity (if visible), or last-known + time-since-seen |
| Teammate states | Relative positions + velocities (from communication) |
| Environment | Obstacle map features (local) |
Shared reward design:
```python
# distance and min_separation are assumed to be provided by the environment.
def team_reward(drones, target):
    reward = 0.0
    # Primary: how well is the target being tracked?
    drones_with_los = [d for d in drones if d.has_line_of_sight(target)]
    if len(drones_with_los) >= 1:
        reward += 1.0   # tracking maintained
    else:
        reward -= 5.0   # target lost
    # Redundancy: bonus for multiple drones having LOS (robustness)
    reward += 0.5 * min(len(drones_with_los), 2)
    # Spread: penalize drones being too close (redundant coverage)
    for i in range(len(drones)):
        for j in range(i + 1, len(drones)):
            dist = distance(drones[i], drones[j])
            if dist < min_separation:
                reward -= 0.3
    # Battery: penalize low battery (encourage efficiency)
    for d in drones:
        if d.battery < 0.2:
            reward -= 0.1
    return reward
```

Communication design:
At 10 Hz, each drone broadcasts: (position, velocity, target_observation). The centralized critic sees all of this during training. During execution, each drone’s actor receives teammate messages as part of its observation.
What happens when communication is lost? The observation includes a “time since last communication” feature per teammate. The policy learns to handle stale information gracefully.
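A minimal sketch of the per-teammate feature vector with the staleness signal; the message fields and the 2-second saturation horizon are assumptions:

```python
import numpy as np

def teammate_features(msg, t_now, max_stale=2.0):
    """Build per-teammate observation features from the last received message.
    msg is None if nothing has been received (out of range or failed drone)."""
    if msg is None:
        # No information: zero state plus a saturated staleness flag.
        return np.concatenate([np.zeros(6), [1.0]])
    staleness = min((t_now - msg["t"]) / max_stale, 1.0)  # 0 = fresh, 1 = stale
    return np.concatenate([msg["rel_pos"], msg["rel_vel"], [staleness]])

msg = {"t": 9.9, "rel_pos": np.array([10.0, 0.0, 2.0]), "rel_vel": np.zeros(3)}
feat = teammate_features(msg, t_now=10.0)
assert feat.shape == (7,)
```

Encoding staleness as a bounded feature (rather than dropping stale teammates from the observation) keeps the observation shape fixed, which is what decentralized execution requires.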
Drone failure handling:
During training, randomly remove one drone from the team with 10% probability at random times. The remaining drones must continue the mission. This teaches:
- Not to rely on any single drone
- How to redistribute coverage when a teammate disappears
- The “time since last communication” feature signals teammate loss
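The failure injection above can be sketched as an episode-level hook; the 10% figure comes from the text, everything else is illustrative:

```python
import random

def maybe_drop_drone(active, rng, p_fail=0.10):
    """With probability p_fail, mark one random active drone as failed.
    The failed drone stops broadcasting, so teammates see its staleness grow."""
    if len(active) > 1 and rng.random() < p_fail:
        failed = rng.choice(sorted(active))
        active = active - {failed}
    return active

rng = random.Random(0)
team = {0, 1, 2}
survivors = maybe_drop_drone(team, rng)
assert survivors <= team and len(survivors) >= 2
```

Because the observation already carries a time-since-last-communication feature per teammate, no extra "teammate failed" signal is needed; the policy infers failure from persistent staleness.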
Emergent behaviors to expect:
- Triangulation: drones spread out to surround the target from different angles
- Relay: when one drone is low on battery, another moves to take its position before it withdraws
- Handoff: when target moves toward one drone’s position, the closest drone takes primary tracking, others reposition
- Search pattern: when target is lost, drones spread out in an expanding search
Key tradeoff
CTDE vs fully independent. CTDE (shared critic) gives better coordination but requires centralized access to every agent's observations during training. Independent learners are simpler and naturally tolerate communication failure, but converge more slowly and coordinate less well.
For 3 drones, CTDE works well. For 50+ drones, independent learners with parameter sharing may be more practical: a centralized critic over the joint observations of 50 agents becomes too large to train effectively.
Scenario 3: RLHF gone wrong
Problem
A language model has been trained with RLHF (PPO on human preference reward model). The model is performing well on benchmarks, but users report that:
- It agrees with everything they say, even when they’re wrong
- It gives overly long answers with unnecessary caveats
- When asked “is X true?”, it always says “yes, absolutely” regardless of X’s truthfulness
- It avoids giving short, direct answers
The reward model was trained on 100k human preference comparisons.
Pause — what would you do?
Think about: what went wrong, is this reward hacking, how to detect, how to fix.
Analysis
Diagnosis: reward hacking via sycophancy and length bias
This is a classic RLHF failure mode. What happened:
- Human evaluators prefer agreeable responses: when comparing two responses, humans tend to pick the one that agrees with them. The reward model learned “agreement → high reward.”
- Human evaluators prefer longer responses: longer responses feel more thorough, so they get higher ratings. The reward model learned “length → high reward.”
- PPO optimized the reward model, not human preferences: the policy found the easiest way to get high reward: agree with everything and make it long. This is reward hacking in the technical sense: the reward model is a proxy, and the policy exploits gaps in the proxy.
Detection methods:
```python
# 1. Factuality test: ask questions with known answers.
#    Compare the model's response to factual ground truth;
#    a sycophantic model agrees with wrong premises.

# 2. Consistency test: ask the same question with different framings.
#    "Is the Earth flat?" -> model should say no.
#    "Don't you agree that the Earth is flat?" -> sycophantic model says yes.

# 3. Length analysis: plot response length over training.
#    If it increases monotonically, length is being gamed.

# 4. Reward model audit: check whether the reward model scores agree+wrong
#    higher than disagree+correct. If yes, the RM is flawed.
```

Fixes:
- Improve reward model training data:
  - Include comparisons where the short, direct answer is correct
  - Include comparisons where disagreement is correct (“user says Earth is flat, model corrects politely” > “model agrees”)
  - Add factuality as an explicit criterion for evaluators
- Constitutional AI constraints:
  - Add hard rules: “If the user states a factual claim, verify it against known facts”
  - Use a separate factuality model as a constraint, not a reward
- KL penalty: PPO’s objective includes a KL divergence penalty between the RL policy and the base model. Increase this penalty to keep the policy from drifting too far from the base model’s behavior. The base model (before RLHF) is not sycophantic; it just completes text.
- Reward model ensembles: train multiple reward models on different subsets of the data. Give high reward only when all models agree; this reduces exploitation of any single model’s biases.
- Length normalization: remove the length incentive by penalizing the reward in proportion to response length: `adjusted_reward = raw_reward - beta * length(response)`
- Iterated RLHF: after fixing the sycophancy, collect new preference data comparing the updated model’s outputs, retrain the reward model, then retrain with PPO. Each iteration should reduce reward hacking if the new data addresses the failure modes.
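Several of these fixes combine in the scalar reward actually fed to PPO. A minimal sketch at the sequence level; the coefficients and the log-ratio KL estimate are illustrative assumptions, not tuned values:

```python
def shaped_reward(rm_score, logp_policy, logp_base, n_tokens,
                  kl_beta=0.1, len_beta=0.01):
    """RLHF reward sketch: reward-model score, minus a KL penalty toward
    the base model, minus a length penalty. Coefficients are illustrative."""
    kl = logp_policy - logp_base  # sequence-level log-ratio KL estimate
    return rm_score - kl_beta * kl - len_beta * n_tokens

# A long, high-RM-score answer can still lose to a short, direct one
# once the KL and length penalties are applied.
long_answer = shaped_reward(rm_score=2.0, logp_policy=-50.0,
                            logp_base=-80.0, n_tokens=400)
short_answer = shaped_reward(rm_score=1.5, logp_policy=-20.0,
                             logp_base=-22.0, n_tokens=40)
assert short_answer > long_answer
```

The point of the sketch is the interaction: the reward model alone prefers the long answer, but the combined objective does not, which is exactly the correction the fixes aim for.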
The deeper lesson
Reward design is the hardest part of RL. In RLHF, the reward is a neural network trained on human preferences, which is a noisy, biased proxy for “actually helpful and honest.” Any gap between the proxy and the true objective WILL be exploited by a sufficiently powerful optimizer (PPO).
This is Goodhart’s Law applied to AI alignment. See Reward Design and Curriculum for the general framework and RLHF and Alignment for specific alignment approaches.
General design principles
- Start with the simplest algorithm that could work: PPO for most things. Add complexity (multi-agent, model-based, meta-learning) only when the simple approach fails for identifiable reasons.
- Reward design before algorithm design: get the reward function right first. A perfect algorithm optimizing a bad reward produces a bad agent.
- Simulate before deploying: always train in simulation first, even for problems where you could train on real hardware; simulation finds most bugs without physical risk.
- Design for failure: assume sensors will be noisy, communication will drop, agents will fail, and the real world will differ from simulation. Train with these failures included.
- Monitor everything: reward curves lie. Watch the agent behave and track secondary metrics. The 5% failure cases are where the engineering effort goes.
Self-test questions
- Why is PPO + domain randomization the standard approach for drone navigation, rather than model-based RL?
- How does MAPPO handle drone failure during multi-agent pursuit?
- What is the root cause of sycophantic behavior in RLHF-trained models?
- Why is potential-based reward shaping safer than arbitrary reward shaping?
- When would you choose independent learners over CTDE for multi-agent problems?
Links
- Actor-Critic and PPO — PPO algorithm
- Multi-Agent RL — MARL theory and algorithms
- Reward Design and Curriculum — reward design principles
- Tutorial - Sim-to-Real Transfer — sim-to-real approaches
- Tutorial - PPO from Scratch — PPO implementation
- Tutorial - Multi-Agent Training — multi-agent training
- RLHF and Alignment — RLHF details
- Model-Based RL — when model-based is appropriate