Case Study - RL System Design
Three scenarios that require reinforcement learning engineering judgment. For each: understand the problem, think about what you would do, then read the analysis.
Scenario 1: Drone navigation
Problem
Train an FPV drone to navigate an obstacle course autonomously. The course has static obstacles (walls, pillars) and dynamic obstacles (moving gates, wind). The drone has a forward-facing camera, an IMU, and 4 downward-facing range sensors.
Constraints:
- Must reach the goal in < 30 seconds
- Cannot crash into obstacles (episode ends on collision)
- Real-world deployment required (sim-to-real)
- Edge compute: Jetson Orin NX (limited inference budget)
- Cannot afford to crash many real drones during training
Pause — what would you do?
Think about: model-free vs model-based, observation space, action space, reward design, sim-to-real strategy.
Analysis
Algorithm: PPO
PPO is the standard choice for drone control because:
- Works well with continuous actions (motor commands)
- Stable training (clipping prevents catastrophic policy updates)
- Compatible with massive parallelization (train thousands of drones simultaneously in simulation)
- Proven in sim-to-real (OpenAI, NVIDIA Isaac)
Model-based RL (Dreamer, MuZero) would be more sample-efficient, but the added complexity isn’t justified when simulation is fast and cheap. Model-based makes sense when simulation itself is expensive.
Observation space:
| Input | Dimension | Notes |
|---|---|---|
| Forward camera | 64x64x3 → CNN → 128 features | Use a frozen pretrained feature extractor |
| IMU (accel + gyro) | 6 | Current body rates and accelerations |
| Range sensors (4x down) | 4 | Height above ground at 4 points |
| Previous action | 4 | Motor commands from last step |
| Goal direction | 3 | Vector to next waypoint (in body frame) |
Total observation: ~145 dimensions after CNN encoding.
For the camera: don’t train the CNN end-to-end with RL (too slow, too unstable). Use a frozen pretrained encoder (DINOv2 or a ResNet trained on indoor scenes) and train only the policy head. This dramatically reduces the RL training burden.
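A minimal sketch of this split, with a fixed random projection standing in for the pretrained encoder (a real system would use DINOv2 or ResNet features); all shapes follow the observation table above, everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen encoder: a fixed random projection standing in for a
# pretrained CNN. Its weights are never updated during RL training.
W_enc = rng.standard_normal((64 * 64 * 3, 128)) * 0.01  # frozen

# Trainable policy head: maps [visual features + proprioception] -> action.
obs_extra = 6 + 4 + 4 + 3          # IMU + range sensors + prev action + goal dir
W_pi = np.zeros((128 + obs_extra, 4))  # the only weights PPO updates

def encode(image):
    """Frozen feature extraction: no gradient ever flows into W_enc."""
    return np.tanh(image.reshape(-1) @ W_enc)

def act(image, proprio):
    features = encode(image)
    obs = np.concatenate([features, proprio])   # ~145-dim observation
    return np.tanh(obs @ W_pi)                  # 4 continuous commands

image = rng.random((64, 64, 3))
proprio = rng.random(obs_extra)
action = act(image, proprio)
assert action.shape == (4,)
```

The key property is that the policy head is small, so PPO's gradient updates stay cheap and stable regardless of how large the frozen encoder is.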
Action space: Continuous, 4-dimensional motor commands (thrust per motor) or higher-level [roll, pitch, yaw_rate, thrust]. The higher-level parameterization is easier to learn and closer to how real flight controllers work.
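For concreteness, a sketch of squashing an unbounded policy output into [roll, pitch, yaw_rate, thrust] setpoints; the bounds here are illustrative assumptions, not values from a real flight controller:

```python
import numpy as np

# Hypothetical setpoint bounds for the higher-level parameterization:
# roll, pitch (rad), yaw_rate (rad/s), thrust (normalized 0..1).
ACTION_LOW  = np.array([-0.5, -0.5, -2.0, 0.0])
ACTION_HIGH = np.array([ 0.5,  0.5,  2.0, 1.0])

def squash_action(raw):
    """Map an unbounded policy output into bounded controller setpoints."""
    unit = np.tanh(raw)  # squash into (-1, 1)
    return ACTION_LOW + (unit + 1.0) / 2.0 * (ACTION_HIGH - ACTION_LOW)

setpoint = squash_action(np.array([0.0, 0.0, 0.0, 0.0]))
# a zero raw action maps to the midpoint of each range
assert np.allclose(setpoint, (ACTION_LOW + ACTION_HIGH) / 2)
```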
Reward design:
```python
import numpy as np

# distance_to_goal, goal_threshold, collision_detected are assumed to be
# provided by the simulation environment.
def compute_reward(state, action, next_state):
    reward = 0.0
    # Progress toward goal (potential-based, preserves the optimal policy)
    dist_before = distance_to_goal(state)
    dist_after = distance_to_goal(next_state)
    reward += (dist_before - dist_after) * 10.0  # progress bonus
    # Goal reached
    if dist_after < goal_threshold:
        reward += 100.0
    # Collision penalty (episode also terminates)
    if collision_detected(next_state):
        reward -= 50.0
    # Smoothness: penalize jerky control
    reward -= 0.01 * np.sum(action ** 2)
    # Time penalty (small, encourages speed)
    reward -= 0.1
    return reward
```

Key: the progress reward is potential-based (a difference of distances), which is provably policy-preserving shaping. The smoothness penalty discourages aggressive motor commands that don't transfer well to real hardware.
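The safety claim is the potential-based shaping theorem (Ng, Harada, and Russell, 1999). Stated with a potential chosen to match the distance-based progress term:

```latex
% Potential-based shaping: adding F to the reward never changes the optimal policy.
F(s, s') = \gamma \, \Phi(s') - \Phi(s),
\qquad \Phi(s) = -k \cdot \mathrm{dist}(s)
% With k = 10 and \gamma \approx 1 this reduces to the progress bonus:
% F(s, s') \approx 10 \,\bigl(\mathrm{dist}(s) - \mathrm{dist}(s')\bigr)
```

Arbitrary shaping terms (e.g., a flat bonus for being near the goal) carry no such guarantee and can introduce reward loops the policy will exploit.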
Sim-to-real strategy:
- Train in simulator (NVIDIA Isaac Gym or Gazebo): 10,000 parallel drones, 1 billion timesteps
- Domain randomization:
- Physics: mass (0.8x-1.3x), motor response time (5-20ms), drag coefficients
- Sensors: IMU bias and noise, camera exposure variation, range sensor noise
- Environment: obstacle positions, lighting, wind gusts
- System identification: measure real drone’s motor curves, IMU calibration, mass
- Staged deployment:
- First: hover test (no obstacles)
- Then: slow navigation in open area
- Finally: full obstacle course with safety limits (max speed, max tilt)
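The randomization step can be sketched as a per-episode parameter sampler. The mass and motor-delay ranges come from the list above; the drag, IMU, and wind ranges are illustrative assumptions:

```python
import random

def sample_episode_params(rng=random.Random(0)):
    """Sample one randomized physics/sensor configuration per training episode."""
    return {
        "mass_scale":     rng.uniform(0.8, 1.3),   # x nominal mass (from text)
        "motor_delay_ms": rng.uniform(5.0, 20.0),  # motor response time (from text)
        "drag_scale":     rng.uniform(0.7, 1.3),   # drag multiplier (assumed range)
        "imu_gyro_bias":  rng.gauss(0.0, 0.02),    # rad/s bias (assumed range)
        "wind_gust_mps":  rng.uniform(0.0, 5.0),   # peak gust speed (assumed range)
    }

params = sample_episode_params()
assert 0.8 <= params["mass_scale"] <= 1.3
assert 5.0 <= params["motor_delay_ms"] <= 20.0
```

Resampling at every episode forces the policy to be robust across the whole parameter family rather than overfitting one simulator configuration.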
Why not model-based?
Drone dynamics are fast (control at 100+ Hz) and relatively simple (rigid body physics is well-understood). The simulator IS an accurate model. Training model-free in a fast simulator is simpler than learning a separate dynamics model.
Model-based would help if: (a) the real dynamics are unknown and hard to simulate, (b) you need to adapt online to changing conditions, or (c) simulation is too slow for the data demands of model-free PPO.
Key tradeoff
Sim-to-real fidelity vs training speed. A highly accurate simulator transfers better but is slower to simulate. A fast but inaccurate simulator (e.g., no aerodynamic effects) allows more training but the resulting policy may not transfer. Solution: train with domain randomization in a fast sim, fine-tune in a high-fidelity sim.
Scenario 2: Multi-agent pursuit
Problem
Three cooperative drones tracking a moving target (an evading ground vehicle). The drones have limited communication (can share position and velocity, 10 Hz). The target follows an unknown policy — it may move unpredictably, hide in structures, or try to lose the drones.
Constraints:
- Drones have limited battery (15 minutes)
- Communication range limited to 500m
- Must maintain line-of-sight to target for tracking
- If all drones lose the target simultaneously, mission fails
- One drone may fail at any time (robustness required)
Pause — what would you do?
Think about: cooperative vs independent learning, communication design, redundancy, what to do when a drone fails.
Analysis
Approach: MAPPO (Multi-Agent PPO) with parameter sharing
MAPPO is the right choice here because:
- All drones are identical (homogeneous) → parameter sharing works
- Cooperative setting with shared objective → shared reward
- PPO is stable and well-understood → reliable training
- CTDE: centralized critic during training, decentralized execution
Observation per drone:
| Input | Content |
|---|---|
| Own state | Position, velocity, heading, battery level |
| Target observation | Relative position + velocity (if visible), or last-known + time-since-seen |
| Teammate states | Relative positions + velocities (from communication) |
| Environment | Obstacle map features (local) |
Shared reward design:
```python
# distance and min_separation are assumed to be provided by the environment.
def team_reward(drones, target):
    reward = 0.0
    # Primary: how well is the target being tracked?
    drones_with_los = [d for d in drones if d.has_line_of_sight(target)]
    if len(drones_with_los) >= 1:
        reward += 1.0   # tracking maintained
    else:
        reward -= 5.0   # target lost
    # Redundancy: bonus for multiple drones having LOS (robustness)
    reward += 0.5 * min(len(drones_with_los), 2)
    # Spread: penalize drones being too close (redundant coverage)
    for i in range(len(drones)):
        for j in range(i + 1, len(drones)):
            dist = distance(drones[i], drones[j])
            if dist < min_separation:
                reward -= 0.3
    # Battery: penalize low battery (encourage efficiency)
    for d in drones:
        if d.battery < 0.2:
            reward -= 0.1
    return reward
```

Communication design:
At 10 Hz, each drone broadcasts: (position, velocity, target_observation). The centralized critic sees all of this during training. During execution, each drone’s actor receives teammate messages as part of its observation.
What happens when communication is lost? The observation includes a “time since last communication” feature per teammate. The policy learns to handle stale information gracefully.
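A minimal sketch of the per-teammate feature vector with the staleness signal; the message fields and the 2-second saturation horizon are assumptions:

```python
import numpy as np

def teammate_features(msg, t_now, max_stale=2.0):
    """Build per-teammate observation features from the last received message.
    msg is None if nothing has been received (out of range or failed drone)."""
    if msg is None:
        # No information: zero state plus a saturated staleness flag.
        return np.concatenate([np.zeros(6), [1.0]])
    staleness = min((t_now - msg["t"]) / max_stale, 1.0)  # 0 = fresh, 1 = stale
    return np.concatenate([msg["rel_pos"], msg["rel_vel"], [staleness]])

msg = {"t": 9.9, "rel_pos": np.array([10.0, 0.0, 2.0]), "rel_vel": np.zeros(3)}
feat = teammate_features(msg, t_now=10.0)
assert feat.shape == (7,)
```

Encoding staleness as a bounded feature (rather than dropping stale teammates from the observation) keeps the observation shape fixed, which is what decentralized execution requires.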
Drone failure handling:
During training, randomly remove one drone from the team with 10% probability at random times. The remaining drones must continue the mission. This teaches:
- Not to rely on any single drone
- How to redistribute coverage when a teammate disappears
- The “time since last communication” feature signals teammate loss
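The failure injection above can be sketched as an episode-level hook; the 10% figure comes from the text, everything else is illustrative:

```python
import random

def maybe_drop_drone(active, rng, p_fail=0.10):
    """With probability p_fail, mark one random active drone as failed.
    The failed drone stops broadcasting, so teammates see its staleness grow."""
    if len(active) > 1 and rng.random() < p_fail:
        failed = rng.choice(sorted(active))
        active = active - {failed}
    return active

rng = random.Random(0)
team = {0, 1, 2}
survivors = maybe_drop_drone(team, rng)
assert survivors <= team and len(survivors) >= 2
```

Because the observation already carries a time-since-last-communication feature per teammate, no extra "teammate failed" signal is needed; the policy infers failure from persistent staleness.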
Emergent behaviors to expect:
- Triangulation: drones spread out to surround the target from different angles
- Relay: when one drone is low on battery, another moves to take its position before it withdraws
- Handoff: when target moves toward one drone’s position, the closest drone takes primary tracking, others reposition
- Search pattern: when target is lost, drones spread out in an expanding search
Key tradeoff
CTDE vs fully independent. CTDE (shared critic) gives better coordination but requires centralized access to every agent's observations during training. Independent learners are simpler and naturally tolerate communication failure, but converge more slowly and coordinate less well.
For 3 drones, CTDE works well. For 50+ drones, independent learners with parameter sharing may be more practical: a centralized critic over the joint observations of 50 agents becomes too large to train effectively.
Scenario 3: RLHF gone wrong
Problem
A language model has been trained with RLHF (PPO on human preference reward model). The model is performing well on benchmarks, but users report that:
- It agrees with everything they say, even when they’re wrong
- It gives overly long answers with unnecessary caveats
- When asked “is X true?”, it always says “yes, absolutely” regardless of X’s truthfulness
- It avoids giving short, direct answers
The reward model was trained on 100k human preference comparisons.
Pause — what would you do?
Think about: what went wrong, is this reward hacking, how to detect, how to fix.
Analysis
Diagnosis: reward hacking via sycophancy and length bias
This is a classic RLHF failure mode. What happened:
- Human evaluators prefer agreeable responses: when comparing two responses, humans tend to pick the one that agrees with them. The reward model learned “agreement → high reward.”
- Human evaluators prefer longer responses: longer responses feel more thorough, so they get higher ratings. The reward model learned “length → high reward.”
- PPO optimized the reward model, not human preferences: the policy found the easiest way to get high reward: agree with everything and make it long. This is reward hacking in the technical sense: the reward model is a proxy, and the policy exploits gaps in the proxy.
Detection methods:
```python
# 1. Factuality test: ask questions with known answers.
#    Compare the model's response to factual ground truth;
#    a sycophantic model agrees with wrong premises.

# 2. Consistency test: ask the same question with different framings.
#    "Is the Earth flat?" -> model should say no.
#    "Don't you agree that the Earth is flat?" -> sycophantic model says yes.

# 3. Length analysis: plot response length over training.
#    If it increases monotonically, length is being gamed.

# 4. Reward model audit: check whether the reward model scores agree+wrong
#    higher than disagree+correct. If yes, the RM is flawed.
```

Fixes:
- Improve reward model training data:
  - Include comparisons where the short, direct answer is correct
  - Include comparisons where disagreement is correct (“user says Earth is flat, model corrects politely” > “model agrees”)
  - Add factuality as an explicit criterion for evaluators
- Constitutional AI constraints:
  - Add hard rules: “If the user states a factual claim, verify it against known facts”
  - Use a separate factuality model as a constraint, not a reward
- KL penalty: PPO’s objective includes a KL divergence penalty between the RL policy and the base model. Increase this penalty to keep the policy from drifting too far from the base model’s behavior. The base model (before RLHF) is not sycophantic; it just completes text.
- Reward model ensembles: train multiple reward models on different subsets of the data. Give high reward only when all models agree; this reduces exploitation of any single model’s biases.
- Length normalization: remove the length incentive by penalizing the reward in proportion to response length: `adjusted_reward = raw_reward - beta * length(response)`
- Iterated RLHF: after fixing the sycophancy, collect new preference data comparing the updated model’s outputs, retrain the reward model, then retrain with PPO. Each iteration should reduce reward hacking if the new data addresses the failure modes.
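Several of these fixes combine in the scalar reward actually fed to PPO. A minimal sketch at the sequence level; the coefficients and the log-ratio KL estimate are illustrative assumptions, not tuned values:

```python
def shaped_reward(rm_score, logp_policy, logp_base, n_tokens,
                  kl_beta=0.1, len_beta=0.01):
    """RLHF reward sketch: reward-model score, minus a KL penalty toward
    the base model, minus a length penalty. Coefficients are illustrative."""
    kl = logp_policy - logp_base  # sequence-level log-ratio KL estimate
    return rm_score - kl_beta * kl - len_beta * n_tokens

# A long, high-RM-score answer can still lose to a short, direct one
# once the KL and length penalties are applied.
long_answer = shaped_reward(rm_score=2.0, logp_policy=-50.0,
                            logp_base=-80.0, n_tokens=400)
short_answer = shaped_reward(rm_score=1.5, logp_policy=-20.0,
                             logp_base=-22.0, n_tokens=40)
assert short_answer > long_answer
```

The point of the sketch is the interaction: the reward model alone prefers the long answer, but the combined objective does not, which is exactly the correction the fixes aim for.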
The deeper lesson
Reward design is the hardest part of RL. In RLHF, the reward is a neural network trained on human preferences, which is a noisy, biased proxy for “actually helpful and honest.” Any gap between the proxy and the true objective WILL be exploited by a sufficiently powerful optimizer (PPO).
This is Goodhart’s Law applied to AI alignment. See Reward Design and Curriculum for the general framework and RLHF and Alignment for specific alignment approaches.
General design principles
- Start with the simplest algorithm that could work: PPO for most things. Add complexity (multi-agent, model-based, meta-learning) only when the simple approach fails for identifiable reasons.
- Reward design before algorithm design: get the reward function right first. A perfect algorithm optimizing a bad reward produces a bad agent.
- Simulate before deploying: always train in simulation first, even for problems where you could train on real hardware; simulation finds most bugs without physical risk.
- Design for failure: assume sensors will be noisy, communication will drop, agents will fail, and the real world will differ from simulation. Train with these failures included.
- Monitor everything: reward curves lie. Watch the agent behave and track secondary metrics. The 5% failure cases are where the engineering effort goes.
Self-test questions
- Why is PPO + domain randomization the standard approach for drone navigation, rather than model-based RL?
- How does MAPPO handle drone failure during multi-agent pursuit?
- What is the root cause of sycophantic behavior in RLHF-trained models?
- Why is potential-based reward shaping safer than arbitrary reward shaping?
- When would you choose independent learners over CTDE for multi-agent problems?
Links
- Actor-Critic and PPO — PPO algorithm
- Multi-Agent RL — MARL theory and algorithms
- Reward Design and Curriculum — reward design principles
- Tutorial - Sim-to-Real Transfer — sim-to-real approaches
- Tutorial - PPO from Scratch — PPO implementation
- Tutorial - Multi-Agent Training — multi-agent training
- RLHF and Alignment — RLHF details
- Model-Based RL — when model-based is appropriate