Reward Design and Curriculum
What
The reward function is the most important and most underrated part of RL. It defines what the agent should learn. A bad reward function leads to a bad agent — not because the algorithm failed, but because you specified the wrong objective.
This note covers: reward shaping, reward hacking, sparse vs dense rewards, intrinsic motivation, curriculum learning, and RLHF.
The reward design problem
You want a drone to navigate to a target. What reward do you give?
| Reward design | What happens |
|---|---|
| +1 at target, 0 elsewhere | Agent can’t learn — never reaches target by random exploration (sparse reward) |
| -1 per step (time penalty) | Agent learns to reach target fast, but may take shortcuts through obstacles |
| -distance_to_target per step | Agent moves toward target, but may oscillate at local minima |
| -distance + obstacle_penalty + time_penalty | Usually works, but needs careful tuning of weights |
Every reward function is a hypothesis about what you want. Testing that hypothesis is the engineering challenge.
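The last row of the table can be sketched as a weighted sum. This is a minimal illustration, not a recipe: the weights and the obstacle radius below are placeholder values that need per-environment tuning.

```python
import numpy as np

def navigation_reward(pos, goal, obstacles, obstacle_radius=1.0,
                      w_dist=1.0, w_obstacle=10.0, w_time=0.1):
    """Combined reward: -distance + obstacle penalty + time penalty.

    pos, goal: 2-D positions; obstacles: list of 2-D obstacle positions.
    Weight values here are illustrative and need tuning per environment.
    """
    dist = np.linalg.norm(pos - goal)
    # Count obstacles the agent is currently too close to
    n_collisions = sum(
        1 for obs in obstacles if np.linalg.norm(pos - obs) < obstacle_radius
    )
    return -w_dist * dist - w_obstacle * n_collisions - w_time
```

Note the normalization issue this raises immediately: the distance term and the collision term live on different scales, which is exactly the weight-tuning problem the table warns about.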
Reward shaping
Add intermediate rewards to guide the agent toward the goal. Instead of only rewarding the final outcome, reward progress toward it.
Potential-based reward shaping
Ng et al. (1999) proved that if you shape the reward using a potential function Phi(s), the optimal policy is guaranteed to be preserved:
shaped_reward = original_reward + gamma * Phi(s') - Phi(s)
This adds a “bonus” for moving from low-potential states to high-potential states, without changing which policy is optimal.
```python
import numpy as np

def potential_based_shaping(state, next_state, original_reward, gamma,
                            potential_fn):
    """Add potential-based reward shaping.

    potential_fn: callable that takes a state and returns a scalar.
    Guarantee: the optimal policy is preserved.
    """
    phi_s = potential_fn(state)
    phi_s_next = potential_fn(next_state)
    shaped_reward = original_reward + gamma * phi_s_next - phi_s
    return shaped_reward

# Example: for navigation, potential = negative distance to goal
def navigation_potential(state, goal=np.array([10.0, 10.0])):
    return -np.linalg.norm(state[:2] - goal)

# The shaped reward gives a positive bonus for moving toward the goal
# and a negative penalty for moving away, without changing the optimal policy.
```

Non-potential shaping (dangerous)
If your shaping reward is NOT potential-based, it can change the optimal policy. The agent optimizes the shaped reward, which may not correspond to what you actually want.
Example: giving a constant +1 reward for being in a “good” region can cause the agent to stay there forever instead of progressing to the actual goal.
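A toy calculation makes the failure concrete. The numbers here are hypothetical (a +10 goal reward, a +1 per-step region bonus, gamma = 0.99), but the pattern is general: a constant bonus can make loitering out-earn finishing.

```python
# Hypothetical setup: reaching the goal pays +10 once; a "good" region pays
# a constant +1 shaping bonus every step. Discount factor gamma = 0.99.
gamma = 0.99

# Policy A: walk straight to the goal in 10 steps, collect +10 at the end.
return_finish = gamma**9 * 10.0

# Policy B: sit in the bonus region for 500 steps, collecting +1 each step.
return_loiter = sum(gamma**t for t in range(500))

print(f"finish: {return_finish:.1f}, loiter: {return_loiter:.1f}")
# Loitering earns roughly 10x the discounted return of finishing.
```

With potential-based shaping the loitering return telescopes away, which is exactly why the Ng et al. guarantee holds.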
Reward hacking
The agent finds a loophole in your reward function that gives high reward without actually doing what you intended.
Famous examples
- Coast Runners boat race: agent found that collecting power-ups in circles scored more than finishing the race
- Robot hand: rewarded for moving a ball to a target position, but the hand learned to move itself near the sensor instead of actually placing the ball
- Evolution simulation: creatures evolved to be very tall and then fall over — the reward was “velocity” and falling is fast
- RLHF sycophancy: language model learns to agree with everything the user says because users prefer agreement (reward hacking the human evaluator)
Why it happens
The reward function is a proxy for what you actually want. The agent optimizes the proxy exactly, and any gap between the proxy and the true objective is exploitable. This is Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
How to detect and fix
- Watch the agent play: does the behavior look right? Metrics can lie.
- Multiple metrics: track several measures, not just the reward. If reward goes up but other metrics don’t, something is wrong.
- Ablation: remove parts of the reward and see what the agent does.
- Constrained optimization: add hard constraints (safety limits) that can’t be hacked.
Sparse vs dense rewards
Sparse rewards
Only give reward when the task is completed (or failed). Honest but hard to learn from.
- Navigate to target: +1 when reached, 0 otherwise
- Pick up object: +1 when grasped, 0 otherwise
- Win the game: +1 at the end if won, -1 if lost
Problem: the agent may never reach the reward by random exploration. With 1000 steps and one reward at the end, the agent gets no learning signal for the other 999 steps.
Hindsight Experience Replay (HER)
Clever solution for sparse rewards: if the agent failed to reach goal G but ended up at state S, relabel the experience with S as the goal. The agent “succeeded” at reaching S, so it gets reward and learns something useful.
```python
import numpy as np

def hindsight_relabel(trajectory, achieved_goal):
    """Relabel a failed trajectory with the goal that was actually achieved.

    trajectory: list of (state, action, reward, next_state, goal) tuples.
    achieved_goal: where the agent actually ended up.
    """
    relabeled = []
    for state, action, _, next_state, _ in trajectory:
        # Check if the agent reached the new (hindsight) goal
        dist = np.linalg.norm(next_state[:2] - achieved_goal)
        new_reward = 1.0 if dist < 0.5 else 0.0  # 0.5 = success radius
        relabeled.append((state, action, new_reward, next_state, achieved_goal))
    return relabeled
```

Intrinsic motivation
Give the agent internal reward for exploring novel states, independent of the task reward. The agent is “curious.”
Curiosity-driven exploration (ICM)
The Intrinsic Curiosity Module gives bonus reward proportional to prediction error:
intrinsic_reward = || predicted_next_feature - actual_next_feature ||^2
If the agent can’t predict what will happen, the situation is novel and worth exploring.
```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    """Intrinsic Curiosity Module.

    Gives high reward for transitions the agent can't predict.
    """
    def __init__(self, obs_dim, action_dim, feature_dim=64):
        super().__init__()
        # Encode observations to features
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )
        # Forward model: predict next features from current features + action
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def intrinsic_reward(self, obs, action, next_obs):
        """Compute the curiosity reward (no gradients needed here)."""
        with torch.no_grad():
            phi_s = self.encoder(obs)
            phi_s_next = self.encoder(next_obs)
            # One-hot encode the action first if it is discrete
            pred_phi_next = self.forward_model(torch.cat([phi_s, action], dim=-1))
            # Prediction error = surprise = curiosity reward
            return 0.5 * (pred_phi_next - phi_s_next).pow(2).sum(dim=-1)

    def loss(self, obs, action, next_obs):
        """Train the forward model to reduce prediction error on visited states."""
        phi_s = self.encoder(obs)
        phi_s_next = self.encoder(next_obs)
        pred_phi_next = self.forward_model(torch.cat([phi_s, action], dim=-1))
        return nn.functional.mse_loss(pred_phi_next, phi_s_next.detach())
```

Random Network Distillation (RND)
Simpler than ICM. Uses a fixed random network as target, and a predictor network that learns to match it. States that have been seen many times have low prediction error (the predictor has learned them). Novel states have high prediction error.
```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation exploration bonus."""
    def __init__(self, obs_dim, feature_dim=64):
        super().__init__()
        # Fixed random target network (never trained)
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )
        for p in self.target.parameters():
            p.requires_grad = False
        # Predictor network (trained to match the target)
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def exploration_bonus(self, obs):
        """High bonus for novel states (the predictor hasn't learned them yet)."""
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # Without a detach, this same quantity serves as the predictor's loss
        return (pred_feat - target_feat).pow(2).sum(dim=-1)
```

Curriculum learning
Start with easy tasks, gradually increase difficulty. Humans learn this way — you don’t hand a calculus textbook to someone who hasn’t done arithmetic.
Manual curriculum
Design a sequence of progressively harder environments.
```python
def get_curriculum_env(difficulty):
    """Create environments of increasing difficulty."""
    import gymnasium as gym
    if difficulty == 0:
        # Easy: low-dimensional state, discrete actions
        return gym.make("CartPole-v1")
    elif difficulty == 1:
        # Medium: discrete actions, harder dynamics
        return gym.make("LunarLander-v2")
    elif difficulty == 2:
        # Hard: continuous control
        return gym.make("BipedalWalker-v3")
    raise ValueError(f"Unknown difficulty: {difficulty}")

def train_with_curriculum(agent, difficulties, steps_per_level=100_000):
    """Train an agent through progressive difficulty levels.

    Assumes an SB3-style agent with set_env() and learn().
    """
    for diff in difficulties:
        env = get_curriculum_env(diff)
        print(f"Training on difficulty {diff}...")
        agent.set_env(env)
        agent.learn(total_timesteps=steps_per_level)
        print(f"Difficulty {diff} complete. Moving to next level.")
```

Domain randomization
Instead of fixed curriculum, randomize environment parameters during training. The agent learns to be robust to variation.
```python
import numpy as np

def randomize_env_params():
    """Randomize environment parameters for robust training.

    Used in sim-to-real transfer (see Tutorial - Sim-to-Real Transfer).
    """
    params = {
        "gravity": np.random.uniform(8.0, 12.0),  # Earth ~9.8
        "friction": np.random.uniform(0.5, 1.5),
        "wind": np.random.uniform(-2.0, 2.0),
        "obs_noise_std": np.random.uniform(0.0, 0.1),
    }
    return params
```

Automatic curriculum generation
Let the environment difficulty adapt to the agent’s performance:
- If the agent succeeds > 80% of the time → make it harder
- If the agent succeeds < 20% of the time → make it easier
- Keep the agent in the “zone of proximal development”
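The rule above can be sketched as a small difficulty controller. The thresholds, step size, and difficulty range are the hypothetical knobs here; real systems tune them per task.

```python
def update_difficulty(difficulty, success_rate, step=0.1,
                      min_diff=0.0, max_diff=1.0,
                      upper=0.8, lower=0.2):
    """Keep the agent near its edge of competence.

    success_rate: fraction of recent evaluation episodes that succeeded.
    All thresholds are illustrative and task-dependent.
    """
    if success_rate > upper:          # too easy -> make it harder
        return min(max_diff, difficulty + step)
    if success_rate < lower:          # too hard -> make it easier
        return max(min_diff, difficulty - step)
    return difficulty                 # in the zone: leave it alone
```

Called after each evaluation round, this keeps the success rate oscillating inside the [lower, upper] band rather than saturating at 0% or 100%.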
RLHF: human feedback as reward
Reinforcement Learning from Human Feedback. Instead of a programmatic reward, use human preferences.
1. Generate two outputs (a, b) from the model
2. Human says: "a is better than b"
3. Train a reward model to predict human preferences
4. Use the reward model as the RL reward signal
5. Optimize the policy with PPO
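Step 3 is typically a Bradley-Terry preference model. A minimal sketch, assuming outputs have already been mapped to fixed-size feature vectors (the network sizes are placeholders):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scalar reward head over a fixed-size representation of an output."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(reward_model, preferred, rejected):
    """Bradley-Terry loss: maximize P(preferred beats rejected)
    = sigmoid(r_preferred - r_rejected)."""
    r_a = reward_model(preferred)
    r_b = reward_model(rejected)
    return -nn.functional.logsigmoid(r_a - r_b).mean()
```

Minimizing this loss pushes the reward gap r_a - r_b up on labeled pairs; the trained model then scores outputs in step 4.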
This is how ChatGPT, Claude, and most modern language models are trained. See RLHF and Alignment for details.
The challenge: the reward model is a learned proxy for human preferences. If it’s wrong (or if users have systematic biases), the agent hacks the reward model instead of actually being helpful. This is reward hacking at the meta-level.
Practical advice
- Start with the simplest reward that could work. Dense, hand-crafted, explicit. Only add complexity when the simple version fails.
- Watch your agent. Don’t just look at reward curves. Visualize the behavior. Reward hacking is invisible in metrics.
- Use potential-based shaping if you need to guide the agent. It’s the only provably safe shaping method.
- Multiple reward terms? Normalize them. If term A ranges [-100, 100] and term B ranges [-1, 1], term A dominates and term B is ignored.
- Curriculum helps more than reward shaping in many cases. Making the task easier to explore is better than adding artificial rewards.
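The normalization bullet can be implemented with a running scale estimate per reward term (this sketch uses Welford's online variance algorithm):

```python
class RunningScale:
    """Track a running std per reward term so differently-scaled terms
    (e.g. one in [-100, 100], another in [-1, 1]) contribute comparably."""
    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.eps = eps

    def update(self, x):
        # Welford's online update for mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = (self.m2 / max(self.count, 1)) ** 0.5
        return x / (std + self.eps)
```

Keep one `RunningScale` per term, `update` it every step, and sum the normalized terms; the explicit weights then express relative priority instead of compensating for units.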
Self-test questions
- What is Goodhart’s Law, and how does it apply to reward design in RL?
- Why does potential-based reward shaping preserve the optimal policy, while arbitrary shaping might not?
- Give three examples of reward hacking. What do they have in common?
- When would you prefer intrinsic motivation (ICM/RND) over reward shaping?
- What is the relationship between RLHF and reward hacking?
Exercises
- Shaped vs sparse: Train PPO on a navigation task with sparse reward (+1 at goal) and with shaped reward (-distance per step). Compare learning curves and final behavior. Does the shaped reward lead to different behavior than intended?
- Curiosity exploration: Implement RND and add it as exploration bonus in a sparse-reward environment (MountainCar-v0 is a classic choice). Compare exploration behavior with and without curiosity.
- Design a curriculum: For the BipedalWalker-v3 environment, design a 3-stage curriculum (easy → medium → hard terrain). Train with and without curriculum. Report learning speed difference.
Links
- RL Fundamentals — reward function in the MDP framework
- Actor-Critic and PPO — PPO uses the reward you design
- Multi-Agent RL — reward design is harder in multi-agent (credit assignment)
- RLHF and Alignment — human feedback as reward
- Tutorial - PPO from Scratch — implement the agent that uses these rewards
- Tutorial - Sim-to-Real Transfer — domain randomization as curriculum