Reward Design and Curriculum
What
The reward function is the most important and most underrated part of RL. It defines what the agent should learn. A bad reward function leads to a bad agent — not because the algorithm failed, but because you specified the wrong objective.
This note covers: reward shaping, reward hacking, sparse vs dense rewards, intrinsic motivation, curriculum learning, and RLHF.
The reward design problem
You want a drone to navigate to a target. What reward do you give?
| Reward design | What happens |
|---|---|
| +1 at target, 0 elsewhere | Agent can’t learn — never reaches target by random exploration (sparse reward) |
| -1 per step (time penalty) | Agent learns to reach target fast, but may take shortcuts through obstacles |
| -distance_to_target per step | Agent moves toward target, but may oscillate at local minima |
| -distance + obstacle_penalty + time_penalty | Usually works, but needs careful tuning of weights |
Every reward function is a hypothesis about what you want. Testing that hypothesis is the engineering challenge.
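The last row of the table can be sketched as a weighted sum. This is a minimal illustration, not a recipe: the weights and the obstacle radius below are placeholder values that need per-environment tuning.

```python
import numpy as np

def navigation_reward(pos, goal, obstacles, obstacle_radius=1.0,
                      w_dist=1.0, w_obstacle=10.0, w_time=0.1):
    """Combined reward: -distance + obstacle penalty + time penalty.

    pos, goal: 2-D positions; obstacles: list of 2-D obstacle positions.
    Weight values here are illustrative and need tuning per environment.
    """
    dist = np.linalg.norm(pos - goal)
    # Count obstacles the agent is currently too close to
    n_collisions = sum(
        1 for obs in obstacles if np.linalg.norm(pos - obs) < obstacle_radius
    )
    return -w_dist * dist - w_obstacle * n_collisions - w_time
```

Note the normalization issue this raises immediately: the distance term and the collision term live on different scales, which is exactly the weight-tuning problem the table warns about.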
Reward shaping
Add intermediate rewards to guide the agent toward the goal. Instead of only rewarding the final outcome, reward progress toward it.
Potential-based reward shaping
Ng et al. (1999) proved that if you shape the reward using a potential function Phi(s), the optimal policy is guaranteed to be preserved:
shaped_reward = original_reward + gamma * Phi(s') - Phi(s)
This adds a “bonus” for moving from low-potential states to high-potential states, without changing which policy is optimal.
```python
import numpy as np

def potential_based_shaping(state, next_state, original_reward, gamma,
                            potential_fn):
    """Add potential-based reward shaping.

    potential_fn: callable that takes a state and returns a scalar.
    Guarantee: the optimal policy is preserved.
    """
    phi_s = potential_fn(state)
    phi_s_next = potential_fn(next_state)
    shaped_reward = original_reward + gamma * phi_s_next - phi_s
    return shaped_reward

# Example: for navigation, potential = negative distance to goal
def navigation_potential(state, goal=np.array([10.0, 10.0])):
    return -np.linalg.norm(state[:2] - goal)

# The shaped reward gives a positive bonus for moving toward the goal
# and a negative penalty for moving away, without changing the optimal policy.
```

Non-potential shaping (dangerous)
If your shaping reward is NOT potential-based, it can change the optimal policy. The agent optimizes the shaped reward, which may not correspond to what you actually want.
Example: giving a constant +1 reward for being in a “good” region can cause the agent to stay there forever instead of progressing to the actual goal.
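A toy calculation makes the failure concrete. The numbers here are hypothetical (a +10 goal reward, a +1 per-step region bonus, gamma = 0.99), but the pattern is general: a constant bonus can make loitering out-earn finishing.

```python
# Hypothetical setup: reaching the goal pays +10 once; a "good" region pays
# a constant +1 shaping bonus every step. Discount factor gamma = 0.99.
gamma = 0.99

# Policy A: walk straight to the goal in 10 steps, collect +10 at the end.
return_finish = gamma**9 * 10.0

# Policy B: sit in the bonus region for 500 steps, collecting +1 each step.
return_loiter = sum(gamma**t for t in range(500))

print(f"finish: {return_finish:.1f}, loiter: {return_loiter:.1f}")
# Loitering earns roughly 10x the discounted return of finishing.
```

With potential-based shaping the loitering return telescopes away, which is exactly why the Ng et al. guarantee holds.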
Reward hacking
The agent finds a loophole in your reward function that gives high reward without actually doing what you intended.
Famous examples
- Coast Runners boat race: agent found that collecting power-ups in circles scored more than finishing the race
- Robot hand: rewarded for moving a ball to a target position, but the hand learned to move itself near the sensor instead of actually placing the ball
- Evolution simulation: creatures evolved to be very tall and then fall over — the reward was “velocity” and falling is fast
- RLHF sycophancy: language model learns to agree with everything the user says because users prefer agreement (reward hacking the human evaluator)
Why it happens
The reward function is a proxy for what you actually want. The agent optimizes the proxy exactly, and any gap between the proxy and the true objective is exploitable. This is Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
How to detect and fix
- Watch the agent play: does the behavior look right? Metrics can lie.
- Multiple metrics: track several measures, not just the reward. If reward goes up but other metrics don’t, something is wrong.
- Ablation: remove parts of the reward and see what the agent does.
- Constrained optimization: add hard constraints (safety limits) that can’t be hacked.
Sparse vs dense rewards
Sparse rewards
Only give reward when the task is completed (or failed). Honest but hard to learn from.
- Navigate to target: +1 when reached, 0 otherwise
- Pick up object: +1 when grasped, 0 otherwise
- Win the game: +1 at the end if won, -1 if lost
Problem: the agent may never reach the reward by random exploration. With 1000 steps and one reward at the end, the agent gets no learning signal for the other 999 steps.
Hindsight Experience Replay (HER)
Clever solution for sparse rewards: if the agent failed to reach goal G but ended up at state S, relabel the experience with S as the goal. The agent “succeeded” at reaching S, so it gets reward and learns something useful.
```python
import numpy as np

def hindsight_relabel(trajectory, achieved_goal):
    """Relabel a failed trajectory with the goal that was actually achieved.

    trajectory: list of (state, action, reward, next_state, goal) tuples.
    achieved_goal: where the agent actually ended up.
    """
    relabeled = []
    for state, action, _, next_state, _ in trajectory:
        # Check if the agent reached the new (hindsight) goal
        dist = np.linalg.norm(next_state[:2] - achieved_goal)
        new_reward = 1.0 if dist < 0.5 else 0.0  # 0.5 = success radius
        relabeled.append((state, action, new_reward, next_state, achieved_goal))
    return relabeled
```

Intrinsic motivation
Give the agent internal reward for exploring novel states, independent of the task reward. The agent is “curious.”
Curiosity-driven exploration (ICM)
The Intrinsic Curiosity Module gives bonus reward proportional to prediction error:
intrinsic_reward = || predicted_next_feature - actual_next_feature ||^2
If the agent can’t predict what will happen, the situation is novel and worth exploring.
```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    """Intrinsic Curiosity Module.

    Gives high reward for transitions the agent can't predict.
    """
    def __init__(self, obs_dim, action_dim, feature_dim=64):
        super().__init__()
        # Encode observations to features
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )
        # Forward model: predict next features from current features + action
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def intrinsic_reward(self, obs, action, next_obs):
        """Compute the curiosity reward (no gradients needed here)."""
        with torch.no_grad():
            phi_s = self.encoder(obs)
            phi_s_next = self.encoder(next_obs)
            # One-hot encode the action first if it is discrete
            pred_phi_next = self.forward_model(torch.cat([phi_s, action], dim=-1))
            # Prediction error = surprise = curiosity reward
            return 0.5 * (pred_phi_next - phi_s_next).pow(2).sum(dim=-1)

    def loss(self, obs, action, next_obs):
        """Train the forward model to reduce prediction error on visited states."""
        phi_s = self.encoder(obs)
        phi_s_next = self.encoder(next_obs)
        pred_phi_next = self.forward_model(torch.cat([phi_s, action], dim=-1))
        return nn.functional.mse_loss(pred_phi_next, phi_s_next.detach())
```

Random Network Distillation (RND)
Simpler than ICM. Uses a fixed random network as target, and a predictor network that learns to match it. States that have been seen many times have low prediction error (the predictor has learned them). Novel states have high prediction error.
```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation exploration bonus."""
    def __init__(self, obs_dim, feature_dim=64):
        super().__init__()
        # Fixed random target network (never trained)
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )
        for p in self.target.parameters():
            p.requires_grad = False
        # Predictor network (trained to match the target)
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def exploration_bonus(self, obs):
        """High bonus for novel states (the predictor hasn't learned them yet)."""
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # Without a detach, this same quantity serves as the predictor's loss
        return (pred_feat - target_feat).pow(2).sum(dim=-1)
```

Curriculum learning
Start with easy tasks, gradually increase difficulty. Humans learn this way — you don’t hand a calculus textbook to someone who hasn’t done arithmetic.
Manual curriculum
Design a sequence of progressively harder environments.
```python
def get_curriculum_env(difficulty):
    """Create environments of increasing difficulty."""
    import gymnasium as gym
    if difficulty == 0:
        # Easy: low-dimensional state, discrete actions
        return gym.make("CartPole-v1")
    elif difficulty == 1:
        # Medium: discrete actions, harder dynamics
        return gym.make("LunarLander-v2")
    elif difficulty == 2:
        # Hard: continuous control
        return gym.make("BipedalWalker-v3")
    raise ValueError(f"Unknown difficulty: {difficulty}")

def train_with_curriculum(agent, difficulties, steps_per_level=100_000):
    """Train an agent through progressive difficulty levels.

    Assumes an SB3-style agent with set_env() and learn().
    """
    for diff in difficulties:
        env = get_curriculum_env(diff)
        print(f"Training on difficulty {diff}...")
        agent.set_env(env)
        agent.learn(total_timesteps=steps_per_level)
        print(f"Difficulty {diff} complete. Moving to next level.")
```

Domain randomization
Instead of fixed curriculum, randomize environment parameters during training. The agent learns to be robust to variation.
```python
import numpy as np

def randomize_env_params():
    """Randomize environment parameters for robust training.

    Used in sim-to-real transfer (see Tutorial - Sim-to-Real Transfer).
    """
    params = {
        "gravity": np.random.uniform(8.0, 12.0),  # Earth ~9.8
        "friction": np.random.uniform(0.5, 1.5),
        "wind": np.random.uniform(-2.0, 2.0),
        "obs_noise_std": np.random.uniform(0.0, 0.1),
    }
    return params
```

Automatic curriculum generation
Let the environment difficulty adapt to the agent’s performance:
- If the agent succeeds > 80% of the time → make it harder
- If the agent succeeds < 20% of the time → make it easier
- Keep the agent in the “zone of proximal development”
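The rule above can be sketched as a small difficulty controller. The thresholds, step size, and difficulty range are the hypothetical knobs here; real systems tune them per task.

```python
def update_difficulty(difficulty, success_rate, step=0.1,
                      min_diff=0.0, max_diff=1.0,
                      upper=0.8, lower=0.2):
    """Keep the agent near its edge of competence.

    success_rate: fraction of recent evaluation episodes that succeeded.
    All thresholds are illustrative and task-dependent.
    """
    if success_rate > upper:          # too easy -> make it harder
        return min(max_diff, difficulty + step)
    if success_rate < lower:          # too hard -> make it easier
        return max(min_diff, difficulty - step)
    return difficulty                 # in the zone: leave it alone
```

Called after each evaluation round, this keeps the success rate oscillating inside the [lower, upper] band rather than saturating at 0% or 100%.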
RLHF: human feedback as reward
Reinforcement Learning from Human Feedback. Instead of a programmatic reward, use human preferences.
1. Generate two outputs (a, b) from the model
2. Human says: "a is better than b"
3. Train a reward model to predict human preferences
4. Use the reward model as the RL reward signal
5. Optimize the policy with PPO
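Step 3 is typically a Bradley-Terry preference model. A minimal sketch, assuming outputs have already been mapped to fixed-size feature vectors (the network sizes are placeholders):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scalar reward head over a fixed-size representation of an output."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(reward_model, preferred, rejected):
    """Bradley-Terry loss: maximize P(preferred beats rejected)
    = sigmoid(r_preferred - r_rejected)."""
    r_a = reward_model(preferred)
    r_b = reward_model(rejected)
    return -nn.functional.logsigmoid(r_a - r_b).mean()
```

Minimizing this loss pushes the reward gap r_a - r_b up on labeled pairs; the trained model then scores outputs in step 4.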
This is how ChatGPT, Claude, and most modern language models are trained. See RLHF and Alignment for details.
The challenge: the reward model is a learned proxy for human preferences. If it’s wrong (or if users have systematic biases), the agent hacks the reward model instead of actually being helpful. This is reward hacking at the meta-level.
Practical advice
- Start with the simplest reward that could work. Dense, hand-crafted, explicit. Only add complexity when the simple version fails.
- Watch your agent. Don’t just look at reward curves. Visualize the behavior. Reward hacking is invisible in metrics.
- Use potential-based shaping if you need to guide the agent. It’s the only provably safe shaping method.
- Multiple reward terms? Normalize them. If term A ranges [-100, 100] and term B ranges [-1, 1], term A dominates and term B is ignored.
- Curriculum helps more than reward shaping in many cases. Making the task easier to explore is better than adding artificial rewards.
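The normalization bullet can be implemented with a running scale estimate per reward term (this sketch uses Welford's online variance algorithm):

```python
class RunningScale:
    """Track a running std per reward term so differently-scaled terms
    (e.g. one in [-100, 100], another in [-1, 1]) contribute comparably."""
    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.eps = eps

    def update(self, x):
        # Welford's online update for mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = (self.m2 / max(self.count, 1)) ** 0.5
        return x / (std + self.eps)
```

Keep one `RunningScale` per term, `update` it every step, and sum the normalized terms; the explicit weights then express relative priority instead of compensating for units.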
Self-test questions
- What is Goodhart’s Law, and how does it apply to reward design in RL?
- Why does potential-based reward shaping preserve the optimal policy, while arbitrary shaping might not?
- Give three examples of reward hacking. What do they have in common?
- When would you prefer intrinsic motivation (ICM/RND) over reward shaping?
- What is the relationship between RLHF and reward hacking?
Exercises
- Shaped vs sparse: Train PPO on a navigation task with sparse reward (+1 at goal) and with shaped reward (-distance per step). Compare learning curves and final behavior. Does the shaped reward lead to different behavior than intended?
- Curiosity exploration: Implement RND and add it as exploration bonus in a sparse-reward environment (MountainCar-v0 is a classic choice). Compare exploration behavior with and without curiosity.
- Design a curriculum: For the BipedalWalker-v3 environment, design a 3-stage curriculum (easy → medium → hard terrain). Train with and without curriculum. Report learning speed difference.
Links
- RL Fundamentals — reward function in the MDP framework
- Actor-Critic and PPO — PPO uses the reward you design
- Multi-Agent RL — reward design is harder in multi-agent (credit assignment)
- RLHF and Alignment — human feedback as reward
- Tutorial - PPO from Scratch — implement the agent that uses these rewards
- Tutorial - Sim-to-Real Transfer — domain randomization as curriculum