RL Fundamentals

What

Reinforcement Learning is a framework where an agent learns to make decisions by interacting with an environment. Unlike supervised learning (which needs labeled examples), RL learns from reward signals: take an action, observe the outcome, adjust the strategy. The formal foundation is the Markov Decision Process (MDP), and the core challenge is the exploration-exploitation tradeoff.

Why It Matters

  • Games: AlphaGo, OpenAI Five, AlphaStar — superhuman performance through self-play
  • Robotics: learning locomotion, manipulation, and dexterous tasks from simulation
  • LLM alignment: RLHF (Reinforcement Learning from Human Feedback) is how ChatGPT, Claude, and others learn to follow instructions
  • Recommendation & ads: framing user interaction as a sequential decision problem
  • Operations research: inventory management, network routing, scheduling

RL is the ML paradigm built for sequential decision-making under uncertainty with delayed rewards — supervised learning has no mechanism for crediting an action whose payoff arrives many steps later.

How It Works

The Agent-Environment Loop

         action a_t
Agent ──────────────────→ Environment
  ↑                           │
  │     state s_{t+1}         │
  │     reward r_{t+1}        │
  └───────────────────────────┘

At each timestep t:
1. Agent observes state s_t
2. Agent chooses action a_t according to its policy
3. Environment transitions to s_{t+1} and emits reward r_{t+1}
4. Agent updates its policy to maximize cumulative reward

Markov Decision Process (MDP)

The formal framework. An MDP is a tuple $(S, A, T, R, \gamma)$:

Component   Symbol              Description
States      $S$                 All possible situations the agent can be in
Actions     $A$                 All possible moves the agent can make
Transition  $T(s'|s,a)$         Probability of landing in state $s'$ after taking action $a$ in state $s$
Reward      $R(s,a)$            Immediate reward for taking action $a$ in state $s$
Discount    $\gamma \in [0,1]$  How much to value future rewards vs immediate

The Markov property: the future depends only on the current state, not on the history. $T(s'|s,a)$ is all you need — no memory of past states required.
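To make the tuple concrete, here is a toy two-state MDP written out as plain arrays — a hypothetical sketch ("cool"/"hot" states, "slow"/"fast" actions), not a Gymnasium environment:

```python
import numpy as np

# A toy 2-state, 2-action MDP (hypothetical example):
# states: 0 = "cool", 1 = "hot"; actions: 0 = "slow", 1 = "fast".
n_states, n_actions = 2, 2

# T[s, a, s'] = probability of moving to s' after action a in state s
T = np.array([
    [[1.0, 0.0],   # cool, slow -> stay cool
     [0.5, 0.5]],  # cool, fast -> 50/50 risk of overheating
    [[0.5, 0.5],   # hot, slow  -> 50/50 chance of cooling down
     [0.0, 1.0]],  # hot, fast  -> stay hot
])

# R[s, a] = immediate reward
R = np.array([
    [1.0, 2.0],    # cool: fast earns more
    [1.0, -10.0],  # hot: going fast is penalized
])

gamma = 0.9

# The Markov property in code: the next-state distribution is a function
# of (s, a) only -- no history argument anywhere.
rng = np.random.default_rng(0)
def step(s, a):
    s_next = rng.choice(n_states, p=T[s, a])
    return s_next, R[s, a]

print(step(0, 1))
```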

Policy

The agent’s strategy. Maps states to actions.

  • Deterministic: $\pi(s) = a$ — one action per state
  • Stochastic: $\pi(a|s)$ — probability distribution over actions

The goal of RL: find the policy $\pi^*$ that maximizes expected cumulative discounted reward:

π* = argmax_π E [Σ_{t=0}^{∞} γ^t · r_t | π]
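The objective above is easy to evaluate for a single finite trajectory; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A sparse-reward episode: nothing until the goal at the last step.
rewards = [0.0] * 10 + [1.0]
print(discounted_return(rewards, gamma=0.99))  # 0.99**10 ≈ 0.904
print(discounted_return(rewards, gamma=0.5))   # 0.5**10  ≈ 0.001
```

Note how a small $\gamma$ crushes the value of that distant reward — this is why the discount factor matters so much in sparse-reward tasks.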

Value Functions

State-value function $V^\pi(s)$: expected return starting from state $s$ and following policy $\pi$:

V^π(s) = E_π [Σ_{t=0}^{∞} γ^t · r_t | s_0 = s]

Action-value function $Q^\pi(s,a)$: expected return from taking action $a$ in state $s$, then following $\pi$:

Q^π(s,a) = E_π [Σ_{t=0}^{∞} γ^t · r_t | s_0 = s, a_0 = a]

The relationship: $V^\pi(s) = \sum_a \pi(a|s) \cdot Q^\pi(s,a)$
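That relationship is one line of NumPy (the Q-values and policy below are hypothetical):

```python
import numpy as np

# Hypothetical 2-state, 2-action example: a Q-table and a stochastic policy.
Q = np.array([[1.0, 3.0],
              [0.0, 2.0]])      # Q[s, a]
pi = np.array([[0.25, 0.75],
               [0.50, 0.50]])   # pi[s, a] = π(a|s), rows sum to 1

# V^π(s) = Σ_a π(a|s) · Q^π(s,a) -- an expectation over the policy's actions
V = (pi * Q).sum(axis=1)
print(V)  # [2.5 1. ]
```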

Bellman Equations

The recursive structure that makes RL tractable. Current value = immediate reward + discounted future value:

V^π(s) = Σ_a π(a|s) · [R(s,a) + γ · Σ_{s'} T(s'|s,a) · V^π(s')]

Q^π(s,a) = R(s,a) + γ · Σ_{s'} T(s'|s,a) · Σ_{a'} π(a'|s') · Q^π(s',a')

For the optimal policy:

V*(s) = max_a [R(s,a) + γ · Σ_{s'} T(s'|s,a) · V*(s')]
Q*(s,a) = R(s,a) + γ · Σ_{s'} T(s'|s,a) · max_{a'} Q*(s',a')

These are the equations that Q-learning and policy gradient methods try to solve.
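Because the first Bellman equation is linear in $V^\pi$, a known tabular MDP can be policy-evaluated exactly with a single linear solve — a sketch, using a random MDP purely as a sanity check:

```python
import numpy as np

def policy_evaluation_exact(T, R, pi, gamma=0.99):
    """Solve V^π = R_π + γ T_π V^π as a linear system.

    T: (S, A, S) transition probabilities, R: (S, A) rewards,
    pi: (S, A) policy probabilities.
    """
    n_states = T.shape[0]
    # Marginalize out the action under π to get per-state reward/dynamics
    R_pi = (pi * R).sum(axis=1)            # (S,)
    T_pi = np.einsum("sa,sap->sp", pi, T)  # (S, S)
    # Rearranged Bellman equation: (I - γ T_π) V = R_π
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)

# Sanity check on a random MDP: the solution satisfies the Bellman equation
rng = np.random.default_rng(0)
S, A = 4, 2
T = rng.random((S, A, S)); T /= T.sum(axis=2, keepdims=True)
R = rng.random((S, A))
pi = np.full((S, A), 0.5)
V = policy_evaluation_exact(T, R, pi, gamma=0.9)
bellman_rhs = (pi * (R + 0.9 * (T @ V))).sum(axis=1)
assert np.allclose(V, bellman_rhs)
```

For large state spaces the $O(S^3)$ solve is impractical, which is exactly why iterative methods (value iteration, TD learning) exist.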

Exploration vs Exploitation

The fundamental tension in RL. The agent must balance:

  • Exploit: use the best known action (maximize short-term reward)
  • Explore: try new actions to discover potentially better strategies (maximize long-term reward)

Common strategies:

Strategy           How it works
Epsilon-greedy     Random action with probability $\varepsilon$, best action otherwise
Decaying epsilon   Start with high $\varepsilon$ (explore), decay toward 0 (exploit)
UCB                Choose action with highest $Q + c\sqrt{\ln t / N}$ (optimism under uncertainty)
Boltzmann/softmax  Sample actions proportional to $e^{Q(s,a)/\tau}$ (temperature-based)
Entropy bonus      Add $\beta H(\pi)$ to the reward (PPO, SAC style)
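The epsilon-greedy and Boltzmann strategies fit in a few lines (a sketch; `Q_row` is a hypothetical row of a Q-table):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q_row, epsilon):
    """Random action with probability ε, greedy action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))
    return int(np.argmax(Q_row))

def softmax_action(Q_row, tau=1.0):
    """Boltzmann exploration: sample proportional to exp(Q/τ)."""
    prefs = np.exp((Q_row - Q_row.max()) / tau)  # subtract max for stability
    return int(rng.choice(len(Q_row), p=prefs / prefs.sum()))

Q_row = np.array([0.1, 0.5, 0.2])
print(epsilon_greedy(Q_row, epsilon=0.1))
print(softmax_action(Q_row, tau=0.5))
```

As $\tau \to 0$ the softmax becomes greedy; as $\tau \to \infty$ it becomes uniform — one knob trades off the same tension that $\varepsilon$ does.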

Taxonomy of RL Algorithms

RL Algorithms
├── Value-based: learn Q(s,a), derive policy from it
│   ├── Q-Learning (tabular)
│   └── DQN (neural network)
├── Policy-based: learn π(a|s) directly
│   ├── REINFORCE
│   └── PPO
├── Actor-Critic: learn both π and V
│   ├── A2C/A3C
│   ├── PPO (also here)
│   └── SAC
└── Model-based: learn the transition function T
    ├── Dyna-Q
    ├── MuZero
    └── World models

Code Example

Gymnasium Basics + Value Iteration

import gymnasium as gym
import numpy as np
 
# --- Interact with an environment ---
env = gym.make("FrozenLake-v1", is_slippery=False)
state, _ = env.reset()
 
for step in range(100):
    action = env.action_space.sample()  # random policy
    next_state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, _ = env.reset()
    else:
        state = next_state
 
# --- Value Iteration (solves MDP exactly when T and R are known) ---
def value_iteration(env, gamma=0.99, theta=1e-8):
    """Find optimal value function and policy for a tabular MDP."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    V = np.zeros(n_states)

    # Access transition dynamics: env.unwrapped.P[s][a]
    # returns list of (prob, next_state, reward, done)
    P = env.unwrapped.P

    def action_values(s):
        # One-step lookahead: Q(s,a) = Σ_{s'} T(s'|s,a) · [R + γ·V(s')]
        return [sum(prob * (reward + gamma * V[s_next] * (1 - done))
                    for prob, s_next, reward, done in P[s][a])
                for a in range(n_actions)]

    while True:
        delta = 0
        for s in range(n_states):
            v = V[s]
            V[s] = max(action_values(s))  # Bellman optimality update
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break

    # Extract the greedy policy from the converged value function
    policy = np.array([np.argmax(action_values(s)) for s in range(n_states)],
                      dtype=int)
    return V, policy
 
env = gym.make("FrozenLake-v1", is_slippery=False)
V, policy = value_iteration(env)
actions = ["←", "↓", "→", "↑"]
print("Optimal policy:")
for i in range(4):
    print([actions[policy[i*4 + j]] for j in range(4)])
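Value iteration needs the full transition table `P`. For contrast, the model-free, value-based branch of the taxonomy can be sketched with tabular Q-learning; to keep the sketch dependency-free, the environment here is a hand-rolled 5-state chain (reward 1 at the rightmost state) rather than FrozenLake, and the hyperparameters are illustrative:

```python
import numpy as np

# Tabular Q-learning sketch (model-free, value-based, off-policy).
N_STATES, GOAL = 5, 4
rng = np.random.default_rng(0)

def step(s, a):
    """a=0 moves left, a=1 moves right; the episode ends at the goal."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s_next, float(s_next == GOAL), s_next == GOAL

Q = np.zeros((N_STATES, 2))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    s, done = 0, False
    while not done:
        # ε-greedy with random tie-breaking (ties matter early, when Q is all zeros)
        if rng.random() < epsilon:
            a = int(rng.integers(2))
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s_next, r, done = step(s, a)
        # Off-policy TD update toward the Bellman optimality target
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print("Greedy policy (1 = right):", np.argmax(Q, axis=1))
```

Unlike value iteration, no transition probabilities appear anywhere — the Bellman optimality backup is estimated from sampled `(s, a, r, s')` transitions alone.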

Key Tradeoffs

Choice                             Tradeoff
$\gamma$ high (0.99) vs low (0.9)  Far-sighted (long horizon) vs myopic (short horizon)
Model-free vs model-based          No environment model needed vs more sample-efficient
On-policy vs off-policy            Stable but wasteful vs efficient but harder to stabilize
Tabular vs function approx         Exact solution but limited scale vs scales but may diverge
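The first row can be made concrete: $\gamma$ implies an effective planning horizon of roughly $1/(1-\gamma)$ steps, beyond which rewards are discounted to almost nothing:

```python
# The discount factor sets an effective planning horizon of about 1/(1-γ):
# a reward arriving that many steps away is discounted to roughly 1/e ≈ 0.37.
for gamma in [0.5, 0.9, 0.99, 0.999]:
    horizon = 1 / (1 - gamma)
    weight = gamma ** horizon  # discount applied at the horizon
    print(f"γ={gamma}: horizon ≈ {horizon:.0f} steps, "
          f"reward there weighted by {weight:.2f}")
```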

Exercises

  1. Run value iteration on FrozenLake-v1 with is_slippery=True. How does the optimal policy change compared to the deterministic version? Why?
  2. Implement epsilon-greedy action selection. Run it on FrozenLake with a tabular Q-learning agent. Plot cumulative reward vs episode.
  3. Experiment with discount factors: $\gamma = 0.5, 0.9, 0.99, 0.999$. How does the learned policy change? At what $\gamma$ does the agent start caring about the far-away goal?
  4. Modify the code to track how many unique state-action pairs the agent visits. How does epsilon affect exploration coverage?

Self-Test Questions

  1. State the Markov property in one sentence. Why is it important for RL?
  2. What is the difference between $V(s)$ and $Q(s,a)$? If you know $Q$, how do you get $V$?
  3. Write the Bellman optimality equation for $Q^*$. Explain each term.
  4. Why is exploration necessary? Give an example where pure exploitation fails.
  5. What is the discount factor $\gamma$ and what happens at the extremes ($\gamma = 0$ vs $\gamma = 1$)?