RL Fundamentals
What
Reinforcement Learning is a framework where an agent learns to make decisions by interacting with an environment. Unlike supervised learning (which needs labeled examples), RL learns from reward signals: take an action, observe the outcome, adjust the strategy. The formal foundation is the Markov Decision Process (MDP), and the core challenge is the exploration-exploitation tradeoff.
Why It Matters
- Games: AlphaGo, OpenAI Five, AlphaStar — superhuman performance through self-play
- Robotics: learning locomotion, manipulation, and dexterous tasks from simulation
- LLM alignment: RLHF (Reinforcement Learning from Human Feedback) is how ChatGPT, Claude, and others learn to follow instructions
- Recommendation & ads: framing user interaction as a sequential decision problem
- Operations research: inventory management, network routing, scheduling
RL is the standard ML paradigm for sequential decision-making under uncertainty with delayed rewards.
How It Works
The Agent-Environment Loop
```
             action a_t
  Agent ──────────────────→ Environment
    ↑                            │
    │      state s_{t+1}         │
    │      reward r_{t+1}        │
    └────────────────────────────┘
```
At each timestep t:
1. Agent observes state s_t
2. Agent chooses action a_t according to its policy
3. Environment transitions to s_{t+1} and emits reward r_{t+1}
4. Agent updates its policy to maximize cumulative reward
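The four steps above can be sketched as a bare loop. `ToyEnv` is a made-up two-state environment for illustration, not part of any library; a real agent would update its policy at step 4 instead of acting randomly:

```python
import random

class ToyEnv:
    """Hypothetical 2-state environment: reaching state 1 earns reward 1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves to the goal state, action 0 stays put
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0
        done = self.state == 1
        return self.state, reward, done

env = ToyEnv()
state = env.reset()                          # 1. observe s_t
total_reward = 0.0
for t in range(10):
    action = random.choice([0, 1])           # 2. choose a_t (here: random policy)
    state, reward, done = env.step(action)   # 3. environment emits s_{t+1}, r_{t+1}
    total_reward += reward                   # 4. a learning agent would update here
    if done:
        state = env.reset()
```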
Markov Decision Process (MDP)
The formal framework. An MDP is a tuple $(S, A, T, R, \gamma)$:
| Component | Symbol | Description |
|---|---|---|
| States | $S$ | All possible situations the agent can be in |
| Actions | $A$ | All possible moves the agent can make |
| Transition | $T(s' \mid s,a)$ | Probability of reaching state $s'$ after taking action $a$ in state $s$ |
| Reward | $R(s,a)$ | Immediate reward for taking action $a$ in state $s$ |
| Discount | $\gamma \in [0,1]$ | How much to value future rewards vs immediate |
The Markov property: the future depends only on the current state, not on the history. $T(s'|s,a)$ is all you need — no memory of past states required.
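To make the tuple concrete, here is a hypothetical two-state, two-action MDP written as plain dictionaries (all numbers are invented for illustration). Each row $T(\cdot \mid s,a)$ must sum to 1, and it depends only on the current $s$ and $a$ — that is the Markov property in code:

```python
# Hypothetical 2-state, 2-action MDP: T[s][a] maps next states to probabilities
S = [0, 1]
A = [0, 1]
T = {
    0: {0: {0: 1.0},             # "stay": remain in state 0
        1: {0: 0.2, 1: 0.8}},    # "go": reach state 1 with prob 0.8
    1: {0: {1: 1.0},             # state 1 is absorbing
        1: {1: 1.0}},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 1.0}}   # reward 1 for acting in state 1
gamma = 0.9

# Each conditional distribution T(.|s,a) is a valid probability distribution
for s in S:
    for a in A:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```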
Policy
The agent’s strategy. Maps states to actions.
- Deterministic: $\pi(s) = a$ — one action per state
- Stochastic: $\pi(a|s)$ — a probability distribution over actions
The goal of RL: find the policy $\pi^*$ that maximizes expected cumulative discounted reward:
$$\pi^* = \arg\max_\pi \, \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t \, r_t \,\middle|\, \pi\right]$$
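The discounted sum inside the expectation can be computed directly for a fixed reward sequence; the rewards below are arbitrary example values:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0]                       # reward arrives two steps in the future
print(discounted_return(rewards, gamma=0.9))    # ≈ 0.81 (gamma^2 · 1): discounted twice
print(discounted_return(rewards, gamma=0.5))    # ≈ 0.25: a myopic agent values it far less
```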
Value Functions
State-value function $V^\pi(s)$: expected return starting from state $s$ and following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \, r_t \,\middle|\, s_0 = s\right]$$
Action-value function $Q^\pi(s,a)$: expected return from taking action $a$ in state $s$, then following $\pi$:
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \, r_t \,\middle|\, s_0 = s,\, a_0 = a\right]$$
The relationship: $V^\pi(s) = \sum_a \pi(a|s) \cdot Q^\pi(s,a)$
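This identity is easy to verify numerically. The $Q$-values and policy probabilities below are arbitrary illustrative numbers for a single state:

```python
# Hypothetical Q-values and a stochastic policy for a single state s
Q_s = {"left": 1.0, "right": 3.0}      # Q^pi(s, a)
pi_s = {"left": 0.25, "right": 0.75}   # pi(a|s), sums to 1

# V^pi(s) = sum over a of pi(a|s) * Q^pi(s, a)
V_s = sum(pi_s[a] * Q_s[a] for a in Q_s)
print(V_s)   # 2.5 = 0.25*1.0 + 0.75*3.0
```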
Bellman Equations
The recursive structure that makes RL tractable. Current value = immediate reward + discounted future value:
$$V^\pi(s) = \sum_a \pi(a|s)\left[R(s,a) + \gamma \sum_{s'} T(s'|s,a)\, V^\pi(s')\right]$$
$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} T(s'|s,a) \sum_{a'} \pi(a'|s')\, Q^\pi(s',a')$$
For the optimal policy:
$$V^*(s) = \max_a \left[R(s,a) + \gamma \sum_{s'} T(s'|s,a)\, V^*(s')\right]$$
$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} T(s'|s,a) \max_{a'} Q^*(s',a')$$
Value-based methods like Q-learning approximate the Bellman optimality equation for $Q^*$ directly, while actor-critic methods use the expectation equations to evaluate the current policy.
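The recursion is also an algorithm: repeatedly applying the Bellman expectation equation as an update rule (iterative policy evaluation) converges to $V^\pi$. A minimal sketch on a made-up two-state chain with a uniform random policy:

```python
# Hypothetical 2-state chain, 2 actions, uniform random policy pi(a|s) = 0.5.
# T[s][a] is a list of (prob, next_state, reward) tuples.
T = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},   # action 1 reaches the rewarding state
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 0.0)]},   # state 1 is absorbing, no more reward
}
gamma = 0.9
V = {0: 0.0, 1: 0.0}

# Sweep the Bellman expectation backup until the values stop changing
for _ in range(1000):
    delta = 0.0
    for s in T:
        v_new = sum(0.5 * sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
                    for a in T[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-10:
        break

print(V)  # V[1] = 0; V[0] ≈ 0.91, the fixed point of V[0] = 0.45*V[0] + 0.5
```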
Exploration vs Exploitation
The fundamental tension in RL. The agent must balance:
- Exploit: use the best known action (maximize short-term reward)
- Explore: try new actions to discover potentially better strategies (maximize long-term reward)
Common strategies:
| Strategy | How it works |
|---|---|
| Epsilon-greedy | Random action with probability $\varepsilon$, best action otherwise |
| Decaying epsilon | Start with high $\varepsilon$ (explore), decay toward 0 (exploit) |
| UCB | Choose action with highest $Q + c\sqrt{\ln t / N}$ (optimism under uncertainty) |
| Boltzmann/softmax | Sample actions proportional to $e^{Q(s,a)/\tau}$ (temperature-based) |
| Entropy bonus | Add $\beta H(\pi)$ to reward (PPO, SAC style) |
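Two of these strategies fit in a few lines. This is a sketch with arbitrary example Q-values, not tied to any particular library:

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, else the greedy (argmax) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, tau):
    """Sample an action with probability proportional to exp(Q/tau)."""
    m = max(q_values)  # subtract the max before exponentiating, for numerical stability
    prefs = [math.exp((q - m) / tau) for q in q_values]
    return random.choices(range(len(q_values)), weights=prefs)[0]

q = [0.1, 0.5, 0.2]
a = epsilon_greedy(q, epsilon=0.1)   # usually action 1, occasionally random
b = boltzmann(q, tau=0.5)            # higher tau -> closer to uniform sampling
```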
Taxonomy of RL Algorithms
```
RL Algorithms
├── Value-based: learn Q(s,a), derive policy from it
│   ├── Q-Learning (tabular)
│   └── DQN (neural network)
├── Policy-based: learn π(a|s) directly
│   ├── REINFORCE
│   └── PPO
├── Actor-Critic: learn both π and V
│   ├── A2C/A3C
│   ├── PPO (also here)
│   └── SAC
└── Model-based: learn the transition function T
    ├── Dyna-Q
    ├── MuZero
    └── World models
```
Code Example
Gymnasium Basics + Value Iteration
```python
import gymnasium as gym
import numpy as np

# --- Interact with an environment ---
env = gym.make("FrozenLake-v1", is_slippery=False)
state, _ = env.reset()
for step in range(100):
    action = env.action_space.sample()  # random policy
    next_state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, _ = env.reset()
    else:
        state = next_state

# --- Value Iteration (solves MDP exactly when T and R are known) ---
def value_iteration(env, gamma=0.99, theta=1e-8):
    """Find optimal value function and policy for a tabular MDP."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    V = np.zeros(n_states)
    # Access transition dynamics: env.unwrapped.P[s][a]
    # returns list of (prob, next_state, reward, done)
    P = env.unwrapped.P
    while True:
        delta = 0
        for s in range(n_states):
            v = V[s]
            # Bellman optimality update
            action_values = []
            for a in range(n_actions):
                q = sum(prob * (reward + gamma * V[s_next] * (1 - done))
                        for prob, s_next, reward, done in P[s][a])
                action_values.append(q)
            V[s] = max(action_values)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract policy from value function
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        action_values = []
        for a in range(n_actions):
            q = sum(prob * (reward + gamma * V[s_next] * (1 - done))
                    for prob, s_next, reward, done in P[s][a])
            action_values.append(q)
        policy[s] = np.argmax(action_values)
    return V, policy

env = gym.make("FrozenLake-v1", is_slippery=False)
V, policy = value_iteration(env)
actions = ["←", "↓", "→", "↑"]
print("Optimal policy:")
for i in range(4):
    print([actions[policy[i*4 + j]] for j in range(4)])
```

Key Tradeoffs
| Choice | Tradeoff |
|---|---|
| $\gamma$ high (0.99) vs low (0.9) | Far-sighted (long horizon) vs myopic (short horizon) |
| Model-free vs model-based | No environment model needed vs more sample-efficient |
| On-policy vs off-policy | Stable but wasteful vs efficient but harder to stabilize |
| Tabular vs function approx | Exact solution but limited scale vs scales but may diverge |
Exercises
- Run value iteration on FrozenLake-v1 with `is_slippery=True`. How does the optimal policy change compared to the deterministic version? Why?
- Implement epsilon-greedy action selection. Run it on FrozenLake with a tabular Q-learning agent. Plot cumulative reward vs episode.
- Experiment with discount factors: $\gamma = 0.5, 0.9, 0.99, 0.999$. How does the learned policy change? At what $\gamma$ does the agent start caring about the far-away goal?
- Modify the code to track how many unique state-action pairs the agent visits. How does epsilon affect exploration coverage?
Self-Test Questions
- State the Markov property in one sentence. Why is it important for RL?
- What is the difference between $V(s)$ and $Q(s,a)$? If you know $Q$, how do you get $V$?
- Write the Bellman optimality equation for $Q^*$. Explain each term.
- Why is exploration necessary? Give an example where pure exploitation fails.
- What is the discount factor $\gamma$ and what happens at the extremes ($\gamma = 0$ vs $\gamma = 1$)?
Links
- Q-Learning and DQN — value-based methods
- Policy Gradient Methods — policy-based methods
- Actor-Critic Methods — combining both approaches
- Multi-Armed Bandits — simplest RL setting (no states)
- Reinforcement Learning Roadmap — full study path