# Q-Learning and DQN

## What
Value-based RL algorithms that learn $Q(s, a)$ — the expected total future reward for taking action $a$ in state $s$ and then acting optimally. Q-Learning (Watkins, 1989) stores Q-values in a table. DQN (Mnih et al., 2015) replaces the table with a neural network, enabling RL on high-dimensional inputs like raw pixels. DQN was the first RL algorithm to achieve human-level performance on Atari games.
## Why It Matters
- Off-policy learning: Q-learning learns from old data (replay buffer), making it far more sample-efficient than on-policy methods like REINFORCE
- Discrete action spaces: the natural choice for problems with a fixed set of actions (games, routing, recommendation)
- Foundation for modern algorithms: Double DQN, Dueling DQN, Rainbow, and distributional RL all build on DQN
- Historical significance: DeepMind’s DQN paper (Nature 2015) reignited interest in deep RL, showing a single algorithm could learn 49 Atari games from raw pixels
## How It Works

### Q-Learning (Tabular)
Maintain a table $Q[s][a]$ for every state-action pair. After each transition $(s, a, r, s')$:
```
Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') − Q(s, a)]
                         \______ TD target _______/
                         \___________ TD error (δ) _________/
```

- $\alpha$: learning rate (how fast to update)
- $\gamma$: discount factor (how much to value future rewards)
- $r + \gamma \max_{a'} Q(s', a')$: the TD target — best estimate of the true Q-value
- The update moves $Q(s,a)$ toward the TD target by a step of $\alpha$
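Plugging in concrete numbers makes the update tangible. A one-step sketch (the values $\alpha = 0.5$, $\gamma = 0.9$, $r = 1$ are illustrative, not from the text):

```python
# One numeric Q-learning update with illustrative values.
alpha, gamma = 0.5, 0.9   # learning rate, discount factor
q_sa = 0.0                # current estimate Q(s, a)
r = 1.0                   # observed reward
max_next_q = 2.0          # max_a' Q(s', a')

td_target = r + gamma * max_next_q   # 1 + 0.9 * 2 = 2.8
td_error = td_target - q_sa          # 2.8 - 0 = 2.8
q_sa += alpha * td_error             # moves Q(s, a) halfway toward the target
print(q_sa)                          # ≈ 1.4
```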
Exploration: use epsilon-greedy — with probability $\varepsilon$ take a random action, otherwise take $\arg\max_a Q(s,a)$. Decay $\varepsilon$ over time: explore a lot early, exploit later.
Convergence: tabular Q-learning provably converges to optimal Q-values if every state-action pair is visited infinitely often and $\alpha$ decays appropriately. In practice, it works well for small state spaces (< 10,000 states).
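For small state spaces the whole algorithm fits in a few lines. A minimal sketch on a hypothetical 5-state corridor (the environment, its `step` helper, and all constants are mine, not from the text): state 4 is terminal with reward +1, so the optimal policy is to always move right.

```python
import random

random.seed(0)

# Hypothetical 5-state corridor: actions 0 = left, 1 = right,
# +1 reward on reaching state 4 (terminal). Walls clamp movement.
N_STATES = 5
alpha, gamma, eps = 0.1, 0.9, 0.3
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability eps, else act greedily
        a = random.randint(0, 1) if random.random() < eps else int(Q[s][1] > Q[s][0])
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s2]))  # TD target
        Q[s][a] += alpha * (target - Q[s][a])               # TD update
        s = s2

print([int(q[1] > q[0]) for q in Q[:4]])  # greedy policy per state (1 = right)
```

After a few hundred episodes the greedy policy points right in every non-terminal state, and the Q-values approach the discounted returns $\gamma^{3-s}$.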
### DQN: Scaling with Neural Networks
For large state spaces (images, continuous observations), a table is impossible. DQN approximates $Q(s, a; \theta)$ with a neural network.
Naive approach fails: training a neural net on sequential RL data is unstable because (1) consecutive transitions are correlated and (2) the target changes as the network updates. DQN solves both problems:
1. **Experience replay**: store transitions $(s, a, r, s', done)$ in a buffer. Sample random minibatches to train. This breaks temporal correlation and reuses data (sample efficiency).
2. **Target network**: maintain a separate copy of the network $\theta^-$ for computing TD targets. Update $\theta^-$ every $C$ steps by copying $\theta$. This prevents the moving-target problem.
```
Loss = (r + γ · max_a' Q(s', a'; θ⁻) − Q(s, a; θ))²
        \______ target (frozen) _____/  \ prediction /
```
### DQN Improvements
| Variant | Key idea | Why it helps |
|---|---|---|
| Double DQN | Use online net to select action, target net to evaluate | Fixes overestimation bias |
| Dueling DQN | Separate V(s) and A(s,a) streams | Better value estimation for states where action choice doesn’t matter |
| Prioritized replay | Sample transitions with high TD error more often | Focus learning on surprising transitions |
| Rainbow | Combine all of the above + distributional + noisy nets | State-of-the-art Atari performance |
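The Double DQN row is a one-line change to the target computation. A sketch contrasting the two targets (the function names are mine; `q_net` and `target_net` stand for any two networks mapping a batch of states to per-action Q-values, as in the CartPole example below):

```python
import torch

def dqn_target(target_net, r, s_next, d, gamma=0.99):
    """Vanilla DQN: target net both selects and evaluates the next action."""
    with torch.no_grad():
        max_next_q = target_net(s_next).max(dim=1).values  # select + evaluate
    return r + gamma * max_next_q * (1 - d)

def double_dqn_target(q_net, target_net, r, s_next, d, gamma=0.99):
    """Double DQN: online net selects, target net evaluates."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)      # select...
        next_q = target_net(s_next).gather(1, best_a).squeeze(1)  # ...evaluate
    return r + gamma * next_q * (1 - d)
```

Decoupling selection from evaluation means a single network's noisy overestimate of one action no longer gets both picked and trusted, which is the source of the overestimation bias.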
## Code Example

### DQN on CartPole (PyTorch)
```python
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
import numpy as np
from collections import deque
import random


class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x):
        return self.net(x)


class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.tensor(np.array(states), dtype=torch.float32),
                torch.tensor(actions, dtype=torch.long),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(np.array(next_states), dtype=torch.float32),
                torch.tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)


def train_dqn(episodes=300, gamma=0.99, lr=1e-3, eps_start=1.0,
              eps_end=0.01, eps_decay=0.995, batch_size=64,
              target_update=10):
    env = gym.make("CartPole-v1")
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    q_net = QNetwork(state_dim, action_dim)
    target_net = QNetwork(state_dim, action_dim)
    target_net.load_state_dict(q_net.state_dict())
    optimizer = optim.Adam(q_net.parameters(), lr=lr)
    buffer = ReplayBuffer()
    epsilon = eps_start

    for ep in range(episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q_vals = q_net(torch.tensor(state, dtype=torch.float32))
                    action = q_vals.argmax().item()

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            buffer.push(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

            # Train on a batch from replay buffer
            if len(buffer) >= batch_size:
                s, a, r, s_next, d = buffer.sample(batch_size)

                # Current Q-values for chosen actions
                q_values = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

                # Target Q-values (from frozen target network)
                with torch.no_grad():
                    max_next_q = target_net(s_next).max(dim=1).values
                    targets = r + gamma * max_next_q * (1 - d)

                loss = nn.functional.mse_loss(q_values, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # Decay epsilon
        epsilon = max(eps_end, epsilon * eps_decay)

        # Update target network
        if (ep + 1) % target_update == 0:
            target_net.load_state_dict(q_net.state_dict())

        if (ep + 1) % 50 == 0:
            print(f"Episode {ep+1}: reward={total_reward:.0f}, eps={epsilon:.3f}")

    env.close()
    return q_net


q_net = train_dqn()
```

## Key Tradeoffs
| Aspect | Q-Learning / DQN | Policy Gradient |
|---|---|---|
| Sample efficiency | High (off-policy, replay buffer) | Low (on-policy, fresh data) |
| Action space | Discrete only | Continuous or discrete |
| Stability | Target network needed | Clipping (PPO) needed |
| Exploration | Epsilon-greedy (explicit) | Stochastic policy (implicit) |
| Overestimation | Common (fix with Double DQN) | Not an issue |
## Common Pitfalls
- Replay buffer too small: if the buffer only holds recent data, you lose diversity and overfit to recent experience. Use at least 10K-100K transitions
- No target network: without a frozen target, the Q-network chases a moving target and diverges. Update target every 100-1000 steps
- Epsilon decays too fast: the agent stops exploring before finding good strategies. Decay over thousands of episodes, not hundreds
- Reward clipping without thought: DQN’s original paper clipped all rewards to [-1, 1]. This works for Atari but destroys information in other environments
- Applying DQN to continuous actions: DQN needs $\max_a Q(s,a)$ over all actions. With continuous actions, this max is intractable. Use Policy Gradient Methods or DDPG instead
## Exercises
- Implement tabular Q-learning on FrozenLake-v1 (4x4). Print the learned Q-table and the derived policy. Does it solve the environment?
- Add Double DQN to the code above: use `q_net` to select the best action but `target_net` to evaluate it. Compare performance against vanilla DQN
- Implement a simple prioritized replay buffer: sample transitions proportional to their TD error. Does it speed up learning on CartPole?
- Visualize the Q-values during training: for a fixed set of states, plot how $Q(s, a)$ evolves over episodes. When do they stabilize?
## Self-Test Questions
- Write out the Q-learning update rule and explain each term. What is the TD error?
- Why does DQN need experience replay? What happens if you train on consecutive transitions?
- What problem does the target network solve? Why not just use the same network?
- Why can’t DQN handle continuous action spaces? What algorithms can?
- What is overestimation bias in Q-learning and how does Double DQN fix it?
## Links
- RL Fundamentals — MDP, Bellman equation, value functions
- Policy Gradient Methods — the alternative to value-based methods
- Actor-Critic Methods — combining value and policy learning
- Multi-Armed Bandits — the simplest value-based problem
- Reinforcement Learning Roadmap