Q-Learning and DQN

What

Value-based RL algorithms that learn $Q(s, a)$ — the expected discounted return from taking action $a$ in state $s$ and then acting optimally. Q-Learning (Watkins, 1989) stores Q-values in a table. DQN (Mnih et al., 2015) replaces the table with a neural network, enabling RL on high-dimensional inputs like raw pixels. DQN was the first RL algorithm to achieve human-level performance on Atari games.

Why It Matters

  • Off-policy learning: Q-learning learns from old data (replay buffer), making it far more sample-efficient than on-policy methods like REINFORCE
  • Discrete action spaces: the natural choice for problems with a fixed set of actions (games, routing, recommendation)
  • Foundation for modern algorithms: Double DQN, Dueling DQN, Rainbow, and distributional RL all build on DQN
  • Historical significance: DeepMind’s DQN paper (Nature 2015) reignited interest in deep RL, showing a single algorithm could learn 49 Atari games from raw pixels

How It Works

Q-Learning (Tabular)

Maintain a table $Q[s][a]$ for every state-action pair. After each transition $(s, a, r, s')$:

$$Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \cdot \underbrace{\Big[\,\overbrace{r + \gamma \max_{a'} Q(s', a')}^{\text{TD target}} - Q(s, a)\,\Big]}_{\text{TD error } \delta}$$
  • $\alpha$: learning rate (how fast to update)
  • $\gamma$: discount factor (how much to value future rewards)
  • $r + \gamma \max_{a'} Q(s', a')$: the TD target — best estimate of the true Q-value
  • The update moves $Q(s,a)$ toward the TD target by a step of $\alpha$

Exploration: use epsilon-greedy — with probability $\varepsilon$ take a random action, otherwise take $\arg\max_a Q(s,a)$. Decay $\varepsilon$ over time: explore a lot early, exploit later.

Convergence: tabular Q-learning provably converges to optimal Q-values if every state-action pair is visited infinitely often and $\alpha$ decays appropriately. In practice, it works well for small state spaces (< 10,000 states).
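The update rule and epsilon-greedy loop above fit in a few lines. Below is a minimal sketch of a tabular agent; the chain MDP (`chain_step`) is a made-up toy environment for illustration, not from the text — states 0 through 4 on a line, with a reward of 1 for reaching state 4:

```python
import numpy as np

def q_learning(n_states, n_actions, step_fn, episodes=2000,
               alpha=0.1, gamma=0.99, eps_start=1.0, eps_end=0.05):
    """Tabular Q-learning with epsilon-greedy exploration.
    step_fn(s, a) -> (next_state, reward, done)."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for ep in range(episodes):
        eps = max(eps_end, eps_start * (1 - ep / episodes))  # linear decay
        s, done = 0, False
        while not done:
            # Epsilon-greedy: random action with prob eps, else greedy
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = step_fn(s, a)
            # TD update: move Q(s,a) toward the TD target by a step of alpha
            td_target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q

# Toy chain MDP: action 1 moves right, action 0 moves left;
# reaching state 4 gives reward 1 and ends the episode.
def chain_step(s, a):
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    done = s_next == 4
    return s_next, float(done), done

Q = q_learning(n_states=5, n_actions=2, step_fn=chain_step)
print(Q.argmax(axis=1))  # greedy policy: "go right" in states 0-3
```

Note that the learned Q-values approach the discounted optimal values ($Q(s, 1) \to \gamma^{3-s}$ here), and the gap between actions can be small — another reason exploration must not stop too early.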

DQN: Scaling with Neural Networks

For large state spaces (images, continuous observations), a table is impossible. DQN approximates $Q(s, a; \theta)$ with a neural network.

The naive approach fails: training a neural net on sequential RL data is unstable because (1) consecutive transitions are correlated and (2) the target changes as the network updates. DQN solves both problems:

Experience replay: store transitions $(s, a, r, s', done)$ in a buffer. Sample random minibatches to train. This breaks temporal correlation and reuses data (sample efficiency).

Target network: maintain a separate copy of the network $\theta^-$ for computing TD targets. Update $\theta^-$ every C steps by copying $\theta$. This prevents the moving-target problem.

$$\text{Loss} = \Big(\underbrace{r + \gamma \max_{a'} Q(s', a'; \theta^-)}_{\text{target (frozen)}} - \underbrace{Q(s, a; \theta)}_{\text{prediction}}\Big)^2$$

DQN Improvements

| Variant | Key idea | Why it helps |
| --- | --- | --- |
| Double DQN | Use online net to select action, target net to evaluate | Fixes overestimation bias |
| Dueling DQN | Separate V(s) and A(s,a) streams | Better value estimation for states where action choice doesn’t matter |
| Prioritized replay | Sample transitions with high TD error more often | Focus learning on surprising transitions |
| Rainbow | Combine all of the above + distributional + noisy nets | State-of-the-art Atari performance |
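Of these, Double DQN is the smallest change: only the TD-target computation differs, with the online network selecting the next action and the target network evaluating it. A hedged sketch of the two variants, assuming batched PyTorch tensors shaped like those in the CartPole code in the Code Example section (these helper names are illustrative, not from a library):

```python
import torch

def vanilla_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    # Vanilla DQN: the target net both selects and evaluates the next action,
    # so its own overestimates get propagated into the target.
    with torch.no_grad():
        max_next_q = target_net(s_next).max(dim=1).values
        return r + gamma * max_next_q * (1 - done)

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    # Double DQN: the online net picks the argmax action,
    # the target net scores it — decoupling selection from evaluation.
    with torch.no_grad():
        best_actions = q_net(s_next).argmax(dim=1, keepdim=True)
        next_q = target_net(s_next).gather(1, best_actions).squeeze(1)
        return r + gamma * next_q * (1 - done)
```

When the two networks happen to share weights, both targets coincide; the bias reduction comes from their weights diverging between target-network syncs.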

Code Example

DQN on CartPole (PyTorch)

import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
import numpy as np
from collections import deque
import random
 
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )
 
    def forward(self, x):
        return self.net(x)
 
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)
 
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
 
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.tensor(np.array(states), dtype=torch.float32),
                torch.tensor(actions, dtype=torch.long),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(np.array(next_states), dtype=torch.float32),
                torch.tensor(dones, dtype=torch.float32))
 
    def __len__(self):
        return len(self.buffer)
 
def train_dqn(episodes=300, gamma=0.99, lr=1e-3, eps_start=1.0,
              eps_end=0.01, eps_decay=0.995, batch_size=64,
              target_update=10):
    env = gym.make("CartPole-v1")
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
 
    q_net = QNetwork(state_dim, action_dim)
    target_net = QNetwork(state_dim, action_dim)
    target_net.load_state_dict(q_net.state_dict())
 
    optimizer = optim.Adam(q_net.parameters(), lr=lr)
    buffer = ReplayBuffer()
    epsilon = eps_start
 
    for ep in range(episodes):
        state, _ = env.reset()
        total_reward = 0
 
        done = False
        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q_vals = q_net(torch.tensor(state, dtype=torch.float32))
                    action = q_vals.argmax().item()
 
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            buffer.push(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
 
            # Train on a batch from replay buffer
            if len(buffer) >= batch_size:
                s, a, r, s_next, d = buffer.sample(batch_size)
 
                # Current Q-values for chosen actions
                q_values = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
 
                # Target Q-values (from frozen target network)
                with torch.no_grad():
                    max_next_q = target_net(s_next).max(dim=1).values
                    targets = r + gamma * max_next_q * (1 - d)
 
                loss = nn.functional.mse_loss(q_values, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
 
        # Decay epsilon
        epsilon = max(eps_end, epsilon * eps_decay)
 
        # Update target network
        if (ep + 1) % target_update == 0:
            target_net.load_state_dict(q_net.state_dict())
 
        if (ep + 1) % 50 == 0:
            print(f"Episode {ep+1}: reward={total_reward:.0f}, eps={epsilon:.3f}")
 
    env.close()
    return q_net
 
q_net = train_dqn()

Key Tradeoffs

| Aspect | Q-Learning / DQN | Policy Gradient |
| --- | --- | --- |
| Sample efficiency | High (off-policy, replay buffer) | Low (on-policy, fresh data) |
| Action space | Discrete only | Continuous or discrete |
| Stability | Target network needed | Clipping (PPO) needed |
| Exploration | Epsilon-greedy (explicit) | Stochastic policy (implicit) |
| Overestimation | Common (fix with Double DQN) | Not an issue |

Common Pitfalls

  • Replay buffer too small: if the buffer only holds recent data, you lose diversity and overfit to recent experience. Use at least 10K-100K transitions
  • No target network: without a frozen target, the Q-network chases a moving target and diverges. Update target every 100-1000 steps
  • Epsilon decays too fast: the agent stops exploring before finding good strategies. Decay over thousands of episodes, not hundreds
  • Reward clipping without thought: DQN’s original paper clipped all rewards to [-1, 1]. This works for Atari but destroys information in other environments
  • Applying DQN to continuous actions: DQN needs $\max_a Q(s,a)$ over all actions. With continuous actions, this max is intractable. Use Policy Gradient Methods or DDPG instead
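As a quick sanity check on the epsilon-decay pitfall: with multiplicative decay, epsilon hits its floor after $\lceil \log(\varepsilon_{\text{end}}/\varepsilon_{\text{start}}) / \log(\text{decay}) \rceil$ episodes. A small illustrative helper (the function name is ours, not from any library):

```python
import math

def episodes_until_floor(eps_start, eps_end, decay):
    # Smallest n with eps_start * decay**n <= eps_end
    return math.ceil(math.log(eps_end / eps_start) / math.log(decay))

print(episodes_until_floor(1.0, 0.01, 0.995))  # 919 episodes of exploration
print(episodes_until_floor(1.0, 0.01, 0.9))    # only 44 — almost certainly too fast
```

Running this kind of back-of-the-envelope check before training makes it obvious whether the agent will still be exploring when it first encounters good strategies.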

Exercises

  1. Implement tabular Q-learning on FrozenLake-v1 (4x4). Print the learned Q-table and the derived policy. Does it solve the environment?
  2. Add Double DQN to the code above: use q_net to select the best action but target_net to evaluate it. Compare performance against vanilla DQN
  3. Implement a simple prioritized replay buffer: sample transitions proportional to their TD error. Does it speed up learning on CartPole?
  4. Visualize the Q-values during training: for a fixed set of states, plot how $Q(s, a)$ evolves over episodes. When do they stabilize?

Self-Test Questions

  1. Write out the Q-learning update rule and explain each term. What is the TD error?
  2. Why does DQN need experience replay? What happens if you train on consecutive transitions?
  3. What problem does the target network solve? Why not just use the same network?
  4. Why can’t DQN handle continuous action spaces? What algorithms can?
  5. What is overestimation bias in Q-learning and how does Double DQN fix it?