# Q-Learning and DQN

## What
Value-based RL algorithms that learn $Q(s, a)$ — the expected total future reward for taking action $a$ in state $s$ and then acting optimally. Q-Learning (Watkins, 1989) stores Q-values in a table. DQN (Mnih et al., 2015) replaces the table with a neural network, enabling RL on high-dimensional inputs like raw pixels. DQN was the first RL algorithm to achieve human-level performance on Atari games.
## Why It Matters
- Off-policy learning: Q-learning learns from old data (replay buffer), making it far more sample-efficient than on-policy methods like REINFORCE
- Discrete action spaces: the natural choice for problems with a fixed set of actions (games, routing, recommendation)
- Foundation for modern algorithms: Double DQN, Dueling DQN, Rainbow, and distributional RL all build on DQN
- Historical significance: DeepMind’s DQN paper (Nature 2015) reignited interest in deep RL, showing a single algorithm could learn 49 Atari games from raw pixels
## How It Works

### Q-Learning (Tabular)
Maintain a table $Q[s][a]$ for every state-action pair. After each transition $(s, a, r, s')$:
```
Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') − Q(s, a)]
                         \______ TD target _______/
                         \___________ TD error (δ) _________/
```

- $\alpha$: learning rate (how fast to update)
- $\gamma$: discount factor (how much to value future rewards)
- $r + \gamma \max_{a'} Q(s', a')$: the TD target — best estimate of the true Q-value
- The update moves $Q(s,a)$ toward the TD target by a step of $\alpha$
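Plugging in concrete numbers makes the update tangible. A one-step sketch (the values $\alpha = 0.5$, $\gamma = 0.9$, $r = 1$ are illustrative, not from the text):

```python
# One numeric Q-learning update with illustrative values.
alpha, gamma = 0.5, 0.9   # learning rate, discount factor
q_sa = 0.0                # current estimate Q(s, a)
r = 1.0                   # observed reward
max_next_q = 2.0          # max_a' Q(s', a')

td_target = r + gamma * max_next_q   # 1 + 0.9 * 2 = 2.8
td_error = td_target - q_sa          # 2.8 - 0 = 2.8
q_sa += alpha * td_error             # moves Q(s, a) halfway toward the target
print(q_sa)                          # ≈ 1.4
```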
Exploration: use epsilon-greedy — with probability $\varepsilon$ take a random action, otherwise take $\arg\max_a Q(s,a)$. Decay $\varepsilon$ over time: explore a lot early, exploit later.
Convergence: tabular Q-learning provably converges to optimal Q-values if every state-action pair is visited infinitely often and $\alpha$ decays appropriately. In practice, it works well for small state spaces (< 10,000 states).
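For small state spaces the whole algorithm fits in a few lines. A minimal sketch on a hypothetical 5-state corridor (the environment, its `step` helper, and all constants are mine, not from the text): state 4 is terminal with reward +1, so the optimal policy is to always move right.

```python
import random

random.seed(0)

# Hypothetical 5-state corridor: actions 0 = left, 1 = right,
# +1 reward on reaching state 4 (terminal). Walls clamp movement.
N_STATES = 5
alpha, gamma, eps = 0.1, 0.9, 0.3
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability eps, else act greedily
        a = random.randint(0, 1) if random.random() < eps else int(Q[s][1] > Q[s][0])
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s2]))  # TD target
        Q[s][a] += alpha * (target - Q[s][a])               # TD update
        s = s2

print([int(q[1] > q[0]) for q in Q[:4]])  # greedy policy per state (1 = right)
```

After a few hundred episodes the greedy policy points right in every non-terminal state, and the Q-values approach the discounted returns $\gamma^{3-s}$.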
### DQN: Scaling with Neural Networks
For large state spaces (images, continuous observations), a table is impossible. DQN approximates $Q(s, a; \theta)$ with a neural network.
Naive approach fails: training a neural net on sequential RL data is unstable because (1) consecutive transitions are correlated and (2) the target changes as the network updates. DQN solves both problems:
1. **Experience replay**: store transitions $(s, a, r, s', done)$ in a buffer. Sample random minibatches to train. This breaks temporal correlation and reuses data (sample efficiency).
2. **Target network**: maintain a separate copy of the network $\theta^-$ for computing TD targets. Update $\theta^-$ every $C$ steps by copying $\theta$. This prevents the moving-target problem.
```
Loss = (r + γ · max_a' Q(s', a'; θ⁻) − Q(s, a; θ))²
        \______ target (frozen) _____/  \ prediction /
```
### DQN Improvements
| Variant | Key idea | Why it helps |
|---|---|---|
| Double DQN | Use online net to select action, target net to evaluate | Fixes overestimation bias |
| Dueling DQN | Separate V(s) and A(s,a) streams | Better value estimation for states where action choice doesn’t matter |
| Prioritized replay | Sample transitions with high TD error more often | Focus learning on surprising transitions |
| Rainbow | Combine all of the above + distributional + noisy nets | State-of-the-art Atari performance |
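The Double DQN row is a one-line change to the target computation. A sketch contrasting the two targets (the function names are mine; `q_net` and `target_net` stand for any two networks mapping a batch of states to per-action Q-values, as in the CartPole example below):

```python
import torch

def dqn_target(target_net, r, s_next, d, gamma=0.99):
    """Vanilla DQN: target net both selects and evaluates the next action."""
    with torch.no_grad():
        max_next_q = target_net(s_next).max(dim=1).values  # select + evaluate
    return r + gamma * max_next_q * (1 - d)

def double_dqn_target(q_net, target_net, r, s_next, d, gamma=0.99):
    """Double DQN: online net selects, target net evaluates."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)      # select...
        next_q = target_net(s_next).gather(1, best_a).squeeze(1)  # ...evaluate
    return r + gamma * next_q * (1 - d)
```

Decoupling selection from evaluation means a single network's noisy overestimate of one action no longer gets both picked and trusted, which is the source of the overestimation bias.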
## Code Example

### DQN on CartPole (PyTorch)
```python
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
import numpy as np
from collections import deque
import random


class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x):
        return self.net(x)


class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.tensor(np.array(states), dtype=torch.float32),
                torch.tensor(actions, dtype=torch.long),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(np.array(next_states), dtype=torch.float32),
                torch.tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)


def train_dqn(episodes=300, gamma=0.99, lr=1e-3, eps_start=1.0,
              eps_end=0.01, eps_decay=0.995, batch_size=64,
              target_update=10):
    env = gym.make("CartPole-v1")
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    q_net = QNetwork(state_dim, action_dim)
    target_net = QNetwork(state_dim, action_dim)
    target_net.load_state_dict(q_net.state_dict())
    optimizer = optim.Adam(q_net.parameters(), lr=lr)
    buffer = ReplayBuffer()
    epsilon = eps_start

    for ep in range(episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q_vals = q_net(torch.tensor(state, dtype=torch.float32))
                    action = q_vals.argmax().item()

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            buffer.push(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

            # Train on a batch from replay buffer
            if len(buffer) >= batch_size:
                s, a, r, s_next, d = buffer.sample(batch_size)

                # Current Q-values for chosen actions
                q_values = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

                # Target Q-values (from frozen target network)
                with torch.no_grad():
                    max_next_q = target_net(s_next).max(dim=1).values
                    targets = r + gamma * max_next_q * (1 - d)

                loss = nn.functional.mse_loss(q_values, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # Decay epsilon
        epsilon = max(eps_end, epsilon * eps_decay)

        # Update target network
        if (ep + 1) % target_update == 0:
            target_net.load_state_dict(q_net.state_dict())

        if (ep + 1) % 50 == 0:
            print(f"Episode {ep+1}: reward={total_reward:.0f}, eps={epsilon:.3f}")

    env.close()
    return q_net


q_net = train_dqn()
```

## Key Tradeoffs
| Aspect | Q-Learning / DQN | Policy Gradient |
|---|---|---|
| Sample efficiency | High (off-policy, replay buffer) | Low (on-policy, fresh data) |
| Action space | Discrete only | Continuous or discrete |
| Stability | Target network needed | Clipping (PPO) needed |
| Exploration | Epsilon-greedy (explicit) | Stochastic policy (implicit) |
| Overestimation | Common (fix with Double DQN) | Not an issue |
## Common Pitfalls
- Replay buffer too small: if the buffer only holds recent data, you lose diversity and overfit to recent experience. Use at least 10K-100K transitions
- No target network: without a frozen target, the Q-network chases a moving target and diverges. Update target every 100-1000 steps
- Epsilon decays too fast: the agent stops exploring before finding good strategies. Decay over thousands of episodes, not hundreds
- Reward clipping without thought: DQN’s original paper clipped all rewards to [-1, 1]. This works for Atari but destroys information in other environments
- Applying DQN to continuous actions: DQN needs $\max_a Q(s,a)$ over all actions. With continuous actions, this max is intractable. Use Policy Gradient Methods or DDPG instead
## Exercises
- Implement tabular Q-learning on FrozenLake-v1 (4x4). Print the learned Q-table and the derived policy. Does it solve the environment?
- Add Double DQN to the code above: use `q_net` to select the best action but `target_net` to evaluate it. Compare performance against vanilla DQN
- Implement a simple prioritized replay buffer: sample transitions proportional to their TD error. Does it speed up learning on CartPole?
- Visualize the Q-values during training: for a fixed set of states, plot how $Q(s, a)$ evolves over episodes. When do they stabilize?
## Self-Test Questions
- Write out the Q-learning update rule and explain each term. What is the TD error?
- Why does DQN need experience replay? What happens if you train on consecutive transitions?
- What problem does the target network solve? Why not just use the same network?
- Why can’t DQN handle continuous action spaces? What algorithms can?
- What is overestimation bias in Q-learning and how does Double DQN fix it?
## Links
- RL Fundamentals — MDP, Bellman equation, value functions
- Policy Gradient Methods — the alternative to value-based methods
- Actor-Critic Methods — combining value and policy learning
- Multi-Armed Bandits — the simplest value-based problem
- Reinforcement Learning Roadmap