Multi-Armed Bandits

What

The simplest RL problem: you have k slot machines (arms), each with an unknown reward distribution. Pull arms one at a time, maximize total reward. No states, no transitions — just actions and rewards.

The fundamental tension: exploration vs exploitation. Do you pull the arm that’s been best so far (exploit), or try other arms that might be better (explore)?

Strategies

| Strategy | How it works | Tradeoff |
| --- | --- | --- |
| Epsilon-greedy | With probability eps explore a random arm, otherwise exploit the best arm so far | Simple, but keeps wasting exploration on clearly bad arms |
| UCB (Upper Confidence Bound) | Pick the arm with the highest mean + c * sqrt(ln(t) / n_arm) | Explores uncertain arms; mathematically principled |
| Thompson Sampling | Maintain a posterior per arm (e.g. a Beta distribution for Bernoulli rewards), sample from each, pick the highest sample | Bayesian; often best in practice |
| Greedy | Always pick the current best arm | Fails badly: gets stuck on suboptimal arms early |
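
The UCB rule from the table can be sketched directly. This is a minimal UCB1-style implementation; the function name, the seed handling, and the Gaussian reward model are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def ucb1(n_arms=5, n_steps=1000, c=2.0, seed=0):
    rng = np.random.default_rng(seed)
    true_means = rng.normal(size=n_arms)  # unknown to the agent
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)

    total_reward = 0.0
    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once so counts are nonzero
        else:
            # mean estimate + exploration bonus that shrinks as an arm
            # is pulled more often
            bonus = c * np.sqrt(np.log(t) / counts)
            arm = int(np.argmax(values + bonus))

        reward = rng.normal() + true_means[arm]
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward

    return total_reward, counts, true_means
```

The bonus term is large for rarely pulled arms, so uncertain arms get revisited without an explicit exploration probability.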

Regret

Regret = how much reward you lost compared to always pulling the best arm:

Regret(T) = T * mu_best - sum(rewards)

Good algorithms achieve sublinear regret: the per-step regret shrinks toward zero over time. UCB and Thompson Sampling achieve O(sqrt(kT)) worst-case regret, matching the minimax lower bound, and only O(log T) regret when the gaps between arm means are fixed.
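
A toy check of the formula with made-up numbers: suppose the best arm has mean 0.8 and two of five pulls went to a 0.3 arm instead.

```python
# Toy regret calculation: mu_best = 0.8, five pulls, two of them
# on an arm worth only 0.3 (losing 0.5 each time).
rewards = [0.8, 0.3, 0.8, 0.3, 0.8]
T = len(rewards)
mu_best = 0.8
regret = T * mu_best - sum(rewards)  # 4.0 - 3.0 = 1.0
```

Here the rewards stand in for the per-step expected values; with noisy rewards the same formula gives a noisy estimate of the expected regret.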

Python example

import numpy as np
 
def epsilon_greedy(n_arms=5, n_steps=1000, eps=0.1):
    true_means = np.random.randn(n_arms)  # unknown to the agent
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)
 
    total_reward = 0
    for t in range(n_steps):
        if np.random.random() < eps:
            arm = np.random.randint(n_arms)   # explore
        else:
            arm = np.argmax(values)            # exploit
 
        reward = np.random.randn() + true_means[arm]
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward
 
    return total_reward, values
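
For comparison, here is the Thompson Sampling row of the table, sketched for Bernoulli rewards with a Beta(1, 1) prior per arm. The function name and setup are illustrative assumptions:

```python
import numpy as np

def thompson_bernoulli(true_probs, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    alpha = np.ones(k)  # Beta(1, 1) prior per arm: successes + 1
    beta = np.ones(k)   # failures + 1

    total = 0
    for _ in range(n_steps):
        samples = rng.beta(alpha, beta)  # one posterior draw per arm
        arm = int(np.argmax(samples))
        reward = rng.random() < true_probs[arm]  # Bernoulli reward
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total += reward

    return total, alpha, beta
```

As evidence accumulates, the posteriors of bad arms concentrate below the best arm's, so their samples stop winning and exploration fades out on its own.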

Applications

Bandits show up everywhere you need to balance trying new things vs sticking with what works:

  • A/B testing: allocate more traffic to the winning variant as you learn (adaptive experiments)
  • Recommendation systems: explore new content vs showing known-good items
  • Ad placement: which ad to show to maximize click-through
  • Clinical trials: allocate more patients to the treatment that’s working better
  • Hyperparameter tuning: Bayesian optimization is essentially bandits over parameter space

Contextual bandits

When you have features (context) about the situation, you get contextual bandits. The arm’s reward depends on the context — e.g., which ad to show depends on who’s viewing. This sits between plain bandits and full RL: you have context but still no state transitions.
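
A common concrete instance is a LinUCB-style approach: one linear reward model per arm, with a UCB-like bonus from parameter uncertainty. This sketch is a hypothetical harness (the function name, the reward callback, and the alpha parameter are assumptions), not a production implementation:

```python
import numpy as np

def linucb(contexts, reward_fn, n_arms, alpha=1.0):
    # One ridge-regression model per arm: Gram matrix A and target vector b.
    d = contexts.shape[1]
    A = [np.eye(d) for _ in range(n_arms)]
    b = [np.zeros(d) for _ in range(n_arms)]

    total = 0.0
    for x in contexts:
        scores = []
        for a in range(n_arms):
            A_inv = np.linalg.inv(A[a])
            theta = A_inv @ b[a]  # ridge estimate of arm a's weights
            # mean prediction + exploration bonus from model uncertainty
            scores.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
        arm = int(np.argmax(scores))

        r = reward_fn(x, arm)  # observe reward for the chosen arm only
        A[arm] += np.outer(x, x)
        b[arm] += r * x
        total += r
    return total
```

Each arm's model only sees the contexts on which that arm was pulled, which is exactly the partial-feedback structure that separates bandits from supervised learning.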