Multi-Armed Bandits
What
The simplest RL problem: you have k slot machines (arms), each with an unknown reward distribution. Pull arms one at a time, maximize total reward. No states, no transitions — just actions and rewards.
The fundamental tension: exploration vs exploitation. Do you pull the arm that’s been best so far (exploit), or try other arms that might be better (explore)?
Strategies
| Strategy | How it works | Tradeoff |
|---|---|---|
| Epsilon-greedy | With probability eps explore randomly, otherwise exploit best arm | Simple but wastes exploration on bad arms |
| UCB (Upper Confidence Bound) | Pick arm with highest mean + c * sqrt(ln(t) / n_arm) | Explores uncertain arms, mathematically principled |
| Thompson Sampling | Maintain a posterior per arm (e.g. Beta for Bernoulli rewards), sample from each posterior, pick the highest sample | Bayesian, often best in practice |
| Greedy | Always pick current best | Fails badly — gets stuck on suboptimal arms early |
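The UCB row above can be sketched in a few lines. This is a minimal sketch assuming Gaussian rewards; the exploration constant `c` and the `true_means` setup are illustrative, not canonical:

```python
import numpy as np

def ucb(n_arms=5, n_steps=1000, c=2.0):
    true_means = np.random.randn(n_arms)   # unknown to the agent
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)
    total_reward = 0.0
    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1                    # pull each arm once to initialize
        else:
            bonus = c * np.sqrt(np.log(t) / counts)
            arm = np.argmax(values + bonus)  # mean + confidence bonus
        reward = np.random.randn() + true_means[arm]
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return total_reward, counts
```

The bonus term shrinks as an arm is pulled more, so under-explored arms keep getting tried; after enough steps, `counts` concentrates on the arm with the highest true mean.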
Regret
Regret = how much reward you lost compared to always pulling the best arm:
Regret(T) = T * mu_best - sum(rewards)
Good algorithms achieve sublinear regret — the per-step regret shrinks over time. UCB and Thompson Sampling achieve O(log T) regret on any fixed problem instance and roughly O(sqrt(kT)) in the worst case, both optimal up to constants.
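The regret formula is easy to compute in simulation, where the true means are known. A tiny sketch with made-up numbers:

```python
import numpy as np

def regret(true_means, rewards):
    """Regret(T) = T * mu_best - sum(rewards)."""
    return len(rewards) * np.max(true_means) - np.sum(rewards)

# e.g. 3 arms with best mean 0.9, and 4 pulls collecting these rewards:
print(regret([0.1, 0.5, 0.9], [0.5, 0.9, 0.1, 0.9]))  # ~1.2 (= 4*0.9 - 2.4)
```

Note that regret is measured against the best fixed arm in hindsight, so the agent can never observe it directly — it is an analysis tool, not a training signal.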
Python example
```python
import numpy as np

def epsilon_greedy(n_arms=5, n_steps=1000, eps=0.1):
    true_means = np.random.randn(n_arms)  # unknown to the agent
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)
    total_reward = 0.0
    for t in range(n_steps):
        if np.random.random() < eps:
            arm = np.random.randint(n_arms)  # explore
        else:
            arm = np.argmax(values)          # exploit
        reward = np.random.randn() + true_means[arm]
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return total_reward, values
```
Applications
Bandits show up everywhere you need to balance trying new things vs sticking with what works:
- A/B testing: allocate more traffic to the winning variant as you learn (adaptive experiments)
- Recommendation systems: explore new content vs showing known-good items
- Ad placement: which ad to show to maximize click-through
- Clinical trials: allocate more patients to the treatment that’s working better
- Hyperparameter tuning: Bayesian optimization is essentially bandits over parameter space
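For the A/B-testing case with binary rewards (click / no click), Thompson Sampling with Beta posteriors is a natural fit. A minimal sketch assuming Bernoulli arms; the click-through rates and step count are hypothetical:

```python
import numpy as np

def thompson_bernoulli(ctrs, n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    n_arms = len(ctrs)
    wins = np.ones(n_arms)     # Beta(1, 1) uniform prior per arm
    losses = np.ones(n_arms)
    for _ in range(n_steps):
        samples = rng.beta(wins, losses)   # one posterior draw per arm
        arm = np.argmax(samples)           # pick the most promising draw
        click = rng.random() < ctrs[arm]   # Bernoulli reward
        wins[arm] += click
        losses[arm] += 1 - click
    return wins + losses - 2               # pulls per arm

print(thompson_bernoulli([0.02, 0.05, 0.12]))  # most pulls should go to the 0.12 arm
```

This is exactly the adaptive-experiment behavior described above: traffic shifts toward the winning variant as the posteriors sharpen, rather than splitting evenly for the whole test.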
Contextual bandits
When you have features (context) about the situation, you get contextual bandits. The arm’s reward depends on the context — e.g., which ad to show depends on who’s viewing. This sits between plain bandits and full RL: you have context but still no state transitions.
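One standard contextual algorithm is LinUCB, which assumes the expected reward is linear in the context features. A minimal per-arm sketch using a ridge-regression estimate plus an uncertainty bonus; the dimension `d` and exploration weight `alpha` are illustrative:

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge regression with an upper-confidence bonus (LinUCB)."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)        # X^T X + I (ridge regularizer)
        self.b = np.zeros(d)      # X^T y
        self.alpha = alpha

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                       # ridge estimate of weights
        bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # uncertainty along x
        return theta @ x + bonus

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# pick the arm whose upper confidence bound is highest for this context
arms = [LinUCBArm(d=3) for _ in range(4)]
x = np.array([1.0, 0.2, -0.5])   # context features (e.g. viewer attributes)
best = max(range(4), key=lambda a: arms[a].ucb(x))
```

Each arm keeps its own linear model, so the same context can favor different arms — the "which ad for which viewer" case from the applications list.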
Links
- RL Fundamentals — bandits are the simplest RL setting
- Reinforcement Learning Roadmap — where bandits fit
- Q-Learning and DQN — adding states and transitions to the picture