Bandits
The Bandit Problem
All visualizations are motivated by the Sutton + Barto book. I found it a good exercise to reproduce them.
These are the simplest types of problems in RL, where there is no state involved. Instead, there are actions available with different reward distributions, yet there is no consequence on what state is sampled next.
Fun fact: the name "-armed bandit" is an allusion to the slot machine, AKA the one-armed bandit that steals your money.
Formulation
The agent is provided with a set of actions. Each round consists of selecting any action to get reward The goal is to maximize the reward over multiple rounds of the game. In the simplest form of the game, the reward distribution is stationary. Otherwise, we call it unstationary. We'll mostly concern ourselves with the stationary case.
Let's assume where We call the true action-value function, as it describes the true "value" of selecting each action. Clearly, one way to do well at this game is to try and estimate through interactions with the bandits. Then we can exploit what we know by consistently selecting the one with the highest value. So there are two challenges:
- How do we estimate ?
- What strategy should we follow to explore different ?
This foreshadows a key trade-off in RL: how to balance explotation vs. exploration.
Estimating
Say there is only one bandit, and we select it a bunch of times to get the sequence The natural way to estimate is just with the empirical average, The subscript denotes the estimate of at iteration This clearly has good convergence properties. But, people in RL actually prefer to write this in a different way, namely:
This is pretty clear from induction, e.g.:
More generally, we can refer to the as the stepsize , where There are some convergence properties on for stationary rewards. Namely, we need and but this does not tell us to what it converges in the general case.
Simulating the Problem
import numpy as np
class Bandit:
def __init__(self, k, shift_qstar=0):
self.q_star = np.random.randn(k) + shift_qstar
self.k = k
def step(self, action):
assert action < self.k, "action not in valid range"
reward = self.q_star[action] + np.random.randn()
return reward
prob = Bandit(10)
prob.q_star.shape(10,)
Exploration Methods
We can start to talk about the most basic types of policies now.
-greedy
-greedy means that with probability , we go with action Otherwise, with probability we select a random one. Let's try simulating different values of and
import numpy as np
class EpsilonGreedy:
def __init__(self, initial_q, epsilon, k):
self.initial_q = initial_q
self.epsilon = epsilon
self.k = k
self.Q = np.ones(k) * initial_q
self.N = np.ones(k)
def sample_act(self, t=None):
if np.random.random() < self.epsilon:
return np.random.choice(range(self.k))
return np.argmax(self.Q)
def update_q(self, r_t, a_t):
self.Q[a_t] += 1/self.N[a_t] * (r_t - self.Q[a_t])
self.N[a_t] += 1class Runner:
def __init__(self, problem, algo):
self.problem = problem
self.algo = algo
def run(self, n_steps):
optimal_action = np.argmax(self.problem.q_star)
optimal = []
prev = 0
for i in range(n_steps):
action = self.algo.sample_act(i+1)
r = self.problem.step(action)
self.algo.update_q(r, action)
optimal.append(prev + (action == optimal_action))
prev = optimal[-1]
pct_optimal = np.array(optimal) / (np.arange(n_steps) + 1)
return pct_optimalPlotting Utils
import matplotlib
matplotlib.rc_file("./matplotlibrc")
from matplotlib import pyplot as plt
def plot_multiple(arrays, labels):
colors = plt.get_cmap("Set1").colors
for i, (arr, label) in enumerate(zip(arrays, labels)):
arr = np.array(arr) * 100
steps = np.arange(len(arr))
plt.plot(steps, arr, label=label, color=colors[i])
plt.xlabel("Step Count")
plt.ylabel("Optimal Action (%)")
plt.yticks(np.arange(0, 101, 20))
max_len = max(len(a) for a in arrays)
xticks = np.linspace(0, max_len, 6, dtype=int)
plt.xticks(xticks)
plt.legend(loc="upper center")
plt.tight_layout()
plt.show()Experiment Setup
def run_experiment(k=10, n_steps=1000, n_problems=2000):
avg_res_1 = np.zeros(n_steps)
avg_res_2 = np.zeros(n_steps)
avg_res_3 = np.zeros(n_steps)
for _ in range(n_problems):
problem = Bandit(k)
config_1 = EpsilonGreedy(0, 0.01, k) # optimistic, greedy
config_2 = EpsilonGreedy(0, 0.1, k) # eps-greedy
config_3 = EpsilonGreedy(0, 0, k) # eps-greedy
res_1 = Runner(problem, config_1).run(n_steps)
res_2 = Runner(problem, config_2).run(n_steps)
res_3 = Runner(problem, config_3).run(n_steps)
avg_res_1 += res_1
avg_res_2 += res_2
avg_res_3 += res_3
avg_res_1 /= n_problems
avg_res_2 /= n_problems
avg_res_3 /= n_problems
return avg_res_1, avg_res_2, avg_res_3res_1, res_2, res_3 = run_experiment()
plot_multiple([res_1, res_2, res_3], ["ε=0.01", "ε=0.1", "ε=0"])Upper Confidence Bound
In -greedy, during exploration we select the action completely by random. Upper Confidence Bound (UCB) proposes a slight compromise, where you factor in the number of times has been selected before. Intuitively this number should be inversely related to how much you want to explore it. The policy is given by
where is the count for how many times has been selected by time Let's compare it to -greedy.
class UCB(EpsilonGreedy):
def __init__(self, c, *args, **kwargs):
super().__init__(*args, **kwargs)
self.c = c
self.counts = np.zeros(self.k)
def sample_act(self, t):
unexplored = np.where(self.counts == 0)
if unexplored[0].shape[0] > 0:
return unexplored[0][0]
act = np.argmax(self.Q + self.c * np.sqrt(np.log(t) / self.counts))
return act
def update_q(self, r_t, a_t):
super().update_q(r_t, a_t)
self.counts[a_t] += 1
Experiment Setup
def run_experiment(k=10, n_steps=1000, n_problems=2000):
avg_res_1 = np.zeros(n_steps)
avg_res_2 = np.zeros(n_steps)
for _ in range(n_problems):
problem = Bandit(k)
config_1 = UCB(2, 0, None, k)
config_2 = EpsilonGreedy(0, 0.1, k)
res_1 = Runner(problem, config_1).run(n_steps)
res_2 = Runner(problem, config_2).run(n_steps)
avg_res_1 += res_1
avg_res_2 += res_2
avg_res_1 /= n_problems
avg_res_2 /= n_problems
return avg_res_1, avg_res_2res_1, res_2 = run_experiment()
plot_multiple([res_1, res_2], ["UCB w/ c=2", "ε=0.1"])Gradient Bandits
Observe that, if we start with some ground truth , and we form the behavior of -greedy and UCB will change, since we must make some choice for and But, all that really matters when selecting an optimal action is the relative order between two actions. So, we can instead enode some notion of "energy" for each action that represents relative preference. More concretely, let denote the energy of action . We can initialize and follow the model
Given (action, reward) we'd update with
The intuition is that this is the form of a single-sample approximation to the update The choice of is arbitrary.
import torch.nn.functional as F
import torch
class GradientBandit:
def __init__(self, alpha, k, use_baseline):
self.alpha = alpha
self.k = k
self.H = torch.zeros(k)
self.r_bar = np.zeros(k)
self.counts = np.ones(k)
self.use_baseline = use_baseline
def sample_act(self, t):
dist = F.softmax(self.H, dim=0).numpy()
return np.random.choice(self.k, p=dist)
def update_q(self, r_t, a_t):
dist = F.softmax(self.H, dim=0).numpy()
self.r_bar[a_t] += 1/self.counts[a_t] * (r_t - self.r_bar[a_t])
self.counts[a_t] += 1
if self.use_baseline:
baseline = self.r_bar
else:
baseline = 0
self.H += self.alpha * (r_t - baseline) * ((np.arange(self.k) == a_t).astype(float) - dist) Experiment Setup
from joblib import Parallel, delayed
from tqdm import tqdm
import numpy as np
def one_experiment(k, n_steps):
problem = Bandit(k, shift_qstar=4)
config_1 = GradientBandit(0.1, k, True)
config_2 = GradientBandit(0.1, k, False)
res_1 = Runner(problem, config_1).run(n_steps)
res_2 = Runner(problem, config_2).run(n_steps)
return res_1, res_2
def run_experiment(k=10, n_steps=1000, n_problems=2000, n_jobs=-1):
results = Parallel(n_jobs=n_jobs)(
delayed(one_experiment)(k, n_steps)
for _ in tqdm(range(n_problems))
)
res_1 = np.mean([r[0] for r in results], axis=0)
res_2 = np.mean([r[1] for r in results], axis=0)
return res_1, res_2
res_1, res_2 = run_experiment()
plot_multiple([res_1, res_2], ["ε=0.1 + baseline", "ε=0.1 + no baseline"])100%|███████████████| 2000/2000 [00:12<00:00, 162.07it/s]
Things will become more interesting next when we start considering states as well.