Learn AI Series):Before we get into today's material, let's clear the desk and work through last episode's three exercises. All of these build on the Blackjack machinery from episode #105 -- BlackjackEnv, simple_policy, and first_visit_mc_evaluation -- so I'm assuming those are imported and sitting in scope.
Exercise 1: Add a usable-ace policy comparison to the Blackjack evaluator. Run first_visit_mc_evaluation on the simple_policy, then print the value table separately for usable_ace = True and usable_ace = False, and compare where the soft ace actually helps.
import numpy as np
from collections import defaultdict
# Assumes BlackjackEnv, simple_policy, first_visit_mc_evaluation from episode #105.
def print_value_table(V, usable_ace):
label = "usable ace" if usable_ace else "no usable ace"
print(f"\nState values ({label}):")
print(f"{'sum':>5}", end="")
for dealer in range(2, 12):
print(f"{dealer:>7}", end="")
print()
for player_sum in range(21, 11, -1):
print(f"{player_sum:>5}", end="")
for dealer in range(2, 12):
v = V.get((player_sum, dealer, usable_ace), 0.0)
print(f"{v:>7.2f}", end="")
print()
env = BlackjackEnv()
V = first_visit_mc_evaluation(env, simple_policy, n_episodes=500000)
print_value_table(V, usable_ace=False)
print_value_table(V, usable_ace=True)
The two tables tell a clean story: a usable ace is never a liability, it is a free option. A soft hand cannot bust on the next hit (the ace just quietly counts as 1 instead of 11), so the downside is capped. The values come out consistently higher in the usable-ace table, and the gap is widest right in the 12-16 swamp -- exactly where a hard hand is one unlucky card away from death. With the ace in hand, that same range stops being terrifying.
Exercise 2: Implement and compare ordinary vs weighted importance sampling on an off-policy Blackjack evaluation. Use a uniform-random behavior policy and a deterministic target (stick on 18+), and track both estimators for a single fixed state across increasing episode counts.
import numpy as np
import random
# Assumes BlackjackEnv from episode #105.
def behavior_policy(state):
return random.randint(0, 1) # uniform over {stick, hit}
def behavior_prob(state, action):
return 0.5 # uniform => 0.5 each
def target_policy(state):
player_sum, _, _ = state
return 0 if player_sum >= 18 else 1 # stick on 18+, else hit
def target_prob(state, action):
return 1.0 if action == target_policy(state) else 0.0
def off_policy_estimates(env, fixed_state, episode_counts, gamma=1.0):
max_ep = max(episode_counts)
ord_num, ord_den = 0.0, 0 # ordinary IS: divide by episode count
wis_num, wis_den = 0.0, 0.0 # weighted IS: divide by summed weights
snapshots = {}
for ep in range(1, max_ep + 1):
episode = []
state = env.reset()
done = False
while not done:
action = behavior_policy(state)
ns, reward, done = env.step(action)
episode.append((state, action, reward))
state = ns
# walk backwards, accumulating return G and importance ratio W
G, W, record = 0.0, 1.0, None
for state, action, reward in reversed(episode):
G = reward + gamma * G
if target_prob(state, action) == 0.0:
W = 0.0
else:
W *= target_prob(state, action) / behavior_prob(state, action)
if state == fixed_state:
record = (W, G) # keep overwriting => ends on first visit
if record is not None:
W_fs, G_fs = record
ord_num += W_fs * G_fs
ord_den += 1
wis_num += W_fs * G_fs
wis_den += W_fs
if ep in episode_counts:
ois = ord_num / ord_den if ord_den else 0.0
wis = wis_num / wis_den if wis_den else 0.0
snapshots[ep] = (ois, wis)
return snapshots
env = BlackjackEnv()
fixed = (13, 2, False)
snaps = off_policy_estimates(env, fixed, [1000, 10000, 100000, 1000000])
print(f"{'episodes':>10}{'ordinary IS':>14}{'weighted IS':>14}")
for ep in sorted(snaps):
ois, wis = snaps[ep]
print(f"{ep:>10}{ois:>14.3f}{wis:>14.3f}")
Run it and the bias/variance trade-off from episode #105 stops being a slogan and becomes a column of numbers. Ordinary IS lurches -- every so often a rare trajectory with a big importance ratio lands and yanks the estimate sideways. Weighted IS divides by the summed weights instead of the raw count, and it settles down smoothly toward the answer. It carries a whiff of bias early on, but on a noisy problem like this it wins comfortably. When in doubt, weighted.
Exercise 3: Build a first-visit vs every-visit race in a small environment where states genuinely repeat within an episode, and measure how fast each estimator approaches a long reference run.
import numpy as np
from collections import defaultdict
import random
class TinyGrid:
"""3x3 grid. Start top-left, terminal bottom-right, -1 per step.
A random policy loops over the same cells constantly."""
def __init__(self, size=3):
self.size = size
self.terminal = (size - 1, size - 1)
def reset(self):
self.pos = (0, 0)
return self.pos
def step(self, action):
r, c = self.pos
if action == 0: r = max(0, r - 1)
elif action == 1: c = min(self.size - 1, c + 1)
elif action == 2: r = min(self.size - 1, r + 1)
elif action == 3: c = max(0, c - 1)
self.pos = (r, c)
done = self.pos == self.terminal
return self.pos, -1.0, done
def random_policy(state):
return random.randint(0, 3)
def mc_eval(env, policy, n_episodes, first_visit, gamma=1.0):
V = defaultdict(float)
count = defaultdict(int)
for _ in range(n_episodes):
episode = []
state = env.reset()
done = False
while not done:
a = policy(state)
ns, reward, done = env.step(a)
episode.append((state, reward))
state = ns
G = 0.0
seen = set()
for t in range(len(episode) - 1, -1, -1):
s, reward = episode[t]
G = reward + gamma * G
if first_visit and s in seen:
continue
seen.add(s)
count[s] += 1
V[s] += (G - V[s]) / count[s]
return dict(V)
env = TinyGrid()
ref = mc_eval(env, random_policy, 200000, first_visit=True) # reference
target_state = (0, 0)
print(f"reference V{target_state} = {ref[target_state]:.2f}")
for n in [200, 1000, 5000, 20000]:
fv = mc_eval(env, random_policy, n, first_visit=True)
ev = mc_eval(env, random_policy, n, first_visit=False)
print(f"n={n:>6} first={fv[target_state]:8.2f} every={ev[target_state]:8.2f}")
On this tiny grid the random walk wanders into the same cells over and over before it stumbles onto the exit, so every-visit MC squeezes many updates out of each episode and tends to close on the reference faster at low episode counts. But -- and this is the catch the exercise was fishing for -- those repeated visits inside one episode are heavily correlated (they share the same tail of randomness), so every-visit's extra data is not as informative as it looks. As the episode count climbs, first-visit's cleaner, independent samples catch right up. More data, yes; more independent data, not quite.
Right, episode 106. Let me set the scene with where the last two episodes left us, because today is genuinely the payoff.
Episode #104 gave us dynamic programming: powerful, exact, and utterly dependent on having the full model of the environment handed to us on a plate. Episode #105 gave us Monte Carlo: model-free at last, learning purely from experience -- but with one stubborn limitation. MC has to wait for the episode to end before it can learn anything, because it needs the actual return. No ending, no return, no update. An agent that runs forever just sits there twiddling its thumbs.
So we've got two tools, each strong exactly where the other is weak. DP updates after every step but needs a model. MC needs no model but only learns at the finish line. The obvious question -- the one I deliberately dangled at the end of last episode -- is whether we can have both. Learn from raw experience like MC, and update incrementally like DP.
We can. The answer is temporal difference (TD) learning, and I do not think it is an exaggeration to call it the single most important idea in reinforcement learning. Richard Sutton, who formalised it in the 1980s, calls TD "one of the most fundamental and novel ideas in the field." Having said that, let's see what makes it tick.
In Monte Carlo, you update a state's value using the actual return G_t -- the total discounted reward from that state until the episode ends. You need the whole episode in hand first.
TD throws that requirement out. In TD(0), you update using the TD target: the immediate reward plus the estimated value of the next state. And crucially, you update after every single step:
MC update: V(s) <- V(s) + alpha * [ G_t - V(s) ]
TD update: V(s) <- V(s) + alpha * [ r + gamma*V(s') - V(s) ]
That term r + gamma*V(s') is the TD target. The difference r + gamma*V(s') - V(s) is the TD error (commonly written as delta) -- it measures how much the transition we just saw surprised us, relative to what we already believed. Positive delta: the world was better than expected, nudge the value up. Negative delta: worse than expected, nudge it down.
import numpy as np
from collections import defaultdict
def td0_evaluation(env, policy, n_episodes=5000, alpha=0.1, gamma=1.0):
"""TD(0): evaluate a policy by learning from single transitions."""
V = defaultdict(float)
for _ in range(n_episodes):
state = env.reset()
done = False
while not done:
action = policy(state)
next_state, reward, done = env.step(action)
# TD update: V(s) += alpha * [r + gamma*V(s') - V(s)]
td_target = reward + gamma * V[next_state] * (1 - done)
td_error = td_target - V[state]
V[state] += alpha * td_error
state = next_state
return dict(V)
Look at where the update lives: inside the while not done loop. We improve V(state) immediately after taking one step, long before the episode finishes. That is the whole difference from Monte Carlo, and it is bigger than it looks.
The slightly mind-bending part is the V[next_state] sitting inside the target. We are using our current estimate of the next state's value to improve our estimate of the current state's value. Estimates improving estimates improving estimates. This is called bootstrapping (as in "pulling yourself up by your own bootstraps"), and the first time you meet it your instinct screams that it must be circular nonsense. It isn't. It converges. Let me explain why.
Cast your mind back to the Bellman equation from episode #104:
V(s) = E[ r + gamma*V(s') ]
The true value of a state equals the expected immediate reward plus the discounted value of wherever you land next. The TD update is nothing more than a stochastic approximation of this equation. Each transition (s, a, r, s') hands us one noisy sample of r + gamma*V(s'). By nudging V(s) toward many such samples with a small step size alpha, the estimate converges to the expectation. If you squint, it is just stochastic gradient descent on the Bellman equation -- and we have been doing SGD since the early episodes of this series.
The deep way to understand the difference between MC and TD is through bias and variance:
r + gamma*V(s') is biased (because V(s') is an estimate, not ground truth) but low variance (only a single random step of noise gets baked in).Lower variance is why TD usually learns faster in practice, and the bias quietly shrinks as V(s') gets better. You trade a little correctness-on-average for a lot of stability, and in stead of waiting for noisy full returns you learn from every step -- on most problems that is a trade you happily take.
Evaluation is nice, but we want control -- finding the best policy, not just scoring a fixed one. As in episode #105, the model-free setting forces us to learn action values Q(s,a) instead of state values V(s), because going greedy on V needs a model look-ahead we do not have, while greedy on Q is just argmax_a Q(s,a).
The first TD control algorithm is SARSA, and the name is the algorithm: it updates from the tuple (S, A, R, S', A') -- state, action, reward, next state, next action:
def sarsa(env, n_episodes=5000, alpha=0.1, gamma=1.0, epsilon=0.1):
"""SARSA: on-policy TD control."""
Q = defaultdict(lambda: np.zeros(env.n_actions))
def epsilon_greedy(state):
if np.random.random() < epsilon:
return np.random.randint(env.n_actions)
return int(np.argmax(Q[state]))
for _ in range(n_episodes):
state = env.reset()
action = epsilon_greedy(state)
done = False
while not done:
next_state, reward, done = env.step(action)
next_action = epsilon_greedy(next_state)
# SARSA update: Q(s,a) += alpha * [r + gamma*Q(s',a') - Q(s,a)]
td_target = reward + gamma * Q[next_state][next_action] * (1 - done)
Q[state][action] += alpha * (td_target - Q[state][action])
state = next_state
action = next_action
return dict(Q)
The detail that defines SARSA is Q[next_state][next_action], where next_action is the action the agent actually takes next -- and that action came out of epsilon_greedy, so it includes the random exploration steps. SARSA therefore learns the value of the policy as it is genuinely being followed, warts and all. It is on-policy: the policy generating the data and the policy being evaluated are one and the same. (We are reusing epsilon-greedy straight from the bandit toolkit in episode #103 -- it keeps turning up because exploration never stops mattering.)
Now for the famous one. Q-Learning (Watkins, 1989) keeps SARSA's skeleton but changes a single word. Instead of the action the agent actually takes, it bootstraps from the best possible action in the next state:
def q_learning(env, n_episodes=5000, alpha=0.1, gamma=1.0, epsilon=0.1):
"""Q-Learning: off-policy TD control."""
Q = defaultdict(lambda: np.zeros(env.n_actions))
def epsilon_greedy(state):
if np.random.random() < epsilon:
return np.random.randint(env.n_actions)
return int(np.argmax(Q[state]))
for _ in range(n_episodes):
state = env.reset()
done = False
while not done:
action = epsilon_greedy(state)
next_state, reward, done = env.step(action)
# Q-Learning update: Q(s,a) += alpha * [r + gamma*max_a' Q(s',a') - Q(s,a)]
td_target = reward + gamma * np.max(Q[next_state]) * (1 - done)
Q[state][action] += alpha * (td_target - Q[state][action])
state = next_state
return dict(Q)
The entire difference between the two algorithms is Q[next_state][next_action] (SARSA) versus np.max(Q[next_state]) (Q-Learning). One word -- but it changes what the agent is learning about.
Q-Learning is off-policy: it learns about the optimal policy (always take the best action) while following a different, exploratory policy (epsilon-greedy). It does not care that the behaviour policy occasionally does something daft for the sake of exploration -- the max in the target always assumes optimal play from the next state onward. SARSA, by contrast, is on-policy: the random exploration steps are baked right into what it learns. That sounds like a pedantic distinction. The cliff walking example shows it is anything but.
This is the classic environment for making SARSA and Q-Learning disagree out loud. Picture a 4x12 grid. The whole bottom row except the two ends is a cliff. Start in the bottom-left, goal in the bottom-right. Every step costs -1. Step into the cliff and you take a brutal -100 and get teleported back to the start.
class CliffWalking:
"""4x12 grid with a cliff along the bottom row.
Start: bottom-left. Goal: bottom-right. Cliff: bottom row (minus the ends).
Falling off: -100 and back to start. Normal step: -1."""
def __init__(self):
self.rows, self.cols = 4, 12
self.start = (3, 0)
self.goal = (3, 11)
self.n_actions = 4
self.reset()
def reset(self):
self.pos = self.start
return self.pos
def step(self, action):
row, col = self.pos
if action == 0: row = max(0, row - 1) # up
elif action == 1: col = min(11, col + 1) # right
elif action == 2: row = min(3, row + 1) # down
elif action == 3: col = max(0, col - 1) # left
self.pos = (row, col)
# cliff check: bottom row, columns 1..10
if row == 3 and 1 <= col <= 10:
self.pos = self.start
return self.pos, -100.0, False # fell, back to start
done = self.pos == self.goal
return self.pos, -1.0, done
env = CliffWalking()
Q_sarsa = sarsa(env, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1)
Q_qlearn = q_learning(env, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1)
Train both and read off the greedy paths they prefer, and you get two genuinely different personalities:
Q-Learning learns the optimal path -- hug the cliff edge, march straight along the bottom row, shortest possible route. Theoretically perfect. But during training epsilon-greedy keeps occasionally shoving the agent sideways into the cliff, racking up -100 penalties. Q-Learning shrugs: its target always assumes optimal continuation, so it learns the value of the greedy path and treats the exploration disasters as somebody else's problem.
SARSA learns the safe path -- climb a row up away from the cliff, walk along the safe ledge, and only drop down to the goal at the very end. Longer (more -1 steps) but it does not flirt with catastrophe. Because SARSA evaluates the policy it actually follows (random exploration included), it has learned a hard truth: being next to the cliff is dangerous for an agent that sometimes acts randomly. So it stays away.
Which one is "right"? Here's the thing -- it depends entirely on what you plan to do with the policy. If you will eventually deploy the pure greedy policy with no exploration, Q-Learning hands you the genuinely optimal behaviour. But if the agent's real-world behaviour will keep some randomness in it (and during training, or in any noisy deployment, it does), SARSA's cautious path earns more actual reward. Q-Learning optimises the policy you wish you were running; SARSA optimises the one you are actually running. Both answers are correct -- to different questions ;-)
So far it looks like TD and MC are rival camps. They are really two ends of one continuous spectrum. TD(0) takes one real reward and bootstraps the rest. Monte Carlo takes all the real rewards and bootstraps nothing. N-step TD lets you sit anywhere in between -- take n real rewards, then bootstrap from the value estimate n steps out:
def n_step_td(env, policy, n_steps=3, n_episodes=5000, alpha=0.1, gamma=1.0):
"""N-step TD evaluation: use n real rewards, then bootstrap."""
V = defaultdict(float)
for _ in range(n_episodes):
state = env.reset()
states = [state]
rewards = [0.0] # placeholder so indices line up with time
T = float('inf') # episode length, unknown until we hit the end
t = 0
while True:
if t < T:
action = policy(state)
next_state, reward, done = env.step(action)
states.append(next_state)
rewards.append(reward)
if done:
T = t + 1
state = next_state
# the state whose value we can finally update is at time tau
tau = t - n_steps + 1
if tau >= 0:
# n-step return: sum of the real rewards in the window
G = sum(gamma ** (i - tau - 1) * rewards[i]
for i in range(tau + 1, min(tau + n_steps, T) + 1))
# bootstrap if the window did not already reach the terminal
if tau + n_steps < T:
G += gamma ** n_steps * V[states[tau + n_steps]]
V[states[tau]] += alpha * (G - V[states[tau]])
if tau == T - 1:
break
t += 1
return dict(V)
The bookkeeping looks fiddly, but the idea is simple. We delay each update until we have collected n real rewards beyond the state in question, then we bootstrap from there:
In practice the sweet spot is almost never at either extreme. Intermediate values (roughly n = 3 to 10) usually learn fastest: you keep some of Monte Carlo's unbiasedness while holding on to most of TD's low variance. The best n is problem-dependent, so it becomes yet another hyperparameter to tune -- and you will be tuning quite some of those by the time this series is done. But knowing the dial exists is half the battle.
Step back and the three families we have covered -- DP, MC, TD -- are not separate religions. They are three points on a map defined by two simple yes/no questions: do you need a model, and do you bootstrap?
| Method | Model required? | Bootstraps? | Needs complete episodes? |
|---|---|---|---|
| Dynamic Programming | Yes | Yes | No |
| Monte Carlo | No | No | Yes |
| Temporal Difference | No | Yes | No |
Read that bottom row and TD's appeal jumps out: model-free like Monte Carlo, and incremental like dynamic programming. It quietly grabbed the best column from each of its predecessors. That combination is exactly why TD methods dominate modern reinforcement learning -- and why Q-Learning in particular became the load-bearing idea underneath the deep RL systems that learned to play Atari from raw pixels. But swapping that humble Q dictionary for a neural network opens a whole box of new problems, and that is a story for the next episode ;-)
r + gamma*V(s') -- no waiting for the episode to end, which makes it model-free and online at the same time;max_a' Q(s',a'), learning the optimal policy regardless of how the agent behaves -- the foundation that modern deep RL is built on;Exercise 1: Build a SARSA vs Q-Learning reward tracker on CliffWalking. Modify both functions to return the total reward per episode (a list, one entry per training episode). Train both for 500 episodes, then print a smoothed average (say, a moving window of 10 episodes) of each. You should see Q-Learning's curve sitting lower during training (it keeps falling off the cliff while exploring) even though its final greedy path is shorter. Explain in a sentence why the "worse" learner has the better optimal policy.
Exercise 2: Implement Expected SARSA. It is the missing third option: instead of bootstrapping from the next action taken (SARSA) or the max (Q-Learning), bootstrap from the expectation over next actions under the epsilon-greedy policy -- that is, (1 - epsilon) * max_a Q(s',a) + epsilon * mean_a Q(s',a). Run it on cliff walking and compare its behaviour and stability to plain SARSA. Which path does it learn, and is its training smoother?
Exercise 3: Take the n_step_td evaluator and turn it into an experiment. Pick a fixed policy on a small environment (a random walk on a short line of states is the textbook choice), then run n-step TD for n in [1, 2, 4, 8, 16, 32] and measure the error of the learned values against a long Monte Carlo reference run. Plot or print error against n. You should find a U-shape -- some intermediate n beats both TD(0) and full Monte Carlo. Report the n that won for your setup.