Learn AI Series):Before we let an agent loose on a chessboard, let's clear last episode's three exercises. They all reuse the IndependentAgent, IteratedPrisonersDilemma, SelfPlayTrainer and QMIXMixer classes from episode #111, so I'm assuming those are imported and in scope.
Exercise 1: Two IndependentAgents play the IteratedPrisonersDilemma for 5000 rounds, each one fed the last joint action as a 4-dim one-hot observation. Track the cooperation rate, and run it under three seeds to feel how seed-sensitive MARL really is.
import numpy as np
import random
# Assumes IndependentAgent and IteratedPrisonersDilemma from episode #111.
def joint_onehot(a1, a2):
"""Encode the previous joint action as a 4-dim one-hot observation."""
v = np.zeros(4, dtype=np.float32)
v[a1 * 2 + a2] = 1.0 # (0,0)->0 (0,1)->1 (1,0)->2 (1,1)->3
return v
def run_ipd(seed, rounds=5000):
random.seed(seed)
np.random.seed(seed)
env = IteratedPrisonersDilemma()
a1 = IndependentAgent(obs_dim=4, n_actions=2, agent_id=0)
a2 = IndependentAgent(obs_dim=4, n_actions=2, agent_id=1)
obs = np.zeros(4, dtype=np.float32) # no history on round 0
coop_history = []
for t in range(rounds):
act1 = a1.choose_action(obs)
act2 = a2.choose_action(obs)
r1, r2 = env.step(act1, act2)
nxt = joint_onehot(act1, act2)
a1.store(obs, act1, r1, nxt, False)
a2.store(obs, act2, r2, nxt, False)
a1.learn(); a2.learn()
# decay exploration so they can actually settle on a policy
a1.epsilon = max(0.02, a1.epsilon * 0.999)
a2.epsilon = max(0.02, a2.epsilon * 0.999)
coop_history.append(int(act1 == 0 and act2 == 0))
obs = nxt
return np.mean(coop_history[-500:]) # cooperation rate at the end
for s in (0, 1, 2):
print(f"seed={s}: final mutual-cooperation rate = {run_ipd(s):.2f}")
The honest result is the lesson: across the three seeds you'll typically get three different stories. One run drifts into the cosy (cooperate, cooperate) corner and stays there, another collapses into mutual defection, and a third never settles at all -- it oscillates, because each agent keeps "discovering" that defection pays right up until the other punishes it. This is non-stationarity from episode #111 made visible: there is no fixed target to converge to when your opponent is a moving part. Run it ten times and you'd get a distribution, not an answer -- which is exactly why serious MARL papers report results over many seeds and never trust a single lucky run.
Exercise 2: A minimal SelfPlayTrainer for rock-paper-scissors (one stateless step, three actions). Train with the historical opponent pool and without it, and watch the difference.
import numpy as np
import random
class RPSPolicy:
"""A stateless policy: just three logits over rock/paper/scissors."""
def __init__(self):
self.logits = np.zeros(3, dtype=np.float64)
def probs(self):
z = self.logits - self.logits.max()
e = np.exp(z)
return e / e.sum()
def act(self):
return np.random.choice(3, p=self.probs())
def beats(a, b): # +1 if a beats b, -1 if loses, 0 tie
return 0 if a == b else (1 if (a - b) % 3 == 1 else -1)
def train_rps(use_pool, steps=20000, lr=0.05):
cur = RPSPolicy()
pool = []
for t in range(steps):
if use_pool and pool and random.random() < 0.5:
opp = random.choice(pool)
else:
opp = cur
a, b = cur.act(), opp.act()
reward = beats(a, b)
# crude policy-gradient nudge toward actions that won
p = cur.probs()
grad = -p
grad[a] += 1.0
cur.logits += lr * reward * grad
if use_pool and t % 500 == 0:
snap = RPSPolicy(); snap.logits = cur.logits.copy()
pool.append(snap)
return cur.probs()
print("no pool :", np.round(train_rps(False), 3))
print("with pool:", np.round(train_rps(True), 3))
The "with pool" run converges toward the famous (1/3, 1/3, 1/3) Nash mixture -- the only strategy that nobody can exploit -- because it's forced to stay strong against every past version of itself. The "no pool" run, always chasing the latest self, cycles instead: rock beats scissors, so it leans rock, so the copy leans paper, so it leans scissors, round and round forever. Same game, same code, one extra list of checkpoints, completely different outcome. That little pool is doing real work.
Exercise 3: A numerical check that the QMIXMixer is genuinely monotonic -- and a demonstration that the lone torch.abs is what makes it so.
import torch
# Assumes QMIXMixer from episode #111.
def monotonicity_violations(use_abs, trials=5000, n_agents=4, state_dim=8):
mixer = QMIXMixer(n_agents, state_dim)
if not use_abs:
# Sabotage the monotonicity trick: patch torch.abs to a no-op.
import builtins
orig = torch.abs
torch.abs = lambda x: x
violations = 0
for _ in range(trials):
qs = torch.randn(1, n_agents)
state = torch.randn(1, state_dim)
q_before = mixer(qs, state)
bump = qs.clone()
bump[0, 0] += 0.1 # raise ONE agent's Q-value
q_after = mixer(bump, state)
if q_after.item() < q_before.item() - 1e-6:
violations += 1
if not use_abs:
torch.abs = orig
return violations
print("violations WITH abs :", monotonicity_violations(True))
print("violations WITHOUT abs:", monotonicity_violations(False))
With the torch.abs in place you should see a flat zero violations across all 5000 trials: raise any agent's Q-value, and Q_total never drops. Strip the abs out and the violations pour in, because now some mixing weights are negative and an agent's improvement can lower the team value. That's the whole game: monotonicity is what guarantees "each agent greedily picks its own best action" matches "the team picks its best joint action". One function call, load-bearing for the entire decentralized-execution story. Nota bene -- monkey-patching torch.abs like this is a teaching hack, not something you'd ever ship ;-)
So. Eleven episodes ago we couldn't get an agent to balance a stick. Today we're going to talk about agents that beat the best humans on the planet at the hardest games we ever invented. And here's the thing that still gives me a little thrill after all these years: games have been the proving ground for reinforcement learning since the very beginning, and almost every big idea in modern RL was born trying to win one.
A quick roll of honour. Tesauro's TD-Gammon taught itself Backgammon back in 1992 -- in stead of copying human play, it found moves that surprised grandmasters. DeepMind's DQN (our episode #107) chewed through dozens of Atari games straight from pixels in 2015. AlphaGo toppled the world Go champion in 2016, a decade earlier than anyone sane predicted. OpenAI Five dismantled professional Dota 2 teams in 2019, and AlphaStar hit Grandmaster in StarCraft II the same year.
Why games, though? Because they are very nearly perfect laboratories: the rules are crisp, the reward is unambiguous (you won, or you didn't), the simulator runs faster than real time, and difficulty comes pre-graded from "toddler" to "world champion". And -- this is the part that matters far beyond games -- the algorithms born here travel. AlphaZero's MCTS planning grew into MuZero's general model-based RL. PPO, invented to wrangle game agents, now trains language models. The toy was always a Trojan horse for something serious.
AlphaZero (Silver et al., 2018) is, to me, one of the most beautiful results in the whole field. The same code, with no game-specific tweaks, learned Chess, Go and Shogi from scratch -- three games with wildly different boards, pieces and tactics. No opening books, no endgame tablebases, no hand-crafted features. Just the rules, and an awful lot of self-play.
The recipe has three ingredients, and that's genuinely all:
The network is a dual-headed beast sitting on a tower of residual blocks (remember those from the CNN episodes? they show up everywhere):
import torch
import torch.nn as nn
import numpy as np
import math
class ResidualBlock(nn.Module):
def __init__(self, channels):
super().__init__()
self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn1 = nn.BatchNorm2d(channels)
self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn2 = nn.BatchNorm2d(channels)
def forward(self, x):
residual = x
x = torch.relu(self.bn1(self.conv1(x)))
x = self.bn2(self.conv2(x))
return torch.relu(x + residual) # the skip connection
class AlphaZeroNetwork(nn.Module):
"""Dual-head network: a policy AND a value from one board state."""
def __init__(self, board_size, n_actions, n_res_blocks=5, channels=128):
super().__init__()
# 17 input planes is the Go encoding (stone history + side to move).
self.conv_input = nn.Sequential(
nn.Conv2d(17, channels, 3, padding=1),
nn.BatchNorm2d(channels), nn.ReLU(),
)
self.res_blocks = nn.ModuleList(
[ResidualBlock(channels) for _ in range(n_res_blocks)])
self.policy_head = nn.Sequential( # probability over moves
nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(),
nn.Flatten(),
nn.Linear(2 * board_size * board_size, n_actions),
)
self.value_head = nn.Sequential( # who is winning? (-1..1)
nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(),
nn.Flatten(),
nn.Linear(board_size * board_size, 256), nn.ReLU(),
nn.Linear(256, 1), nn.Tanh(),
)
def forward(self, state):
x = self.conv_input(state)
for block in self.res_blocks:
x = block(x)
return self.policy_head(x), self.value_head(x)
Two heads, one body. The shared residual tower learns to "see" the board; the policy head turns that into a hunch about good moves, and the value head into a verdict about the outcome. Having said that, the network alone is a strong intuition but a weak player -- it reacts, it doesn't plan. The planning is where MCTS comes in.
MCTS is how AlphaZero "thinks". Rather than glance at a position and blurt out a move, it imagines hundreds or thousands of continuations, growing a search tree, and uses the network to decide which branches are worth exploring. Four steps, repeated over and over: select a promising path down the tree, expand a new leaf, evaluate it with the network, and backpropagate the verdict up to the root.
class MCTSNode:
"""One node in the search tree."""
def __init__(self, prior, parent=None):
self.prior = prior # P(a) suggested by the network
self.visit_count = 0 # N(a)
self.total_value = 0.0 # W(a)
self.children = {} # action -> child node
self.parent = parent
@property
def mean_value(self):
return 0.0 if self.visit_count == 0 else self.total_value / self.visit_count
class MCTS:
"""Monte Carlo Tree Search guided by a policy/value network."""
def __init__(self, network, n_simulations=100, c_puct=1.0):
self.network = network
self.n_simulations = n_simulations
self.c_puct = c_puct
def search(self, game_state):
root = MCTSNode(prior=0)
for _ in range(self.n_simulations):
node, state = root, game_state.clone()
# 1. SELECT: walk down using the PUCT score
while node.children and not state.is_terminal():
action, node = self._select_child(node)
state.apply_action(action)
# 2. EXPAND + EVALUATE with the network
if not state.is_terminal():
policy, value = self._evaluate(state)
for action in state.get_legal_actions():
node.children.setdefault(
action, MCTSNode(prior=policy[action], parent=node))
else:
value = state.get_reward()
# 3. BACKPROPAGATE, flipping perspective each level up
while node is not None:
node.visit_count += 1
node.total_value += value
value = -value
node = node.parent
# The move played most often during search is the move we trust.
visits = {a: c.visit_count for a, c in root.children.items()}
total = sum(visits.values())
return {a: n / total for a, n in visits.items()}
def _select_child(self, node):
"""PUCT: exploit high value, explore high prior with few visits."""
sqrt_parent = math.sqrt(node.visit_count)
best_score, best = -float("inf"), (None, None)
for action, child in node.children.items():
exploit = child.mean_value
explore = self.c_puct * child.prior * sqrt_parent / (1 + child.visit_count)
score = exploit + explore
if score > best_score:
best_score, best = score, (action, child)
return best
def _evaluate(self, state):
with torch.no_grad():
logits, value = self.network(state.to_tensor().unsqueeze(0))
policy = torch.softmax(logits, dim=1).squeeze().numpy()
return policy, value.item()
The magic is in that PUCT formula inside _select_child. It balances two urges: exploit the moves that have looked good so far (high mean_value), and explore the moves the network thinks are promising but we've barely tried (high prior, low visit_count). The network prior is what makes this tractable -- Go has roughly 250 legal moves per turn and games go 150+ moves deep, so brute-force search is hopeless. The prior whispers "don't bother with these obviously terrible moves" and the tree spends its precious simulations where they count. And notice the final return value: not the highest-value move, but the most-visited one. The move the search kept coming back to is the move it trusts, because visit count is a vote weighted by everything the search learned.
Here's where the whole thing becomes a perpetual-motion machine (well, almost). Self-play with the current network generates games. Each position, paired with the MCTS visit-distribution and the eventual game result, becomes a training example. The network trains on that, gets a little sharper, which makes MCTS a little sharper, which produces better games, which... you see where this goes.
def alphazero_training_loop(game, network, n_iterations=100,
games_per_iteration=100, mcts_sims=200):
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3, weight_decay=1e-4)
for iteration in range(n_iterations):
# Phase 1: self-play generates fresh training data
training_data = []
mcts = MCTS(network, n_simulations=mcts_sims)
for _ in range(games_per_iteration):
state = game.new_game()
history = []
while not state.is_terminal():
pi = mcts.search(state)
history.append((state.to_tensor(), pi))
actions = list(pi.keys())
action = np.random.choice(actions, p=[pi[a] for a in actions])
state.apply_action(action)
# Hand the final result back to every position, alternating sign
result = state.get_reward()
for board_tensor, pi in history:
training_data.append((board_tensor, pi, result))
result = -result
# Phase 2: train on the self-play data
for epoch in range(10):
np.random.shuffle(training_data)
for start in range(0, len(training_data), 64):
batch = training_data[start:start + 64]
boards, pis, results = zip(*batch)
boards_t = torch.stack(boards)
results_t = torch.FloatTensor(results).unsqueeze(1)
logits, values = network(boards_t)
# Policy loss: cross-entropy against the MCTS visit-distribution
targets = torch.zeros_like(logits)
for i, pi in enumerate(pis):
for action, prob in pi.items():
targets[i, action] = prob
policy_loss = -(targets * torch.log_softmax(logits, dim=1)).sum(1).mean()
# Value loss: MSE against who actually won
value_loss = nn.functional.mse_loss(values, results_t)
loss = policy_loss + value_loss
optimizer.zero_grad(); loss.backward(); optimizer.step()
print(f"Iteration {iteration}: {len(training_data)} positions generated")
The numbers are worth sitting with. AlphaZero trained roughly 9 hours on Chess, 12 hours on Shogi and 13 days on Go -- on a single machine with 4 TPUs -- and the resulting players were stronger than the best specialised engines humanity had built over decades. Stockfish for Chess, the dedicated Go programs, all of it, beaten by one general recipe that started from random play. The first time I read that paper I had to put it down and go for a walk ;-)
Board games are tidy -- perfect information, discrete moves, take-turns. Atari is messier: real-time, raw pixels, partial information. DQN (episode #107) cracked dozens of Atari games, but a handful stayed stubbornly unsolved, and the poster child for "stubborn" is Montezuma's Revenge.
The problem is sparse reward. In Montezuma's Revenge the agent must climb down ladders, jump gaps, grab a key, cross a room, unlock a door -- hundreds of correct steps -- before a single point arrives. Plain epsilon-greedy DQN just rattles around randomly, never stumbling on that long lucky sequence, and so it never sees a reward to learn from. It's like trying to learn the piano when the only feedback is a standing ovation at the end of a concert you can't yet play.
The fix that stuck is giving the agent its own internal motivation -- a curiosity bonus for visiting states it can't yet predict:
class ICMModule(nn.Module):
"""Intrinsic Curiosity Module: reward the agent for surprises."""
def __init__(self, state_dim, action_dim, feature_dim=64):
super().__init__()
self.feature_net = nn.Sequential( # compress states into features
nn.Linear(state_dim, 128), nn.ReLU(),
nn.Linear(128, feature_dim),
)
self.forward_model = nn.Sequential( # predict next feature from this + action
nn.Linear(feature_dim + action_dim, 128), nn.ReLU(),
nn.Linear(128, feature_dim),
)
def compute_intrinsic_reward(self, state, action, next_state):
phi_s = self.feature_net(state)
phi_ns = self.feature_net(next_state)
pred = self.forward_model(torch.cat([phi_s, action], dim=-1))
# Big prediction error = big surprise = big intrinsic reward
return 0.5 * (pred - phi_ns.detach()).pow(2).sum(dim=-1)
The trick is self-balancing. States the agent can't predict are novel, so they pay a curiosity bonus and the agent goes looking. But as it explores them the forward model learns them, the prediction error shrinks, and those once-exciting states become boring -- pushing the agent toward the next unexplored frontier. It's an artificial sense of wonder that automatically fades where it's been satisfied. Count-based exploration and Random Network Distillation are cousins of this idea, and together they finally got agents through that wretched first room.
If board games are tidy and Atari is messy, StarCraft II and Dota 2 are the meat grinder. They pile on every hard thing at once: games lasting 30-45 minutes (long time horizons), a fog of war hiding the opponent (imperfect information), continuous real-time decisions, and action spaces so vast you can barely write them down.
OpenAI Five tackled Dota 2 by throwing PPO (episode #109) at the problem at a frankly absurd scale: 256 GPUs and 128,000 CPU cores, playing the equivalent of 180 years of Dota every single day for ten months. A few of the tricks that made it work:
gamma = 0.9997 -- an agent that genuinely cares about a reward arising thousands of steps in the future.AlphaStar went at StarCraft II differently: a transformer-based policy (yes, the same attention machinery from episodes #51-53) reading the game as a set of entities, trained first on human replays and then refined with self-play inside a league of deliberately diverse agents -- some aggressive, some weird, some built only to expose the main agent's blind spots. It reached Grandmaster, top 0.2% of human players, in all three races. The league is the key idea: a single self-play line can get stuck in a rut, but a whole ecosystem of rivals keeps everyone honest.
Raw game rewards are often too sparse or too delayed to learn from quickly, so we're tempted to add intermediate signals -- a little reward for picking up the key, for moving toward the goal, for staying alive. This is reward shaping, and it is gloriously easy to get wrong. Shape it carelessly and the agent will gleefully maximise your shaping signal in stead of the actual objective -- circling the key forever because approaching it pays, never bothering to pick it up.
There is exactly one form of shaping proven safe, and it's worth knowing by name: potential-based shaping (Ng et al., 1999).
def shaped_reward(state, next_state, raw_reward, gamma=0.99):
"""Potential-based shaping: provably keeps the optimal policy intact.
F(s, s') = gamma * Phi(s') - Phi(s) (Ng, Harada & Russell, 1999)."""
return raw_reward + (gamma * potential(next_state) - potential(state))
def potential(state):
"""A heuristic 'goodness' of a state -- here, closeness to the goal."""
return -np.linalg.norm(state.position - state.goal_position)
The reason this particular shape is safe is subtle and lovely: because the bonus is the difference of a potential across a transition, it telescopes away over any complete path and cancels out of every loop. So it can speed learning -- pointing the agent roughly toward the goal -- without ever changing which policy is optimal. Any other shaping scheme risks quietly rewriting the objective. In practice plenty of successful projects use non-potential shaping anyway, carefully, accepting the theoretical risk for the practical speed-up -- but they do it with their eyes open, and so should you.
The last idea is the most human one. Instead of dropping an agent into the full, brutal game from step one, curriculum learning starts it on an easy version and turns the difficulty up only as it earns the right:
class CurriculumScheduler:
"""Raise the difficulty as the agent's win rate clears a threshold."""
def __init__(self, difficulties, threshold=0.7):
self.difficulties = difficulties # easy -> hard
self.current_level = 0
self.threshold = threshold # win rate needed to advance
self.recent = []
def get_difficulty(self):
return self.difficulties[self.current_level]
def report_result(self, won: bool):
self.recent.append(won)
if len(self.recent) > 100:
self.recent.pop(0)
win_rate = sum(self.recent) / len(self.recent)
if win_rate > self.threshold and self.current_level < len(self.difficulties) - 1:
self.current_level += 1
self.recent = []
print(f"Advancing to level {self.current_level}")
OpenAI Five leaned on this hard: weaker opponents first, fewer heroes, simplified mechanics, then gradually the full chaos. Without a curriculum the initial exploration problem was simply too hard -- a random agent couldn't blunder into a win often enough to ever learn what winning looks like. Start easy and the early wins come freely; each one teaches just enough to survive the next rung. It's the sparse-reward problem solved with patience in stead of curiosity -- and often the two are used together.
Pull all six of these threads together and you've got the modern game-playing toolkit: a network for intuition, MCTS for planning, self-play for opponents, curiosity for exploration, shaping for guidance and a curriculum for pacing. The same kit, pointed away from games, is what we'll start aiming at messier real-world problems before long -- but every one of these ideas was forged on a scoreboard first.
Exercise 1: Implement Tic-Tac-Toe as a tiny game object exposing clone(), is_terminal(), get_legal_actions(), get_reward(), apply_action() and to_tensor(), then run the MCTS class above on it with a random evaluator (skip the network -- just return a uniform policy and a random value). Play 200 games of pure-MCTS against a random opponent and report the win rate. You should see MCTS alone, with no learning at all, already crush random play -- planning is powerful even before training enters the picture.
Exercise 2: Take your Tic-Tac-Toe game and wire up the full alphazero_training_loop with a small AlphaZeroNetwork (shrink the board planes to fit). Train for a few dozen iterations and plot the loss. Then play the trained agent against the pure-MCTS agent from Exercise 1 using the same number of simulations. Does the network-guided search win? By how much? This is the whole AlphaZero thesis in miniature -- intuition plus search beats search alone.
Exercise 3: Bolt the ICMModule onto a DQN agent (reuse your DQN from episode #107) on MountainCar-v0 -- a classic sparse-reward environment where the car only scores by reaching the flag. Train once with the intrinsic curiosity bonus added to the reward and once without. Compare how many episodes each takes to first reach the goal. You'll feel, first-hand, why curiosity exists: the plain agent may never get there at all, while the curious one goes looking.
Get the Tic-Tac-Toe pipeline working before the next episode -- a board you can beat your own agent on is the best possible intuition for everything we've built, and the techniques here stop being academic the moment you watch your own code learn to play. The next stop takes this game-playing machinery and starts pointing it at problems that don't come with a tidy scoreboard at all.