Learn AI Series (#112) - RL for Games

What will I learn

You will learn AlphaZero -- the single algorithm that mastered Chess, Go and Shogi with nothing but the rules and a stubborn habit of playing itself;
Monte Carlo Tree Search (MCTS), the planning machine that lets a network "think ahead" by simulating thousands of possible futures before committing to a move;
why Atari -- and one infamous game in particular -- forced the field to invent curiosity, because some worlds hand out almost no reward at all;
how StarCraft and Dota 2 dragged RL kicking and screaming into long horizons, hidden information and action spaces the size of a small country;
reward shaping -- how to nudge a learner with extra signals without accidentally teaching it to cheat;
and curriculum learning, the deceptively obvious idea of starting easy and turning the difficulty knob up only when the agent has earned it.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution with NumPy and PyTorch;
You've followed episodes #107 (DQN), #109 (PPO), #110 (model-based RL) and #111 (multi-agent self-play) -- today is where all four finally shake hands.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#112) - RL for Games

Solutions to Episode #111 Exercises

Before we let an agent loose on a chessboard, let's clear last episode's three exercises. They all reuse the IndependentAgent, IteratedPrisonersDilemma, SelfPlayTrainer and QMIXMixer classes from episode #111, so I'm assuming those are imported and in scope.

Exercise 1: Two IndependentAgents play the IteratedPrisonersDilemma for 5000 rounds, each one fed the last joint action as a 4-dim one-hot observation. Track the cooperation rate, and run it under three seeds to feel how seed-sensitive MARL really is.

import numpy as np
import random
# Assumes IndependentAgent and IteratedPrisonersDilemma from episode #111.


def joint_onehot(a1, a2):
    """Encode the previous joint action as a 4-dim one-hot observation."""
    v = np.zeros(4, dtype=np.float32)
    v[a1 * 2 + a2] = 1.0          # (0,0)->0 (0,1)->1 (1,0)->2 (1,1)->3
    return v


def run_ipd(seed, rounds=5000):
    random.seed(seed)
    np.random.seed(seed)
    env = IteratedPrisonersDilemma()
    a1 = IndependentAgent(obs_dim=4, n_actions=2, agent_id=0)
    a2 = IndependentAgent(obs_dim=4, n_actions=2, agent_id=1)

    obs = np.zeros(4, dtype=np.float32)   # no history on round 0
    coop_history = []
    for t in range(rounds):
        act1 = a1.choose_action(obs)
        act2 = a2.choose_action(obs)
        r1, r2 = env.step(act1, act2)
        nxt = joint_onehot(act1, act2)

        a1.store(obs, act1, r1, nxt, False)
        a2.store(obs, act2, r2, nxt, False)
        a1.learn(); a2.learn()

        # decay exploration so they can actually settle on a policy
        a1.epsilon = max(0.02, a1.epsilon * 0.999)
        a2.epsilon = max(0.02, a2.epsilon * 0.999)

        coop_history.append(int(act1 == 0 and act2 == 0))
        obs = nxt
    return np.mean(coop_history[-500:])   # cooperation rate at the end


for s in (0, 1, 2):
    print(f"seed={s}: final mutual-cooperation rate = {run_ipd(s):.2f}")

The honest result is the lesson: across the three seeds you'll typically get three different stories. One run drifts into the cosy (cooperate, cooperate) corner and stays there, another collapses into mutual defection, and a third never settles at all -- it oscillates, because each agent keeps "discovering" that defection pays right up until the other punishes it. This is non-stationarity from episode #111 made visible: there is no fixed target to converge to when your opponent is a moving part. Run it ten times and you'd get a distribution, not an answer -- which is exactly why serious MARL papers report results over many seeds and never trust a single lucky run.

Exercise 2: A minimal SelfPlayTrainer for rock-paper-scissors (one stateless step, three actions). Train with the historical opponent pool and without it, and watch the difference.

import numpy as np
import random


class RPSPolicy:
    """A stateless policy: just three logits over rock/paper/scissors."""
    def __init__(self):
        self.logits = np.zeros(3, dtype=np.float64)

    def probs(self):
        z = self.logits - self.logits.max()
        e = np.exp(z)
        return e / e.sum()

    def act(self):
        return np.random.choice(3, p=self.probs())


def beats(a, b):                       # +1 if a beats b, -1 if loses, 0 tie
    return 0 if a == b else (1 if (a - b) % 3 == 1 else -1)


def train_rps(use_pool, steps=20000, lr=0.05):
    cur = RPSPolicy()
    pool = []
    for t in range(steps):
        if use_pool and pool and random.random() < 0.5:
            opp = random.choice(pool)
        else:
            opp = cur
        a, b = cur.act(), opp.act()
        reward = beats(a, b)
        # crude policy-gradient nudge toward actions that won
        p = cur.probs()
        grad = -p
        grad[a] += 1.0
        cur.logits += lr * reward * grad
        if use_pool and t % 500 == 0:
            snap = RPSPolicy(); snap.logits = cur.logits.copy()
            pool.append(snap)
    return cur.probs()


print("no pool :", np.round(train_rps(False), 3))
print("with pool:", np.round(train_rps(True), 3))

The "with pool" run converges toward the famous (1/3, 1/3, 1/3) Nash mixture -- the only strategy that nobody can exploit -- because it's forced to stay strong against every past version of itself. The "no pool" run, always chasing the latest self, cycles instead: rock beats scissors, so it leans rock, so the copy leans paper, so it leans scissors, round and round forever. Same game, same code, one extra list of checkpoints, completely different outcome. That little pool is doing real work.

Exercise 3: A numerical check that the QMIXMixer is genuinely monotonic -- and a demonstration that the lone torch.abs is what makes it so.

import torch
# Assumes QMIXMixer from episode #111.


def monotonicity_violations(use_abs, trials=5000, n_agents=4, state_dim=8):
    mixer = QMIXMixer(n_agents, state_dim)
    if not use_abs:
        # Sabotage the monotonicity trick: patch torch.abs to a no-op.
        import builtins
        orig = torch.abs
        torch.abs = lambda x: x
    violations = 0
    for _ in range(trials):
        qs = torch.randn(1, n_agents)
        state = torch.randn(1, state_dim)
        q_before = mixer(qs, state)
        bump = qs.clone()
        bump[0, 0] += 0.1                 # raise ONE agent's Q-value
        q_after = mixer(bump, state)
        if q_after.item() < q_before.item() - 1e-6:
            violations += 1
    if not use_abs:
        torch.abs = orig
    return violations


print("violations WITH abs   :", monotonicity_violations(True))
print("violations WITHOUT abs:", monotonicity_violations(False))

With the torch.abs in place you should see a flat zero violations across all 5000 trials: raise any agent's Q-value, and Q_total never drops. Strip the abs out and the violations pour in, because now some mixing weights are negative and an agent's improvement can lower the team value. That's the whole game: monotonicity is what guarantees "each agent greedily picks its own best action" matches "the team picks its best joint action". One function call, load-bearing for the entire decentralized-execution story. Nota bene -- monkey-patching torch.abs like this is a teaching hack, not something you'd ever ship ;-)

Right -- on to the main event

So. Eleven episodes ago we couldn't get an agent to balance a stick. Today we're going to talk about agents that beat the best humans on the planet at the hardest games we ever invented. And here's the thing that still gives me a little thrill after all these years: games have been the proving ground for reinforcement learning since the very beginning, and almost every big idea in modern RL was born trying to win one.

A quick roll of honour. Tesauro's TD-Gammon taught itself Backgammon back in 1992 -- in stead of copying human play, it found moves that surprised grandmasters. DeepMind's DQN (our episode #107) chewed through dozens of Atari games straight from pixels in 2015. AlphaGo toppled the world Go champion in 2016, a decade earlier than anyone sane predicted. OpenAI Five dismantled professional Dota 2 teams in 2019, and AlphaStar hit Grandmaster in StarCraft II the same year.

Why games, though? Because they are very nearly perfect laboratories: the rules are crisp, the reward is unambiguous (you won, or you didn't), the simulator runs faster than real time, and difficulty comes pre-graded from "toddler" to "world champion". And -- this is the part that matters far beyond games -- the algorithms born here travel. AlphaZero's MCTS planning grew into MuZero's general model-based RL. PPO, invented to wrangle game agents, now trains language models. The toy was always a Trojan horse for something serious.

AlphaZero: one algorithm, three games, zero human knowledge

AlphaZero (Silver et al., 2018) is, to me, one of the most beautiful results in the whole field. The same code, with no game-specific tweaks, learned Chess, Go and Shogi from scratch -- three games with wildly different boards, pieces and tactics. No opening books, no endgame tablebases, no hand-crafted features. Just the rules, and an awful lot of self-play.

The recipe has three ingredients, and that's genuinely all:

A neural network that reads the board and outputs two things at once -- a policy (a probability over legal moves) and a value (who is winning, from -1 to +1);
Monte Carlo Tree Search that uses that network to look ahead intelligently;
Self-play (the episode #111 trick) that generates the training data out of thin air.

The network is a dual-headed beast sitting on a tower of residual blocks (remember those from the CNN episodes? they show up everywhere):

import torch
import torch.nn as nn
import numpy as np
import math


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        x = torch.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
        return torch.relu(x + residual)   # the skip connection


class AlphaZeroNetwork(nn.Module):
    """Dual-head network: a policy AND a value from one board state."""
    def __init__(self, board_size, n_actions, n_res_blocks=5, channels=128):
        super().__init__()
        # 17 input planes is the Go encoding (stone history + side to move).
        self.conv_input = nn.Sequential(
            nn.Conv2d(17, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.res_blocks = nn.ModuleList(
            [ResidualBlock(channels) for _ in range(n_res_blocks)])

        self.policy_head = nn.Sequential(            # probability over moves
            nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(2 * board_size * board_size, n_actions),
        )
        self.value_head = nn.Sequential(             # who is winning? (-1..1)
            nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(board_size * board_size, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),
        )

    def forward(self, state):
        x = self.conv_input(state)
        for block in self.res_blocks:
            x = block(x)
        return self.policy_head(x), self.value_head(x)

Two heads, one body. The shared residual tower learns to "see" the board; the policy head turns that into a hunch about good moves, and the value head into a verdict about the outcome. Having said that, the network alone is a strong intuition but a weak player -- it reacts, it doesn't plan. The planning is where MCTS comes in.

Monte Carlo Tree Search: thinking by simulating

MCTS is how AlphaZero "thinks". Rather than glance at a position and blurt out a move, it imagines hundreds or thousands of continuations, growing a search tree, and uses the network to decide which branches are worth exploring. Four steps, repeated over and over: select a promising path down the tree, expand a new leaf, evaluate it with the network, and backpropagate the verdict up to the root.

class MCTSNode:
    """One node in the search tree."""
    def __init__(self, prior, parent=None):
        self.prior = prior            # P(a) suggested by the network
        self.visit_count = 0          # N(a)
        self.total_value = 0.0        # W(a)
        self.children = {}            # action -> child node
        self.parent = parent

    @property
    def mean_value(self):
        return 0.0 if self.visit_count == 0 else self.total_value / self.visit_count


class MCTS:
    """Monte Carlo Tree Search guided by a policy/value network."""
    def __init__(self, network, n_simulations=100, c_puct=1.0):
        self.network = network
        self.n_simulations = n_simulations
        self.c_puct = c_puct

    def search(self, game_state):
        root = MCTSNode(prior=0)
        for _ in range(self.n_simulations):
            node, state = root, game_state.clone()

            # 1. SELECT: walk down using the PUCT score
            while node.children and not state.is_terminal():
                action, node = self._select_child(node)
                state.apply_action(action)

            # 2. EXPAND + EVALUATE with the network
            if not state.is_terminal():
                policy, value = self._evaluate(state)
                for action in state.get_legal_actions():
                    node.children.setdefault(
                        action, MCTSNode(prior=policy[action], parent=node))
            else:
                value = state.get_reward()

            # 3. BACKPROPAGATE, flipping perspective each level up
            while node is not None:
                node.visit_count += 1
                node.total_value += value
                value = -value
                node = node.parent

        # The move played most often during search is the move we trust.
        visits = {a: c.visit_count for a, c in root.children.items()}
        total = sum(visits.values())
        return {a: n / total for a, n in visits.items()}

    def _select_child(self, node):
        """PUCT: exploit high value, explore high prior with few visits."""
        sqrt_parent = math.sqrt(node.visit_count)
        best_score, best = -float("inf"), (None, None)
        for action, child in node.children.items():
            exploit = child.mean_value
            explore = self.c_puct * child.prior * sqrt_parent / (1 + child.visit_count)
            score = exploit + explore
            if score > best_score:
                best_score, best = score, (action, child)
        return best

    def _evaluate(self, state):
        with torch.no_grad():
            logits, value = self.network(state.to_tensor().unsqueeze(0))
        policy = torch.softmax(logits, dim=1).squeeze().numpy()
        return policy, value.item()

The magic is in that PUCT formula inside _select_child. It balances two urges: exploit the moves that have looked good so far (high mean_value), and explore the moves the network thinks are promising but we've barely tried (high prior, low visit_count). The network prior is what makes this tractable -- Go has roughly 250 legal moves per turn and games go 150+ moves deep, so brute-force search is hopeless. The prior whispers "don't bother with these obviously terrible moves" and the tree spends its precious simulations where they count. And notice the final return value: not the highest-value move, but the most-visited one. The move the search kept coming back to is the move it trusts, because visit count is a vote weighted by everything the search learned.

The training loop: the dog that teaches itself new tricks

Here's where the whole thing becomes a perpetual-motion machine (well, almost). Self-play with the current network generates games. Each position, paired with the MCTS visit-distribution and the eventual game result, becomes a training example. The network trains on that, gets a little sharper, which makes MCTS a little sharper, which produces better games, which... you see where this goes.

def alphazero_training_loop(game, network, n_iterations=100,
                            games_per_iteration=100, mcts_sims=200):
    optimizer = torch.optim.Adam(network.parameters(), lr=1e-3, weight_decay=1e-4)

    for iteration in range(n_iterations):
        # Phase 1: self-play generates fresh training data
        training_data = []
        mcts = MCTS(network, n_simulations=mcts_sims)
        for _ in range(games_per_iteration):
            state = game.new_game()
            history = []
            while not state.is_terminal():
                pi = mcts.search(state)
                history.append((state.to_tensor(), pi))
                actions = list(pi.keys())
                action = np.random.choice(actions, p=[pi[a] for a in actions])
                state.apply_action(action)
            # Hand the final result back to every position, alternating sign
            result = state.get_reward()
            for board_tensor, pi in history:
                training_data.append((board_tensor, pi, result))
                result = -result

        # Phase 2: train on the self-play data
        for epoch in range(10):
            np.random.shuffle(training_data)
            for start in range(0, len(training_data), 64):
                batch = training_data[start:start + 64]
                boards, pis, results = zip(*batch)
                boards_t = torch.stack(boards)
                results_t = torch.FloatTensor(results).unsqueeze(1)

                logits, values = network(boards_t)
                # Policy loss: cross-entropy against the MCTS visit-distribution
                targets = torch.zeros_like(logits)
                for i, pi in enumerate(pis):
                    for action, prob in pi.items():
                        targets[i, action] = prob
                policy_loss = -(targets * torch.log_softmax(logits, dim=1)).sum(1).mean()
                # Value loss: MSE against who actually won
                value_loss = nn.functional.mse_loss(values, results_t)

                loss = policy_loss + value_loss
                optimizer.zero_grad(); loss.backward(); optimizer.step()
        print(f"Iteration {iteration}: {len(training_data)} positions generated")

The numbers are worth sitting with. AlphaZero trained roughly 9 hours on Chess, 12 hours on Shogi and 13 days on Go -- on a single machine with 4 TPUs -- and the resulting players were stronger than the best specialised engines humanity had built over decades. Stockfish for Chess, the dedicated Go programs, all of it, beaten by one general recipe that started from random play. The first time I read that paper I had to put it down and go for a walk ;-)

Beyond the board: Atari and the curse of sparse reward

Board games are tidy -- perfect information, discrete moves, take-turns. Atari is messier: real-time, raw pixels, partial information. DQN (episode #107) cracked dozens of Atari games, but a handful stayed stubbornly unsolved, and the poster child for "stubborn" is Montezuma's Revenge.

The problem is sparse reward. In Montezuma's Revenge the agent must climb down ladders, jump gaps, grab a key, cross a room, unlock a door -- hundreds of correct steps -- before a single point arrives. Plain epsilon-greedy DQN just rattles around randomly, never stumbling on that long lucky sequence, and so it never sees a reward to learn from. It's like trying to learn the piano when the only feedback is a standing ovation at the end of a concert you can't yet play.

The fix that stuck is giving the agent its own internal motivation -- a curiosity bonus for visiting states it can't yet predict:

class ICMModule(nn.Module):
    """Intrinsic Curiosity Module: reward the agent for surprises."""
    def __init__(self, state_dim, action_dim, feature_dim=64):
        super().__init__()
        self.feature_net = nn.Sequential(      # compress states into features
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )
        self.forward_model = nn.Sequential(    # predict next feature from this + action
            nn.Linear(feature_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def compute_intrinsic_reward(self, state, action, next_state):
        phi_s = self.feature_net(state)
        phi_ns = self.feature_net(next_state)
        pred = self.forward_model(torch.cat([phi_s, action], dim=-1))
        # Big prediction error = big surprise = big intrinsic reward
        return 0.5 * (pred - phi_ns.detach()).pow(2).sum(dim=-1)

The trick is self-balancing. States the agent can't predict are novel, so they pay a curiosity bonus and the agent goes looking. But as it explores them the forward model learns them, the prediction error shrinks, and those once-exciting states become boring -- pushing the agent toward the next unexplored frontier. It's an artificial sense of wonder that automatically fades where it's been satisfied. Count-based exploration and Random Network Distillation are cousins of this idea, and together they finally got agents through that wretched first room.

Real-time strategy: where RL meets the meat grinder

If board games are tidy and Atari is messy, StarCraft II and Dota 2 are the meat grinder. They pile on every hard thing at once: games lasting 30-45 minutes (long time horizons), a fog of war hiding the opponent (imperfect information), continuous real-time decisions, and action spaces so vast you can barely write them down.

OpenAI Five tackled Dota 2 by throwing PPO (episode #109) at the problem at a frankly absurd scale: 256 GPUs and 128,000 CPU cores, playing the equivalent of 180 years of Dota every single day for ten months. A few of the tricks that made it work:

Patience as a hyperparameter: a Dota game is ~20,000 decisions long, so they used a discount factor of gamma = 0.9997 -- an agent that genuinely cares about a reward arising thousands of steps in the future.
Taming the action space: the raw continuous control was discretised into roughly a thousand sensible options per step, so the policy had something finite to choose from.
Team reward: all five heroes on a team shared one reward signal -- cooperative MARL straight out of episode #111, learned at industrial scale.

AlphaStar went at StarCraft II differently: a transformer-based policy (yes, the same attention machinery from episodes #51-53) reading the game as a set of entities, trained first on human replays and then refined with self-play inside a league of deliberately diverse agents -- some aggressive, some weird, some built only to expose the main agent's blind spots. It reached Grandmaster, top 0.2% of human players, in all three races. The league is the key idea: a single self-play line can get stuck in a rut, but a whole ecosystem of rivals keeps everyone honest.

Reward shaping: helpful nudges and how they backfire

Raw game rewards are often too sparse or too delayed to learn from quickly, so we're tempted to add intermediate signals -- a little reward for picking up the key, for moving toward the goal, for staying alive. This is reward shaping, and it is gloriously easy to get wrong. Shape it carelessly and the agent will gleefully maximise your shaping signal in stead of the actual objective -- circling the key forever because approaching it pays, never bothering to pick it up.

There is exactly one form of shaping proven safe, and it's worth knowing by name: potential-based shaping (Ng et al., 1999).

def shaped_reward(state, next_state, raw_reward, gamma=0.99):
    """Potential-based shaping: provably keeps the optimal policy intact.
    F(s, s') = gamma * Phi(s') - Phi(s)  (Ng, Harada & Russell, 1999)."""
    return raw_reward + (gamma * potential(next_state) - potential(state))


def potential(state):
    """A heuristic 'goodness' of a state -- here, closeness to the goal."""
    return -np.linalg.norm(state.position - state.goal_position)

The reason this particular shape is safe is subtle and lovely: because the bonus is the difference of a potential across a transition, it telescopes away over any complete path and cancels out of every loop. So it can speed learning -- pointing the agent roughly toward the goal -- without ever changing which policy is optimal. Any other shaping scheme risks quietly rewriting the objective. In practice plenty of successful projects use non-potential shaping anyway, carefully, accepting the theoretical risk for the practical speed-up -- but they do it with their eyes open, and so should you.

Curriculum learning: don't throw the toddler in the deep end

The last idea is the most human one. Instead of dropping an agent into the full, brutal game from step one, curriculum learning starts it on an easy version and turns the difficulty up only as it earns the right:

class CurriculumScheduler:
    """Raise the difficulty as the agent's win rate clears a threshold."""
    def __init__(self, difficulties, threshold=0.7):
        self.difficulties = difficulties      # easy -> hard
        self.current_level = 0
        self.threshold = threshold            # win rate needed to advance
        self.recent = []

    def get_difficulty(self):
        return self.difficulties[self.current_level]

    def report_result(self, won: bool):
        self.recent.append(won)
        if len(self.recent) > 100:
            self.recent.pop(0)
        win_rate = sum(self.recent) / len(self.recent)
        if win_rate > self.threshold and self.current_level < len(self.difficulties) - 1:
            self.current_level += 1
            self.recent = []
            print(f"Advancing to level {self.current_level}")

OpenAI Five leaned on this hard: weaker opponents first, fewer heroes, simplified mechanics, then gradually the full chaos. Without a curriculum the initial exploration problem was simply too hard -- a random agent couldn't blunder into a win often enough to ever learn what winning looks like. Start easy and the early wins come freely; each one teaches just enough to survive the next rung. It's the sparse-reward problem solved with patience in stead of curiosity -- and often the two are used together.

Pull all six of these threads together and you've got the modern game-playing toolkit: a network for intuition, MCTS for planning, self-play for opponents, curiosity for exploration, shaping for guidance and a curriculum for pacing. The same kit, pointed away from games, is what we'll start aiming at messier real-world problems before long -- but every one of these ideas was forged on a scoreboard first.

In case you skimmed...

AlphaZero mastered Chess, Go and Shogi with one algorithm -- a dual-head policy/value network guided by MCTS, trained purely by self-play, knowing nothing but the rules;
MCTS plans by simulating thousands of futures, using the PUCT formula to balance exploiting good moves against exploring promising-but-untried ones, and trusts the most-visited move rather than the highest-scoring guess;
sparse-reward games like Montezuma's Revenge need intrinsic motivation (curiosity) because the external reward is too rare for ordinary exploration to ever find;
real-time strategy games (Dota 2, StarCraft II) push RL to its limits -- long horizons, hidden information, vast action spaces -- and were cracked with enormous compute, PPO, self-play leagues and a lot of engineering;
reward shaping adds helpful signals, but only potential-based shaping is proven not to corrupt the optimal policy -- everything else is a calculated risk;
curriculum learning starts easy and ramps difficulty as the agent improves, dissolving exploration problems that are otherwise hopeless.

Exercises

Exercise 1: Implement Tic-Tac-Toe as a tiny game object exposing clone(), is_terminal(), get_legal_actions(), get_reward(), apply_action() and to_tensor(), then run the MCTS class above on it with a random evaluator (skip the network -- just return a uniform policy and a random value). Play 200 games of pure-MCTS against a random opponent and report the win rate. You should see MCTS alone, with no learning at all, already crush random play -- planning is powerful even before training enters the picture.

Exercise 2: Take your Tic-Tac-Toe game and wire up the full alphazero_training_loop with a small AlphaZeroNetwork (shrink the board planes to fit). Train for a few dozen iterations and plot the loss. Then play the trained agent against the pure-MCTS agent from Exercise 1 using the same number of simulations. Does the network-guided search win? By how much? This is the whole AlphaZero thesis in miniature -- intuition plus search beats search alone.

Exercise 3: Bolt the ICMModule onto a DQN agent (reuse your DQN from episode #107) on MountainCar-v0 -- a classic sparse-reward environment where the car only scores by reaching the flag. Train once with the intrinsic curiosity bonus added to the reward and once without. Compare how many episodes each takes to first reach the goal. You'll feel, first-hand, why curiosity exists: the plain agent may never get there at all, while the curious one goes looking.

Get the Tic-Tac-Toe pipeline working before the next episode -- a board you can beat your own agent on is the best possible intuition for everything we've built, and the techniques here stop being academic the moment you watch your own code learn to play. The next stop takes this game-playing machinery and starts pointing it at problems that don't come with a tidy scoreboard at all.

Bedankt en tot de volgende keer!

@scipio

Learn AI Series (#112) - RL for Games

Learn AI Series (#112) - RL for Games

What will I learn

Requirements

Difficulty

Curriculum (of the Learn AI Series):

Learn AI Series (#112) - RL for Games

Solutions to Episode #111 Exercises

Right -- on to the main event

AlphaZero: one algorithm, three games, zero human knowledge

Monte Carlo Tree Search: thinking by simulating

The training loop: the dog that teaches itself new tricks

Beyond the board: Atari and the curse of sparse reward

Real-time strategy: where RL meets the meat grinder

Reward shaping: helpful nudges and how they backfire

Curriculum learning: don't throw the toddler in the deep end

In case you skimmed...

Exercises

Bedankt en tot de volgende keer!

Curriculum (of the `Learn AI Series`):