Before we let a second agent into the room, let's clear last episode's three exercises. All of them reuse the DynaQ, EnvironmentModel and ModelTrainer classes from episode #110, so I'm assuming those are imported and in scope. As usual I'm leaning on gymnasium throughout (pip install gymnasium if you skipped it).
Exercise 1: Run tabular DynaQ on FrozenLake-v1 with is_slippery=False, train it with planning_steps = 0, 5, 50 under the same seed, plot episodes-to-solve, and explain why the benefit eventually plateaus.
import gymnasium as gym
import numpy as np

# Assumes the DynaQ class from episode #110 is imported.

def episodes_to_solve(planning_steps, seed, target=0.9, window=20, max_ep=2000):
    # Seed NumPy first (assuming DynaQ draws its randomness from np.random).
    np.random.seed(seed)
    env = gym.make("FrozenLake-v1", is_slippery=False)
    agent = DynaQ(env.observation_space.n, env.action_space.n,
                  planning_steps=planning_steps)
    recent = []
    for ep in range(max_ep):
        state, _ = env.reset(seed=seed + ep)
        done = False
        total = 0.0
        while not done:
            action = agent.choose_action(state)
            nxt, reward, term, trunc, _ = env.step(action)
            done = term or trunc
            agent.update(state, action, reward, nxt, float(done))
            state = nxt
            total += reward
        recent.append(total)
        # "Solved" = mean return over the last `window` episodes reaches `target`.
        if len(recent) >= window and np.mean(recent[-window:]) >= target:
            return ep + 1  # ep is 0-based, so this is the number of episodes played
    return max_ep

for p in (0, 5, 50):
    solved = episodes_to_solve(p, seed=0)
    print(f"planning_steps={p:>2}: solved in {solved} episodes")
Plot solved against planning_steps and the shape tells the story: zero planning (plain Q-Learning) crawls, five planning steps cut the episode count dramatically, fifty cut it further -- but not proportionally further. You don't get ten times the speed-up for ten times the planning. And the reason is worth stating out loud: each real transition carries a fixed amount of genuinely new information, and planning can only ever redistribute that information faster through the Q-table. It can't manufacture facts the agent hasn't witnessed yet. Once the existing experience has been fully propagated, extra planning steps are just re-deriving conclusions the table already holds. That is why the curve flattens -- planning is leverage on what you know, not a substitute for going out and learning more.
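You can watch that plateau happen without gymnasium at all. What follows is a toy sketch, not the DynaQ class from episode #110: the plan function and the four-transition corridor are names and numbers I made up for illustration, and with a single action per state a value table stands in for the Q-table.

```python
import numpy as np

def plan(transitions, n_states, sweeps, gamma=0.9):
    """Replay the same remembered transitions `sweeps` times (learning rate 1)."""
    v = np.zeros(n_states)
    for _ in range(sweeps):
        for s, r, s_next, done in transitions:
            v[s] = r + gamma * (0.0 if done else v[s_next])
    return v

# Everything the agent has witnessed so far: a 4-step corridor ending in reward 1.
seen = [(0, 0.0, 1, False), (1, 0.0, 2, False),
        (2, 0.0, 3, False), (3, 1.0, 4, True)]

for sweeps in (1, 2, 4, 50):
    print(f"{sweeps:>2} sweeps:", plan(seen, n_states=5, sweeps=sweeps).round(3))
```

Four sweeps are enough to carry the reward all the way back to state 0, and sweep 50 returns exactly the same table. The only way to change those numbers is to add a new transition to seen, which is to say: go and act in the real environment.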
Exercise 2: Collect a few thousand random-policy transitions from CartPole-v1, train the EnvironmentModel, then measure compounding error: roll the model forward k steps and compare against the true environment for k = 1, 5, 10, 20.
import gymnasium as gym
import numpy as np
import torch

# Assumes EnvironmentModel and ModelTrainer from episode #110.

def onehot(a, n):
    v = np.zeros(n, dtype=np.float32)
    v[a] = 1.0
    return v

env = gym.make("CartPole-v1")
n_act = env.action_space.n
model = EnvironmentModel(env.observation_space.shape[0], n_act)
trainer = ModelTrainer(model)

# 1. Gather random experience and train the dynamics model.
s, _ = env.reset(seed=0)
for _ in range(5000):
    a = env.action_space.sample()
    ns, r, term, trunc, _ = env.step(a)
    trainer.add_transition(s, onehot(a, n_act), r, ns)
    s = ns if not (term or trunc) else env.reset()[0]
for _ in range(3000):
    trainer.train_step()

# 2. Roll the model forward k steps and compare to the real env.
for k in (1, 5, 10, 20):
    errors = []
    for trial in range(200):
        s, _ = env.reset(seed=1000 + trial)
        pred = torch.FloatTensor(s).unsqueeze(0)
        real = s
        for _ in range(k):
            a = env.action_space.sample()
            a_t = torch.FloatTensor(onehot(a, n_act)).unsqueeze(0)
            with torch.no_grad():
                pred, _ = model(pred, a_t)          # imagined step
            real, _, term, trunc, _ = env.step(a)   # real step
            if term or trunc:
                break
        else:
            # Score only rollouts that survived all k steps; an episode that
            # ended early would otherwise be counted at a shorter horizon.
            errors.append(np.linalg.norm(pred.squeeze().numpy() - real))
    print(f"k={k:>2}: mean state error = {np.mean(errors):.4f}")
Watch the printed errors climb as k grows, and not linearly -- they accelerate. This is the 0.95 ** k snowball from episode #110, except now you've measured it on a real environment instead of trusting my arithmetic. The model's one-step prediction is excellent; its twenty-step prediction is a fairy tale. Feeding the model its own output as the next input means every small mistake becomes the foundation for the next prediction, and the foundation rots. This is exactly why serious model-based agents keep their imagined rollouts short -- the dream is only trustworthy near the present.
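If you'd like to see the snowball with the neural network taken out of the picture, here is a minimal sketch. The scalar dynamics, the 1.02 and 1.03 gains and the rollout helper are all assumptions of mine, chosen only so that the "model" is about one percent off per step.

```python
def rollout(x0, gain, bias, k):
    """Apply x <- gain * x + bias for k steps, feeding each output back in."""
    x = x0
    for _ in range(k):
        x = gain * x + bias
    return x

# Assumed toy numbers: the learned model's gain is roughly 1% too high.
TRUE_GAIN, MODEL_GAIN, BIAS = 1.02, 1.03, 0.1

errors = {k: abs(rollout(1.0, MODEL_GAIN, BIAS, k) - rollout(1.0, TRUE_GAIN, BIAS, k))
          for k in (1, 5, 10, 20)}

for k, err in errors.items():
    print(f"k={k:>2}: error = {err:.4f}  ({err / errors[1]:.1f}x the one-step error)")
```

The twenty-step error comes out far more than twenty times the one-step error, because every step multiplies the mistakes already made as well as adding a fresh one.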
Exercise 3: Build an ensemble of five EnvironmentModels trained on the same data with different seeds, compute the variance of their predictions per state, and argue how you'd use that variance as an uncertainty signal.
import torch
import numpy as np

# Assumes EnvironmentModel and ModelTrainer are in scope, plus `env`, `n_act`,
# `onehot` and the filled `trainer.buffer` from the Exercise 2 script.

def train_one(seed, state_dim, n_act, steps=3000):
    torch.manual_seed(seed)
    m = EnvironmentModel(state_dim, n_act)
    t = ModelTrainer(m)
    t.buffer = trainer.buffer   # share the same real data
    for _ in range(steps):
        t.train_step()
    return m

state_dim = env.observation_space.shape[0]
ensemble = [train_one(seed, state_dim, n_act) for seed in range(5)]

# For a batch of states + actions, measure how much the 5 models DISAGREE.
states = torch.FloatTensor(np.array([env.reset(seed=s)[0] for s in range(64)]))
actions = torch.stack([torch.FloatTensor(onehot(env.action_space.sample(), n_act))
                       for _ in range(64)])
with torch.no_grad():
    preds = torch.stack([m(states, actions)[0] for m in ensemble])  # (5, 64, dim)
disagreement = preds.var(dim=0).mean(dim=-1)   # per-state variance
print("most uncertain states:", disagreement.topk(5).indices.tolist())
The trick is that five models trained on the same data will agree closely where that data was dense (they've all seen plenty of evidence) and scatter wildly where it was sparse (each one is guessing, and they guess differently). High variance across the ensemble is therefore a cheap, self-supervised flag for "we are off the edge of the map here". You'd use it as a brake: while imagining a rollout, watch the ensemble disagreement, and the moment it spikes past some threshold, stop trusting the dream and either truncate the rollout or fall back to a model-free value estimate. No labels required -- the disagreement is the uncertainty. That little idea is the difference between a model-based agent that quietly hallucinates jackpots and one that knows when to keep its mouth shut ;-)
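Here is roughly what that brake could look like in code. Treat it as a sketch under stated assumptions: guarded_rollout is a name I invented, the "models" are five toy linear functions rather than trained EnvironmentModels, and the 0.01 threshold is arbitrary.

```python
import numpy as np

def guarded_rollout(ensemble, state, actions, threshold):
    """Imagine a rollout, but stop as soon as the ensemble disagrees too much.

    ensemble: list of callables f(state, action) -> next_state (toy stand-ins).
    Returns the imagined states that were still trusted.
    """
    trusted = []
    for action in actions:
        preds = np.stack([f(state, action) for f in ensemble])   # (n_models, dim)
        disagreement = preds.var(axis=0).mean()
        if disagreement > threshold:
            break                      # off the edge of the map: truncate here
        state = preds.mean(axis=0)     # carry on from the ensemble mean
        trusted.append(state)
    return trusted

# Toy ensemble: five linear models with slightly different gains, so their
# disagreement grows with the magnitude of the state.
gains = [1.40, 1.45, 1.50, 1.55, 1.60]
ensemble = [lambda s, a, g=g: g * s + a for g in gains]

path = guarded_rollout(ensemble, np.array([0.1]), actions=[0.1] * 20, threshold=0.01)
print(f"trusted {len(path)} of 20 imagined steps")
```

The rollout is cut off long before the twentieth step, at the point where the five predictions have drifted too far apart. In a real agent you'd swap the break for whatever fallback you prefer, for instance bootstrapping from a model-free value estimate at the last trusted state.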
Right -- so notice what every single thing we've built since episode #102 has in common. Q-Learning, SARSA, DQN, REINFORCE, Actor-Critic, PPO, the model-based machinery from last time -- all of it assumes one agent in one world. The agent acts, the world responds, the agent learns. The world is a patient, indifferent backdrop that doesn't have opinions about you.
Today we tear that assumption up. Because the most interesting problems on the planet are not one agent against a backdrop -- they're many agents against each other and alongside each other. Traffic is thousands of drivers, each optimising their own commute. Markets are millions of traders, each trying to outwit the rest. Every board game ever invented is a fight. Even a warehouse full of robots fetching boxes is a team that has to not crash into each other. Multi-agent reinforcement learning (MARL) is what happens when you put more than one learner in the room, and -- fair warning -- it is gloriously, exasperatingly harder than what came before.
There's a single core difficulty in MARL, and if you understand it you understand why the whole field looks the way it does. It's called non-stationarity, and here's the plain-English version.
In single-agent RL, the environment has fixed rules. Pull lever A, get outcome X -- maybe with some randomness, but the distribution of outcomes doesn't change underneath you. That stability is exactly what lets Q-Learning converge: you're estimating a fixed target, like surveying a mountain that stays put while you measure it.
Now drop a second learning agent into that environment. From your perspective, that other agent is part of the environment -- and it is changing its behaviour as it learns. The rules you're trying to estimate are shifting every time your opponent improves. Your best response depends on their strategy; their best response depends on yours; and both are moving. You're surveying a mountain that rearranges itself every time you take a reading. That is non-stationarity, and it quietly breaks the convergence guarantees we relied on. Having said that, it's also what makes MARL fascinating -- the agents are co-authoring the problem they're solving.
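A tiny worked example makes the moving target concrete. I'm using matching pennies here, which is my own choice of illustration: you win +1 if your coin matches the opponent's and lose 1 otherwise. The helper names are hypothetical.

```python
def heads_value(opponent_p_heads):
    """Expected payoff of playing heads when you are the 'matcher'."""
    return opponent_p_heads * 1 + (1 - opponent_p_heads) * -1

def best_response(opponent_p_heads):
    return "heads" if heads_value(opponent_p_heads) >= 0 else "tails"

# The opponent "learns", drifting from mostly-heads to mostly-tails.
for p in (0.9, 0.7, 0.5, 0.3, 0.1):
    print(f"opponent P(heads)={p:.1f}  value of heads={heads_value(p):+.1f}  "
          f"best response={best_response(p)}")
```

Same game, same rules, same action -- and yet the value of playing heads slides from positive to negative purely because the other agent changed its mind. Any Q-value you estimated against the old opponent is now describing a world that has moved on.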
Multi-agent problems sort into three buckets, and the bucket decides almost everything about which algorithm you reach for.
Cooperative -- all agents share one reward. A team of robots carrying a heavy table: everybody wins or everybody fails, together. The challenge here isn't conflict, it's coordination and credit assignment (when the team scores, whose actions deserve the credit?).
Competitive -- zero-sum. One agent's gain is exactly another's loss. Chess, Go, poker, two-player anything. What's good for you is by definition bad for your opponent.
Mixed -- partially aligned, partially opposed. This is most of real life. On a motorway everyone shares the goal of arriving safely, but each driver would also quite like to merge in front of you. Cooperation and competition tangled together.
Keep these three in mind, because each of the techniques below is really an answer to "which bucket are we in?".
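For two-player games written as a payoff table, you can even let the code do the sorting. This is a rough sketch only: classify is a hypothetical helper, it treats "competitive" as constant-sum, and the table-carrying and matching-pennies payoffs are toy numbers I picked. The third table is the Prisoner's Dilemma from later in this episode.

```python
def classify(payoffs):
    """payoffs: dict mapping (action_1, action_2) -> (reward_1, reward_2)."""
    rewards = list(payoffs.values())
    if all(r1 == r2 for r1, r2 in rewards):
        return "cooperative"      # one shared reward
    if len({r1 + r2 for r1, r2 in rewards}) == 1:
        return "competitive"      # constant-sum: my gain is exactly your loss
    return "mixed"

carry_table = {(0, 0): (1, 1), (0, 1): (0, 0), (1, 0): (0, 0), (1, 1): (1, 1)}
matching_pennies = {(0, 0): (1, -1), (0, 1): (-1, 1), (1, 0): (-1, 1), (1, 1): (1, -1)}
prisoners_dilemma = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

for name, game in [("carry the table", carry_table),
                   ("matching pennies", matching_pennies),
                   ("prisoner's dilemma", prisoners_dilemma)]:
    print(f"{name:>18}: {classify(game)}")
```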
The simplest thing you can possibly do is pretend the problem isn't multi-agent. Give each agent its own single-agent algorithm and let them all learn in parallel, blissfully unaware of one another. Agent 1 runs DQN, Agent 2 runs DQN, and neither one knows the other exists -- the other agents just look like a slightly weird, twitchy environment.
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class IndependentAgent:
    """Each agent learns on its own with a private Q-network (independent DQN)."""

    def __init__(self, obs_dim, n_actions, agent_id=0, lr=1e-3, gamma=0.99):
        self.agent_id = agent_id
        self.obs_dim = obs_dim
        self.n_actions = n_actions
        self.gamma = gamma
        # Start fully random, then decay towards mostly-greedy play.
        # (The floor and decay rate are assumed, untuned values.)
        self.epsilon = 1.0
        self.epsilon_min = 0.05
        self.epsilon_decay = 0.999
        self.q_net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=lr)
        self.buffer = deque(maxlen=50000)

    def choose_action(self, obs):
        # Epsilon-greedy over this agent's own Q-values.
        if random.random() < self.epsilon:
            return random.randint(0, self.n_actions - 1)
        with torch.no_grad():
            q = self.q_net(torch.FloatTensor(obs))
        return q.argmax().item()

    def store(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def learn(self, batch_size=64):
        if len(self.buffer) < batch_size:
            return
        batch = random.sample(self.buffer, batch_size)
        obs, acts, rews, next_obs, dones = zip(*batch)
        obs_t = torch.FloatTensor(np.array(obs))
        acts_t = torch.LongTensor(acts)
        rews_t = torch.FloatTensor(rews)
        next_obs_t = torch.FloatTensor(np.array(next_obs))
        dones_t = torch.FloatTensor(dones)
        # Q(s, a) for the actions actually taken; squeeze(1) keeps the batch dim.
        q_vals = self.q_net(obs_t).gather(1, acts_t.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # No separate target network here, to keep the example short.
            next_q = self.q_net(next_obs_t).max(dim=1)[0]
            targets = rews_t + self.gamma * next_q * (1 - dones_t)
        loss = nn.functional.mse_loss(q_vals, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # Without this decay epsilon stays at 1.0 and the agent never acts greedily.
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
And here's the honest truth: this sometimes works. For cooperative tasks where the agents barely interfere with each other, independent learning finds perfectly decent policies, and it's wonderfully simple to implement. But it sits on a cracked foundation -- the experience-replay buffer is the giveaway. A transition you stored ten thousand steps ago was generated against an old version of the other agents, who have since changed. You're learning from stale snapshots of a world that no longer exists. In competitive settings independent learners frequently cycle forever, each one chasing the other's last move like two cats spinning after their tails. It's the right place to start, and the wrong place to stop.
The idea that rescued cooperative MARL is a mouthful but the intuition is lovely: centralized training, decentralized execution (CTDE). During training -- when you're offline, in a simulator, with a god's-eye view -- you let the learning algorithm see everything: every agent's observation, every agent's action, the full global state. But the policies you actually deploy each see only their own local observation. Train together with full information; act alone with partial information. You get the coordination benefits of shared knowledge without the impossible requirement that, at game time, every robot can read every other robot's mind.
The flagship CTDE algorithm for cooperative tasks is QMIX. Each agent keeps its own little Q-network over its local observation. Then a mixing network combines all those individual Q-values into one global Q-value -- and crucially, it's constrained to be monotonic in each agent's Q. Why monotonic? Because then "the action that's best globally" decomposes neatly into "each agent picks the action that's best locally", which is exactly what you need if each agent has to act on its own at execution time.
class QMIXAgent:
    """One agent's private utility network in QMIX."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        self.q_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def get_q_values(self, obs):
        return self.q_net(obs)

class QMIXMixer(nn.Module):
    """Mixing network: folds individual Q-values into one Q_total.
    Weights are forced non-negative -> the monotonicity constraint."""

    def __init__(self, n_agents, state_dim, mixing_dim=32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetworks generate the mixing weights FROM the global state.
        self.hyper_w1 = nn.Sequential(
            nn.Linear(state_dim, mixing_dim), nn.ReLU(),
            nn.Linear(mixing_dim, n_agents * mixing_dim),
        )
        self.hyper_b1 = nn.Linear(state_dim, mixing_dim)
        self.hyper_w2 = nn.Sequential(
            nn.Linear(state_dim, mixing_dim), nn.ReLU(),
            nn.Linear(mixing_dim, mixing_dim),
        )
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, mixing_dim), nn.ReLU(),
            nn.Linear(mixing_dim, 1),
        )

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents)  - each agent's chosen Q-value
        # state:    (batch, state_dim) - the global state
        batch_size = agent_qs.size(0)
        # torch.abs keeps the weights non-negative -> monotonic mixing.
        w1 = torch.abs(self.hyper_w1(state)).view(batch_size, self.n_agents, -1)
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(batch_size, -1, 1)
        b2 = self.hyper_b2(state).unsqueeze(1)
        q_total = torch.bmm(hidden, w2) + b2   # (batch, 1, 1)
        # view() rather than squeeze(), so a batch of one stays shape (1,).
        return q_total.view(batch_size)
Two things deserve a second look. First, the mixing weights aren't fixed parameters -- they're generated from the global state by little hypernetworks (networks that output the weights of another network). That means the way agents' contributions are combined can change depending on the situation, which is exactly what you want: in some states agent 1 matters more, in others agent 3 does. Second, the torch.abs on every weight is the entire monotonicity trick in one function call (the ReLU in between is monotonic too, so it doesn't spoil anything). Non-negative weights mean "if any agent's Q goes up, Q_total can't go down", and that's the property that lets decentralized greedy action selection pick the same joint action as maximising Q_total centrally. Elegant, no? The hard global coordination problem gets smuggled into training, and execution stays trivially simple.
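That decomposition claim is easy to poke at by brute force. The snippet below is a NumPy sketch, not the QMIXMixer itself: the weights are random numbers instead of hypernetwork outputs, and mix and greedy_matches_joint are names I made up. It enumerates every joint action and checks whether "everyone picks their own best action" really lands on the best Q_total.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions, mixing_dim = 3, 4, 8

def mix(chosen_qs, w1, b1, w2, b2):
    """Two-layer mixer with a ReLU hidden layer (random stand-in weights)."""
    hidden = np.maximum(chosen_qs @ w1 + b1, 0.0)
    return float(hidden @ w2 + b2)

def greedy_matches_joint(non_negative):
    q = rng.normal(size=(n_agents, n_actions))      # per-agent Q-values
    w1 = rng.normal(size=(n_agents, mixing_dim))
    w2 = rng.normal(size=mixing_dim)
    b1, b2 = rng.normal(size=mixing_dim), rng.normal()
    if non_negative:
        w1, w2 = np.abs(w1), np.abs(w2)             # the monotonicity constraint

    def q_total(joint):
        return mix(q[np.arange(n_agents), list(joint)], w1, b1, w2, b2)

    local = tuple(int(a) for a in q.argmax(axis=1))  # decentralized greedy choice
    best = max(q_total(j)
               for j in itertools.product(range(n_actions), repeat=n_agents))
    return bool(np.isclose(q_total(local), best))

for flag in (True, False):
    hits = sum(greedy_matches_joint(flag) for _ in range(200))
    print(f"non-negative weights={flag}: greedy hit the joint optimum in {hits}/200 trials")
```

With non-negative weights the local greedy choice reaches the joint optimum every single time; let the weights go negative and it only gets there by luck.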
Cooperative settled, let's talk fighting. Competitive games have an awkward chicken-and-egg problem: to train a strong agent you need a strong opponent, but where do you get one before you've trained anything? Worse, if you train against a fixed opponent, your agent doesn't learn to play well -- it learns to beat that one specific opponent, exploiting its particular quirks. Show it a different style and it falls apart.
Self-play is the way out, and it's beautiful in its simplicity: the agent trains against copies of itself.
class SelfPlayTrainer:
    """Train an agent by pitting it against past versions of itself.

    agent_class must be constructible as agent_class(obs_dim, n_actions) and
    expose q_net, epsilon, choose_action, store and learn.
    """

    def __init__(self, agent_class, obs_dim, n_actions):
        self.agent_class = agent_class
        self.obs_dim = obs_dim
        self.n_actions = n_actions
        self.current_agent = agent_class(obs_dim, n_actions)
        self.opponent_pool = []    # frozen snapshots of past selves
        self.save_interval = 100   # snapshot every N episodes

    def _clone_current(self):
        """Frozen copy of the current agent: same weights, same exploration rate."""
        clone = self.agent_class(self.obs_dim, self.n_actions)
        clone.q_net.load_state_dict(self.current_agent.q_net.state_dict())
        # A fresh agent would otherwise start at its default epsilon and play randomly.
        clone.epsilon = self.current_agent.epsilon
        return clone

    def get_opponent(self):
        """Sample an opponent: usually the latest self, sometimes an old one."""
        if not self.opponent_pool or random.random() < 0.8:
            return self._clone_current()
        return random.choice(self.opponent_pool)

    def train_episode(self, env):
        # `env` is assumed to be a two-player environment exposing
        # get_obs(player=...) and step(action_1, action_2) -> (_, rewards, done).
        opponent = self.get_opponent()
        env.reset()
        done = False
        while not done:
            obs_1 = env.get_obs(player=1)
            action_1 = self.current_agent.choose_action(obs_1)
            obs_2 = env.get_obs(player=2)
            action_2 = opponent.choose_action(obs_2)
            _, rewards, done = env.step(action_1, action_2)
            # Only the current agent learns; the opponent is frozen.
            self.current_agent.store(obs_1, action_1, rewards[0],
                                     env.get_obs(player=1), done)
            self.current_agent.learn()

    def maybe_snapshot(self, episode):
        if episode % self.save_interval == 0:
            self.opponent_pool.append(self._clone_current())
This is the engine behind AlphaGo and its descendants. The agent starts out hopeless, plays itself, gets slightly less hopeless, plays the improved version, improves again -- a bootstrap that climbed all the way to superhuman Go. Nota bene the opponent_pool: keeping a stable of historical selves, not just the latest one, is what helps stop the training from cycling. If you only ever play the current version, the two of you can fall into a rock-paper-scissors loop where each "beats" the last without anyone actually getting better. Forcing the agent to stay strong against its past versions as well pushes the improvement toward genuine progress instead of a circle. It's a surprisingly deep little detail dressed up as a list of checkpoints.
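To see the cycling argument without training anything, here's an idealised sketch on rock-paper-scissors. The big assumption: each new "generation" plays an exact best response, either to the latest version only or to the whole pool weighted equally, instead of being a learned network. The run and best_response helpers are hypothetical, and this is not the SelfPlayTrainer above.

```python
import numpy as np

# PAYOFF[mine, theirs] from my point of view; 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def best_response(opponent_counts):
    """Pure action with the highest payoff against the given mix of opponents."""
    return int(np.argmax(PAYOFF @ opponent_counts))

def run(generations, use_pool):
    history = [0]                       # generation 0 always plays rock
    for _ in range(generations):
        if use_pool:
            counts = np.bincount(history, minlength=3)   # every past self
        else:
            counts = np.eye(3, dtype=int)[history[-1]]   # latest self only
        history.append(best_response(counts))
    return history

print("latest only :", run(9, use_pool=False))
freqs = np.bincount(run(3000, use_pool=True), minlength=3) / 3001
print("with pool   : action frequencies =", freqs.round(3))
```

Against the latest self only, the sequence marches rock, paper, scissors, rock forever. Against the whole pool the individual choices still rotate, but the overall mix of play drifts toward one third each, which is the mix nobody can exploit.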
In cooperative settings agents often do better if they can share what they each see -- but here's the lovely question: what should they actually say? Rather than hand-designing a communication protocol like some 1970s networking committee, we can just let the agents learn their own language by making the messages differentiable and training them end-to-end with the rest of the policy.
class CommAgent(nn.Module):
    """An agent that learns to emit and consume messages."""

    def __init__(self, obs_dim, n_actions, msg_dim=16, hidden=64):
        super().__init__()
        self.msg_dim = msg_dim
        self.obs_encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
        )
        self.msg_generator = nn.Sequential(   # what to broadcast to teammates
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, msg_dim),
        )
        self.policy = nn.Sequential(          # act on obs + received messages
            nn.Linear(hidden + msg_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def encode(self, obs):
        return self.obs_encoder(obs)

    def generate_message(self, encoded_obs):
        return self.msg_generator(encoded_obs)

    def act(self, encoded_obs, received_messages):
        if len(received_messages) > 0:
            # Average the teammates' messages into one fixed-size vector.
            msg_aggregate = torch.stack(received_messages).mean(dim=0)
        else:
            # No messages: use zeros that match encoded_obs's leading (batch) dims.
            msg_aggregate = encoded_obs.new_zeros(
                (*encoded_obs.shape[:-1], self.msg_dim))
        combined = torch.cat([encoded_obs, msg_aggregate], dim=-1)
        return self.policy(combined)
Because the whole pipeline -- encode, message, act -- is differentiable, the gradient from "did we win?" flows all the way back into "what should I have said?". The agents discover, on their own, which signals are worth sending. CommNet and TarMAC are the well-known architectures here, and the eyebrow-raising finding from this line of work is that agents develop their own private languages -- communication patterns no human designed and that we sometimes can't even decode, yet which coordinate the team beautifully. They invent a vocabulary because the vocabulary helps them win. Wowzers.
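You can check that gradient path yourself in a dozen lines. This is a stripped-down sketch with single linear layers standing in for the pieces of CommAgent; the speaker and listener names, the sizes, and the pretend "correct action" label are all my own inventions for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, msg_dim, hidden, n_actions = 4, 3, 8, 2

# Hypothetical stand-ins for two teammates: one speaks, the other acts.
speaker_encoder = nn.Linear(obs_dim, hidden)
speaker_msg = nn.Linear(hidden, msg_dim)
listener_encoder = nn.Linear(obs_dim, hidden)
listener_policy = nn.Linear(hidden + msg_dim, n_actions)

speaker_obs = torch.randn(1, obs_dim)
listener_obs = torch.randn(1, obs_dim)

# The speaker turns its observation into a message ...
message = speaker_msg(torch.relu(speaker_encoder(speaker_obs)))
# ... and the listener acts on its own observation plus that message.
listener_features = torch.relu(listener_encoder(listener_obs))
logits = listener_policy(torch.cat([listener_features, message], dim=-1))

# Pretend action 0 was the right call and score the LISTENER's choice only.
loss = nn.functional.cross_entropy(logits, torch.tensor([0]))
loss.backward()

grad_norm = speaker_msg.weight.grad.norm().item()
print(f"gradient reaching the speaker's message layer: {grad_norm:.4f}")
```

The loss was computed purely from the listener's action, yet the speaker's message layer receives a non-zero gradient. That is the whole mechanism: the listener's mistakes become the speaker's lesson about what to say next time.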
Here's my favourite thing in all of MARL, and it's not an algorithm -- it's a phenomenon. When you put learning agents under competitive pressure, strategies emerge that nobody designed, wrote down, or anticipated.
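The cleanest demonstration is OpenAI's hide-and-seek experiment from 2019. Two teams, hiders and seekers, a handful of moveable boxes and ramps, and dirt-simple rewards: hiders score for staying unseen, seekers score for spotting them. That's it. Over hundreds of millions of episodes, the agents climbed an entire ladder of strategy on their own:
Running and chasing -- seekers learn to pursue the hiders, and hiders learn to flee.
Fort building -- hiders use the boxes to barricade themselves into a room.
Ramp use -- seekers drag a ramp up to the wall and climb over the barricade.
Ramp defence -- hiders lock the ramps, or pull them inside, before sealing the fort.
Box surfing -- seekers climb onto a box and ride it over the walls.
Surf defence -- hiders lock down every box and ramp before hiding.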
Nobody programmed a single one of those moves. Each was invented by the agents, and each invention was a response to the previous one -- an arms race, run in silico, producing tool-use and counter-strategy out of nothing but competitive pressure and a scoreboard. The same fundamental story produced superhuman Go, StarCraft and Dota agents. This, to me, is the whole promise of MARL in one experiment: you don't engineer the cleverness, you engineer the pressure, and the cleverness grows.
Mixed settings give us social dilemmas -- situations where what's rational for the individual is ruinous for the group. The textbook case is the Prisoner's Dilemma, and it's worth coding up because MARL agents reproduce its famous lesson on their own.
class IteratedPrisonersDilemma:
    """Two agents play repeated Prisoner's Dilemma. 0 = cooperate, 1 = defect."""

    def __init__(self):
        self.payoffs = {
            (0, 0): (3, 3),   # both cooperate -> mutual reward
            (0, 1): (0, 5),   # I cooperate, you defect -> I'm the sucker
            (1, 0): (5, 0),   # I defect, you cooperate -> I exploit you
            (1, 1): (1, 1),   # both defect -> mutual punishment
        }

    def step(self, action_1, action_2):
        return self.payoffs[(action_1, action_2)]
Play this once and cold logic says defect: whatever the other does, you score higher by defecting. Two rational agents both reason this way, both defect, and walk away with (1, 1) -- when they could have had (3, 3) by cooperating. Individual rationality, collective stupidity. But play it repeatedly, and something hopeful can happen: MARL agents are able to learn to cooperate, because mutual cooperation (3+3 every round) crushes mutual defection (1+1 every round) over a long game. In the good runs they rediscover something very like tit-for-tat all by themselves -- open friendly, punish a defection, forgive once the other returns to cooperating. The shadow of the future teaches selfish agents to be decent. There's a whole essay about human society hiding in that payoff table, but I'll leave that one to you ;-)
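The arithmetic behind that paragraph fits in a few lines with hand-written strategies and no learning at all. The play helper and the two strategy functions are hypothetical names of mine; the payoffs are the ones from the table above.

```python
# Same payoff table as above: 0 = cooperate, 1 = defect.
PAYOFFS = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

def tit_for_tat(opponent_history):
    """Open friendly, then copy whatever the opponent did last round."""
    return opponent_history[-1] if opponent_history else 0

def always_defect(opponent_history):
    return 1

def play(strategy_1, strategy_2, rounds=100):
    seen_by_1, seen_by_2 = [], []   # each player's record of the other's moves
    total_1 = total_2 = 0
    for _ in range(rounds):
        a1, a2 = strategy_1(seen_by_1), strategy_2(seen_by_2)
        r1, r2 = PAYOFFS[(a1, a2)]
        total_1, total_2 = total_1 + r1, total_2 + r2
        seen_by_1.append(a2)
        seen_by_2.append(a1)
    return total_1, total_2

# One-shot logic: defecting beats cooperating against either opponent move.
print(all(PAYOFFS[(1, b)][0] > PAYOFFS[(0, b)][0] for b in (0, 1)))   # True

print(play(tit_for_tat, tit_for_tat))       # (300, 300)
print(play(always_defect, always_defect))   # (100, 100)
print(play(tit_for_tat, always_defect))     # (99, 104)
```

The defector wins the head-to-head by five points, and still ends up with roughly a third of what two tit-for-tat players earn together. That gap is the incentive a learning agent has to find.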
I don't want to oversell this. MARL is difficult, and not just philosophically. The joint action space explodes: 2 agents with 5 actions each give you 25 joint actions; 10 agents give you 5 to the power of 10, which is about 10 million. Credit assignment turns brutal -- the team won, lovely, but which agent's choices actually mattered? Non-stationarity, as we said up top, quietly voids the convergence guarantees. And training can be maddeningly unstable, with agents chasing each other round in circles.
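The joint-action explosion is worth computing once, just to feel it. A two-line sketch (joint_actions is a throwaway helper name):

```python
def joint_actions(n_agents, n_actions=5):
    """Number of distinct joint actions when every agent has n_actions choices."""
    return n_actions ** n_agents

for n in (2, 5, 10, 20):
    print(f"{n:>2} agents -> {joint_actions(n):,} joint actions")
```

Ten agents already give 9,765,625 joint actions, and every extra agent multiplies that by five again, which is why nobody enumerates the joint space directly.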
The state of the art handles up to a few hundred agents in simplified worlds. The genuinely enormous problems -- traffic with millions of vehicles, a real financial market -- remain out of reach for end-to-end MARL, though clever approximations chip away at the edges. So treat this episode as a map of the terrain, not a claim that the terrain is conquered. Apart from the headline successes (games, mostly), there's a lot of open ground here.
Exercise 1: Take two IndependentAgents and have them play the IteratedPrisonersDilemma for, say, 5000 rounds, feeding each agent the last joint action as its observation (a 4-dim one-hot). Track the cooperation rate over time. Do your independent learners drift toward mutual cooperation, mutual defection, or something oscillating? Run it under three different seeds and report whether the outcome is stable -- this is your first taste of how seed-sensitive MARL can be.
Exercise 2: Implement a minimal SelfPlayTrainer for a trivial symmetric game (rock-paper-scissors is perfect -- one stateless step, three actions). Train with the historical opponent_pool and without it (always playing the latest self). Plot the action distribution over training for both. You should be able to see the no-pool version cycling through rock -> paper -> scissors -> rock, while the pooled version settles toward the uniform (1/3, 1/3, 1/3) mix that no opponent can exploit.
Exercise 3: Using the QMIXMixer, write a tiny numerical check of the monotonicity property. Generate random agent_qs and a random state, compute Q_total, then nudge one agent's Q-value upward by a small amount and recompute. Confirm Q_total never decreases, across a few thousand random trials. Then -- the interesting part -- temporarily delete the torch.abs calls and show the property breaks. You'll have demonstrated, with your own code, exactly why that one function call is load-bearing.
The communication and emergence ideas we just met don't stay theoretical for long -- they're precisely what powers the agents that have started beating the world's best human players at games we once thought were safely ours. That's where we're headed next: taking this multi-agent machinery and pointing it at genuine, hard games, where self-play and emergent strategy stop being curiosities and start winning tournaments. Get the independent-learner and self-play exercises under your fingers now, because the next stop assumes you've felt these dynamics yourself, not just read about them.