Learn AI Series

Before we put a leash on the policy gradient, let's clear last episode's three exercises. All of them build on the PolicyNetwork, REINFORCE, REINFORCEWithBaseline and ActorCritic classes from episode #108, so I'm assuming those are imported and sitting in scope. I'm also leaning on gymnasium throughout -- pip install gymnasium if you skipped it last time.
Exercise 1: Implement plain REINFORCE (no baseline) on CartPole-v1, train it for 1,000 episodes logging the 100-episode moving average, then run it three times under different seeds and plot all three curves together -- so you can see the variance problem with your own eyes.
import gymnasium as gym
import numpy as np
import torch

# Assumes PolicyNetwork and REINFORCE from episode #108.

def train_reinforce(seed, n_episodes=1000):
    env = gym.make("CartPole-v1")
    torch.manual_seed(seed)
    np.random.seed(seed)
    agent = REINFORCE(env.observation_space.shape[0], env.action_space.n)
    rewards = []
    for ep in range(n_episodes):
        state, _ = env.reset(seed=seed + ep)
        done, trunc, total = False, False, 0.0
        while not (done or trunc):
            action = agent.choose_action(state)
            state, reward, done, trunc, _ = env.step(action)
            agent.store_reward(reward)
            total += reward
        agent.learn()  # Monte Carlo update, once per episode
        rewards.append(total)
    return rewards

def moving_average(x, w=100):
    return np.convolve(x, np.ones(w) / w, mode="valid")

curves = [moving_average(train_reinforce(s)) for s in (0, 1, 2)]
for s, c in enumerate(curves):
    print(f"seed {s}: final avg-100 = {c[-1]:6.1f} | peak = {c.max():6.1f}")
Plot the three curves on one axis and the lesson is impossible to miss: they wander all over the place. One seed might claw its way to CartPole's ceiling of 500 by episode 400 and stay there; another sputters around 80 for the whole run; a third climbs nicely and then collapses back down for no visible reason. That spread between three runs of the exact same algorithm is the variance problem made visual. Contrast it with the three DQN runs from episode #107, which would sit almost on top of one another -- experience replay and a frozen target buy you a steadiness that raw Monte Carlo policy gradient simply does not have. Same task, wildly different reliability.
Exercise 2: Add the learned value baseline (turn REINFORCE into REINFORCEWithBaseline), train both under the same seeds, and quantify the improvement -- roughly how many episodes does each take to first reach a 100-episode average of 195?
import gymnasium as gym
import numpy as np
import torch

# Assumes REINFORCE and REINFORCEWithBaseline from episode #108.

def episodes_to_solve(ctor, seed, target=195.0, n_episodes=1000):
    env = gym.make("CartPole-v1")
    torch.manual_seed(seed)
    np.random.seed(seed)
    agent = ctor(env.observation_space.shape[0], env.action_space.n)
    rewards, hit = [], None
    for ep in range(n_episodes):
        state, _ = env.reset(seed=seed + ep)
        done, trunc, total = False, False, 0.0
        while not (done or trunc):
            action = agent.choose_action(state)
            state, reward, done, trunc, _ = env.step(action)
            agent.store_reward(reward)
            total += reward
        agent.learn()
        rewards.append(total)
        if hit is None and len(rewards) >= 100 and np.mean(rewards[-100:]) >= target:
            hit = ep  # first episode the running avg clears 195
    return hit

for name, ctor in [("REINFORCE", REINFORCE),
                   ("baseline", REINFORCEWithBaseline)]:
    hits = [episodes_to_solve(ctor, s) for s in (0, 1, 2)]
    print(f"{name:>10}: episodes-to-195 per seed = {hits}")
The baseline version gets there sooner and -- more tellingly -- gets there at all on seeds where plain REINFORCE never does (you'll see a None or two creep into the bare version's row). Why? Because subtracting V(s) swaps the raw return G_t for the advantage G_t - V(s_t), and the advantage has a far smaller spread around zero. We proved last time that a state-only baseline leaves the gradient unbiased -- the expected update is unchanged -- so all you lose is noise. Less wobble in the gradient means a straighter climb to 195. Cleaner signal, same destination, fewer wrong turns.
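If you want to see that "smaller spread" without training anything, here is a toy sketch. Everything in it is made up for illustration: three hypothetical states with different typical returns, Gaussian noise on top, and a `true_value` table standing in for a well-trained critic. It only demonstrates the variance argument, not the agents above.

```python
import random
import statistics

random.seed(0)

# Hypothetical toy setup: three "states", each with its own typical return.
# true_value[s] plays the role of a well-trained critic V(s).
true_value = {0: 20.0, 1: 100.0, 2: 180.0}

def sample_return(state):
    """Noisy Monte Carlo return G_t for one visit to `state`."""
    return random.gauss(true_value[state], 10.0)

visits = [random.choice([0, 1, 2]) for _ in range(10_000)]
returns = [sample_return(s) for s in visits]

# Subtracting the state's baseline leaves only the noise around it.
advantages = [g - true_value[s] for g, s in zip(returns, visits)]

std_returns = statistics.pstdev(returns)
std_advantages = statistics.pstdev(advantages)
print(f"std of raw returns G_t       : {std_returns:6.1f}")
print(f"std of advantages G_t - V(s) : {std_advantages:6.1f}")
```

The raw returns carry both the noise and the differences between states; the advantages carry only the noise. That gap is what the baseline removes from the gradient estimate.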
Exercise 3: Add an entropy bonus to the ActorCritic agent -- compute dist.entropy() each step and add -beta * entropy to the loss -- then compare beta = 0 against beta = 0.01.
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

# Assumes ActorCritic from episode #108.

def train_with_entropy(beta, seed, n_episodes=800):
    env = gym.make("CartPole-v1")
    torch.manual_seed(seed)
    np.random.seed(seed)
    agent = ActorCritic(env.observation_space.shape[0], env.action_space.n)
    rewards = []
    for ep in range(n_episodes):
        state, _ = env.reset(seed=seed + ep)
        done, trunc, total = False, False, 0.0
        while not (done or trunc):
            state_t = torch.FloatTensor(state).unsqueeze(0)
            feats = agent.features(state_t)
            dist = Categorical(F.softmax(agent.actor(feats), dim=-1))
            action = dist.sample()
            value = agent.critic(feats).squeeze()
            next_state, reward, done, trunc, _ = env.step(action.item())
            with torch.no_grad():
                nfeats = agent.features(torch.FloatTensor(next_state).unsqueeze(0))
                next_value = agent.critic(nfeats).squeeze()
            target = reward + agent.gamma * next_value * (1 - float(done or trunc))
            advantage = target - value
            actor_loss = -dist.log_prob(action) * advantage.detach()
            critic_loss = advantage.pow(2)
            entropy = dist.entropy().mean()
            loss = actor_loss + 0.5 * critic_loss - beta * entropy  # the new term
            agent.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(agent._all_params, 0.5)
            agent.optimizer.step()
            state, total = next_state, total + reward
        rewards.append(total)
    return rewards

for beta in (0.0, 0.01):
    finals = [np.mean(train_with_entropy(beta, s)[-100:]) for s in (0, 1, 2)]
    print(f"beta={beta:<4}: final avg-100 = {np.mean(finals):6.1f} "
          f"+/- {np.std(finals):.1f}")
With beta = 0 the agent occasionally does something maddening: early on it stumbles into one action that looks great for a few episodes, the softmax piles probability onto it, and the policy collapses -- it stops exploring and gets stuck on a mediocre habit it can no longer escape. The -beta * entropy term fights exactly that. Because the optimiser is now mildly rewarded for keeping the action distribution spread out, it resists that premature certainty, and across the three seeds the final scores cluster tighter and higher. This is the same exploration-versus-exploitation tension we first wrestled with on the bandits back in episode #103 -- only here, instead of an epsilon knob bolted on the outside, exploration is encouraged from inside the loss itself. Tidy, that ;-)
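To put numbers on "spread out" versus "collapsed", here is a small stand-alone sketch in plain Python. The two logit vectors are invented for illustration, and `softmax` and `entropy` are hand-rolled helpers rather than the PyTorch calls used above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

spread = softmax([0.0, 0.0])     # undecided between two actions
collapsed = softmax([6.0, 0.0])  # almost certain about action 0

beta = 0.01
for name, probs in [("spread", spread), ("collapsed", collapsed)]:
    h = entropy(probs)
    # -beta * h is the term added to the loss: lower (better) for high entropy.
    print(f"{name:>9}: entropy = {h:.4f} | loss term = {-beta * h:+.5f}")
```

The undecided policy sits at the two-action maximum of ln 2, the collapsed one is close to zero. With beta > 0 the loss is slightly lower for the undecided policy, which is the gentle push against premature certainty.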
Right -- episode 109, and this is the one where we stop letting the policy gradient run wild and put a proper safety rail on it.
Cast your mind back over episode #108. REINFORCE and Actor-Critic both do the same brave, slightly reckless thing: they compute whatever gradient the data suggests and take a step in that direction. And most of the time that's fine. But there is no seatbelt. One unlucky batch, one step a touch too big, and a policy that took thousands of episodes to learn can be wrecked in a single update. Worse, the wreck feeds on itself: a broken policy generates broken data, which produces worse updates, which generate even worse data. A death spiral, and you sit there watching your reward curve nose-dive off a cliff with no idea why.
Trust region methods are the fix, and the core idea is almost suspiciously simple: make the biggest improvement you can, but only within a region where you still trust your gradient estimate. Step boldly inside that region, never outside it. Today we'll see the two algorithms that turned that idea into practice -- TRPO and PPO -- and we'll build the second one from scratch, because PPO is, no exaggeration, the most important single algorithm in modern reinforcement learning.
Let me make the danger concrete. Picture an agent that has learned to walk a tightrope near a cliff edge -- efficient, but with no margin for error. The gradient looks at the current policy and says: "lean a hair to the right, it's very slightly faster." Sound advice... for a tiny step. But a policy gradient step is not guaranteed to be tiny. Multiply that gradient by a learning rate that happened to be a bit large, and the agent doesn't lean a hair to the right -- it lunges, and walks straight off the edge.
Here's the crux, and it's worth saying slowly. A policy gradient is only reliable for small policy changes. It tells you the direction of improvement at the current policy -- not the direction three big steps away, where the landscape may look completely different. This is the very same trouble that made us care about learning rates and schedules back in episode #40, but in reinforcement learning it bites much harder. Why? Because in supervised learning your dataset sits still while you train. In RL the policy generates its own data -- change the policy too much and you don't just mis-step, you start collecting experience from a part of the world your gradient knew nothing about. The ground shifts under your feet because you shifted it.
So the question becomes: how do we take the largest useful step while guaranteeing we stay in the region where the gradient still tells the truth? That guarantee is what TRPO set out to provide.
TRPO (Schulman et al., 2015) formalises the trust-region idea. Instead of an ordinary gradient step, it solves a small constrained optimization problem at every update:
maximize E[ (pi_new(a|s) / pi_old(a|s)) * A(s, a) ]
subject to KL(pi_old || pi_new) <= delta
Two pieces, and both repay a careful read.
The thing being maximised is the surrogate advantage: the advantage A(s, a) (how much better an action did than expected -- the very quantity we built last episode) weighted by a probability ratio. That ratio,
r(theta) = pi_new(a|s) / pi_old(a|s)
measures how much more, or less, likely the new policy is to take the same action the old one took. If r = 1 the two policies agree perfectly on that action; if r = 2 the new policy is twice as keen; if r = 0.5 it's half as keen. Maximising r * A does the obvious sensible thing -- crank up the probability of actions that had positive advantage, crank down the ones that had negative advantage.
The second piece is the leash. KL(pi_old || pi_new) is the Kullback-Leibler divergence -- a standard measure of how different two probability distributions are (we first met KL in the context of distributions long ago; here it's quantifying "how far has the policy moved?"). The constraint <= delta says: change the policy as much as you like to boost that surrogate, but not so much that the new policy diverges from the old one by more than a small budget delta. That, right there, is the trust region, written in math.
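A tiny numeric sketch may help here. The two candidate policies and the budget delta = 0.01 are made-up values for a single state with two actions, and `kl_divergence` is a hand-written helper that assumes the second distribution gives every action non-zero probability. It only checks the constraint; it is not TRPO's solver.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

pi_old = [0.5, 0.5]  # old policy at one state
candidates = {
    "gentle": [0.55, 0.45],   # small nudge toward action 0
    "drastic": [0.95, 0.05],  # lunge toward action 0
}
delta = 0.01  # hypothetical KL budget

kl_values = {}
for name, pi_new in candidates.items():
    ratio = pi_new[0] / pi_old[0]  # r(theta) for action 0
    kl = kl_divergence(pi_old, pi_new)
    kl_values[name] = kl
    verdict = "inside" if kl <= delta else "OUTSIDE"
    print(f"{name:>8}: r = {ratio:.2f} | KL = {kl:.4f} | {verdict} the trust region")
```

If action 0 has positive advantage, the drastic policy scores higher on the surrogate r * A, yet it blows through the KL budget. The constraint is what stops the optimiser from taking that tempting lunge.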
TRPO genuinely works -- it was a real milestone. But it is a beast to implement. Solving that constrained problem means estimating the Fisher information matrix and running conjugate-gradient steps with a line search, all by hand. It's the kind of code you write once, get subtly wrong twice, and never quite enjoy maintaining. Surely, people thought, there's a cheaper way to get the same well-behaved updates? There was. It's called PPO.
PPO (Schulman et al., 2017) reaches the same destination as TRPO -- updates that never lurch too far -- with a mechanism so simple it feels like cheating: it just clips the probability ratio. No constraint, no Fisher matrix, no second-order anything. Plain first-order gradient descent, the same kind we've used since episode #7.
The clipped surrogate objective is this:
L_CLIP = E[ min( r(theta) * A, clip(r(theta), 1 - eps, 1 + eps) * A ) ]
where eps is a small number, typically 0.2. Let me unpack the min and the clip, because the interplay of those two is the entire trick.
When the advantage is positive (a good action), clip caps the ratio at 1 + eps. Past that ceiling the objective goes flat -- pushing the action's probability even higher buys no more reward in the surrogate, so the gradient vanishes. The policy is allowed to become more keen on a good action, but only up to a point.

When the advantage is negative (a bad action), clip floors the ratio at 1 - eps. Again the objective flattens out, so the policy can back away from a bad action but not slam away from it in one go.

The min is what makes the clipping bite only in the dangerous direction: it always takes the more pessimistic of the clipped and unclipped terms, so the update can never exploit the clip to take a bigger step than the raw objective would.

The net effect: a flat region in the loss landscape just outside the trust region, where the gradient is zero. The policy is free to move within [1 - eps, 1 + eps] of the old one, and gets no encouragement whatsoever to move beyond it. Same trust-region behaviour as TRPO, conjured out of a one-line clamp. Holy Macaroni, it's elegant.
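The interplay of min and clip is easiest to see by evaluating the objective for a single sample at a few ratios. This is a scalar sketch in plain Python (the function name `clipped_objective` is mine), not the batched tensor version we build further down.

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one (state, action) sample."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

for advantage in (1.0, -1.0):
    for ratio in (0.5, 0.8, 1.0, 1.2, 1.5):
        value = clipped_objective(ratio, advantage)
        print(f"A = {advantage:+.0f} | r = {ratio:.1f} | objective = {value:+.2f}")
```

With A = +1 the objective stops growing once r passes 1.2, and with A = -1 it stops improving once r drops below 0.8: those are the flat regions. In the other two corners (r = 0.5 with A = +1, r = 1.5 with A = -1) the min picks the unclipped term, so a move that made things worse is still penalised in full.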
Enough theory -- let's build the thing. We'll need a network with an actor head and a critic head (sharing a trunk, exactly the pattern from last episode), a buffer to collect a chunk of experience, the clipped update, and a training loop. We start with the network:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
import numpy as np

class PPONetwork(nn.Module):
    """Shared trunk with separate actor (policy) and critic (value) heads."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor = nn.Linear(hidden, n_actions)  # logits over actions
        self.critic = nn.Linear(hidden, 1)         # state-value estimate

    def forward(self, state):
        features = self.shared(state)
        return self.actor(features), self.critic(features)

    def get_action(self, state):
        state_t = torch.FloatTensor(state).unsqueeze(0)
        logits, value = self.forward(state_t)
        dist = Categorical(logits=logits)
        action = dist.sample()
        return action.item(), dist.log_prob(action), value.squeeze()
A small detail worth noticing: PPO conventionally uses Tanh activations in the trunk instead of the ReLU we reached for in episode #108. It's not load-bearing, but Tanh tends to keep activations bounded, which pairs nicely with the careful, bounded updates PPO is all about. Little choices like that are a good part of what separates "works in the paper" from "works on your machine".
PPO doesn't update after every step (too noisy) nor only at episode's end (too slow). It collects a fixed-length rollout -- a couple of thousand steps, say -- and then learns from that whole batch. The buffer stores the experience and, crucially, computes the advantages using GAE:
class RolloutBuffer:
    """Stores a fixed rollout and computes GAE advantages."""

    def __init__(self):
        self.states, self.actions, self.log_probs = [], [], []
        self.rewards, self.values, self.dones = [], [], []

    def store(self, state, action, log_prob, reward, value, done):
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.values.append(value)
        self.dones.append(done)

    def compute_gae(self, last_value, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation -- a lambda-weighted TD blend."""
        advantages, gae = [], 0.0
        values = self.values + [last_value]
        for t in reversed(range(len(self.rewards))):
            delta = (self.rewards[t]
                     + gamma * values[t + 1] * (1 - self.dones[t])
                     - values[t])
            gae = delta + gamma * lam * (1 - self.dones[t]) * gae
            advantages.insert(0, gae)
        advantages = torch.FloatTensor(advantages)
        returns = advantages + torch.FloatTensor(self.values)
        return returns, advantages

    def batches(self, returns, advantages, batch_size=64):
        """Yield shuffled minibatches for several epochs of updates."""
        n = len(self.states)
        idx = np.arange(n)
        np.random.shuffle(idx)
        states = torch.FloatTensor(np.array(self.states))
        actions = torch.LongTensor(self.actions)
        # Each stored log-prob has shape [1], so stacking gives [n, 1].
        # Flatten to [n]; otherwise it broadcasts against the [batch]-shaped
        # new log-probs into a [batch, batch] ratio matrix.
        old_log_probs = torch.stack(self.log_probs).detach().reshape(-1)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            yield (states[b], actions[b], old_log_probs[b],
                   returns[b], advantages[b])

    def clear(self):
        self.__init__()
GAE (Schulman et al., 2015 -- the same group, busy year) deserves a paragraph of its own, because it's the quiet hero of practical policy gradients. Remember the delta = r + gamma * V(s') - V(s) TD error from episode #106? GAE blends those one-step TD errors across many steps with an exponential weight lam (lambda). The dial does exactly what the n-step dial did back then: lam = 0 gives you the pure one-step TD advantage (low variance, but biased by the critic's mistakes), lam = 1 gives you the full Monte Carlo advantage (unbiased, but high variance). A value like 0.95 lives in the sweet spot we found empirically last time -- most of the bias gone, most of the variance tamed. It is, quite literally, the policy-gradient cousin of the n-step idea, and the same U-shaped trade-off governs both.
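The two ends of the lam dial can be checked by hand on a three-step trajectory. This is a plain-Python sketch with made-up rewards and critic values; the `gae` function here is a simplified stand-in that assumes a single trajectory with no episode boundaries inside it.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE advantages for one uninterrupted trajectory."""
    vals = list(values) + [last_value]
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * vals[t + 1] - vals[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 0.5]  # hypothetical critic outputs V(s_t)
last_value = 0.0          # the trajectory ended, nothing to bootstrap
gamma = 0.9

one_step = gae(rewards, values, last_value, gamma, lam=0.0)
monte_carlo = gae(rewards, values, last_value, gamma, lam=1.0)

# Reference: discounted returns G_t minus V(s_t), computed directly.
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.insert(0, g)
mc_check = [ret - v for ret, v in zip(returns, values)]

print("lam = 0 :", [round(a, 3) for a in one_step])
print("lam = 1 :", [round(a, 3) for a in monte_carlo])
print("G_t - V :", [round(a, 3) for a in mc_check])
```

With lam = 0 each advantage is just that step's TD error; with lam = 1 the recursion telescopes into the full discounted return minus the critic's estimate, so the last two printed rows agree.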
Now the heart of it -- the clipped update, run for several epochs over the collected rollout:
class PPOAgent:
    """PPO with the clipped surrogate objective."""

    def __init__(self, state_dim, n_actions, lr=3e-4, gamma=0.99, lam=0.95,
                 clip_eps=0.2, epochs=4, batch_size=64,
                 entropy_coef=0.01, value_coef=0.5):
        self.gamma, self.lam = gamma, lam
        self.clip_eps, self.epochs = clip_eps, epochs
        self.batch_size = batch_size
        self.entropy_coef, self.value_coef = entropy_coef, value_coef
        self.net = PPONetwork(state_dim, n_actions)
        self.opt = torch.optim.Adam(self.net.parameters(), lr=lr)
        self.buffer = RolloutBuffer()

    def choose_action(self, state):
        action, log_prob, value = self.net.get_action(state)
        return action, log_prob, value.item()

    def update(self, last_value):
        returns, advantages = self.buffer.compute_gae(
            last_value, self.gamma, self.lam)
        # normalise advantages -- a small trick that helps a lot
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        for _ in range(self.epochs):
            for (states, actions, old_log_probs,
                 b_returns, b_adv) in self.buffer.batches(
                     returns, advantages, self.batch_size):
                logits, values = self.net(states)
                dist = Categorical(logits=logits)
                new_log_probs = dist.log_prob(actions)
                entropy = dist.entropy().mean()
                ratio = torch.exp(new_log_probs - old_log_probs)  # r(theta)
                surr1 = ratio * b_adv
                surr2 = torch.clamp(ratio, 1 - self.clip_eps,
                                    1 + self.clip_eps) * b_adv
                actor_loss = -torch.min(surr1, surr2).mean()  # the clip
                critic_loss = F.mse_loss(values.squeeze(), b_returns)
                loss = (actor_loss
                        + self.value_coef * critic_loss
                        - self.entropy_coef * entropy)
                self.opt.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(self.net.parameters(), 0.5)
                self.opt.step()
        self.buffer.clear()
Three things to flag. First, ratio = exp(new_log_probs - old_log_probs) -- we compute the probability ratio in log-space and exponentiate, which is numerically far kinder than dividing two probabilities directly (a habit worth keeping everywhere in ML). Second, the same entropy bonus you wired into Actor-Critic in exercise 3 reappears here as - entropy_coef * entropy, doing the same job: keeping the policy from collapsing too soon. Third -- and this is the bit that makes PPO sample-efficient -- we loop over the same rollout for several epochs. Each batch of hard-won experience gets squeezed for multiple gradient steps, and the clipping is precisely what makes that safe: without it, reusing data several times would march the policy miles away from the distribution that generated it.
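The first of those points is easy to demonstrate. The log-probabilities below are deliberately extreme, made-up values; CartPole's two-action policy never gets near them, but the summed log-prob of a long action vector can. Plain Python is enough to show the failure.

```python
import math

# Hypothetical, deliberately tiny probabilities expressed as log-probs.
old_log_prob = -800.2
new_log_prob = -800.0

# Log-space: subtract first, exponentiate once.
ratio_log_space = math.exp(new_log_prob - old_log_prob)

# Naive: exponentiate each, then divide. Both underflow to 0.0 in float64.
try:
    ratio_naive = math.exp(new_log_prob) / math.exp(old_log_prob)
except ZeroDivisionError:
    ratio_naive = None  # 0.0 / 0.0 is not a usable ratio

print("log-space ratio:", round(ratio_log_space, 4))
print("naive ratio    :", ratio_naive)
```

The log-space version recovers a perfectly ordinary ratio of about 1.22, while the naive division has nothing left to divide. On tensors the naive route would typically produce nan instead of raising, which is arguably worse because it spreads silently through the loss.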
The loop just alternates collecting a rollout with updating on it:
def train_ppo(env, agent, total_steps=200_000, rollout_len=2048):
    state, _ = env.reset()
    episode_reward, history, steps = 0.0, [], 0
    while steps < total_steps:
        for _ in range(rollout_len):  # collect a fixed rollout
            action, log_prob, value = agent.choose_action(state)
            next_state, reward, done, trunc, _ = env.step(action)
            agent.buffer.store(state, action, log_prob, reward, value,
                               float(done or trunc))
            episode_reward += reward
            steps += 1
            if done or trunc:
                history.append(episode_reward)
                episode_reward = 0.0
                state, _ = env.reset()
            else:
                state = next_state
        with torch.no_grad():  # bootstrap the cut-off tail
            _, last_value = agent.net(
                torch.FloatTensor(state).unsqueeze(0))
        agent.update(last_value.item())
        if history:
            print(f"steps {steps:>7} | avg reward (last 20) "
                  f"{np.mean(history[-20:]):6.1f}")
    return history
Point train_ppo(gym.make("CartPole-v1"), PPOAgent(4, 2)) at the pole and watch it climb to 500 and stay there -- none of the heart-stopping collapses you saw from bare REINFORCE in exercise 1. Same problem, same hardware, a vastly calmer ride. That calmness is the whole product.
Let me be blunt about why PPO is the default you reach for unless you have a specific reason not to:
torch.clamp and .backward(). You can read the whole update in one sitting.

That last point is not hypothetical. PPO is the algorithm OpenAI used for the RLHF step that aligned ChatGPT (the human-feedback machinery we met in episode #61). It's a staple in robotics labs. It's the thing that gets quietly swapped in when a fancier method proves too fiddly. When you don't have a strong reason to pick something exotic, you pick PPO -- and you're usually right to.
PPO has a reputation for "just working", but it has a handful of knobs that genuinely move the needle. Here are the ones worth knowing:
| Parameter | Typical range | What it does |
|---|---|---|
| clip_eps | 0.1 - 0.3 | Width of the trust region. Smaller = more conservative updates |
| epochs | 3 - 10 | Gradient passes per rollout. More = more sample-efficient, but risks over-fitting the batch |
| rollout_len | 128 - 2048 | Steps gathered before each update. Longer = lower-variance advantages |
| lam (GAE) | 0.9 - 0.99 | Bias-variance of the advantage. Higher = less bias, more variance |
| entropy_coef | 0.0 - 0.05 | Exploration pressure. Higher = the policy stays more random for longer |
| lr | 1e-4 - 3e-4 | Learning rate, often linearly decayed toward zero over training |
| batch_size | 32 - 256 | Minibatch size for the epoch loop |
If you only remember two: clip_eps is your trust-region width, and entropy_coef is your insurance against premature collapse. Get those sane and PPO is forgiving about the rest.
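The "linearly decayed toward zero" in the lr row is the one entry the agent above doesn't implement, so here is a minimal sketch of the schedule itself. The helper `linear_lr` is hypothetical and kept free of PyTorch; in a real loop you would write its result into each entry of the optimiser's param_groups before every update.

```python
def linear_lr(step, total_steps, lr_start=3e-4):
    """Learning rate that falls linearly from lr_start to zero."""
    frac_remaining = max(0.0, 1.0 - step / total_steps)
    return lr_start * frac_remaining

total_steps = 200_000
for step in (0, 50_000, 100_000, 150_000, 200_000):
    print(f"step {step:>7}: lr = {linear_lr(step, total_steps):.2e}")
```

Big steps early, when the policy is poor and there's little to break; small steps late, when a good policy is worth protecting. It's the same trust-region instinct applied to the optimiser.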
[1 - eps, 1 + eps], so the objective goes flat (zero gradient) the moment the policy tries to move too far -- no second-order machinery at all;

Exercise 1: Get the PPOAgent from this episode training on CartPole-v1 and plot the per-rollout average reward. Then run an ablation on the clip: set clip_eps to something enormous like 100.0 so the clamp never triggers, and train again under the same seed. Describe how the two reward curves differ, and connect the unstable one back to the "no seatbelt" problem that opened this episode -- you are essentially turning PPO back into a multi-epoch REINFORCE-with-baseline, and it should show.
Exercise 2: Add an approximate-KL early stop to the update. After each epoch, estimate the mean KL between the old and new policies with the cheap formula mean(old_log_probs - new_log_probs), and if it exceeds a threshold (say 0.015), break out of the epoch loop before doing more updates. Log how often the early stop fires over a full training run, and explain how this re-introduces a piece of TRPO's explicit KL constraint on top of PPO's implicit clip.
Exercise 3: Adapt PPO to a continuous action space and run it on Pendulum-v1. Replace the Categorical distribution with a Normal: have the actor head output a mean (and learn a log_std parameter), sample actions from that Gaussian, and use dist.log_prob(action).sum(-1) for the ratio. Note carefully what had to change (the distribution, the action shape, the output layer) and -- more interestingly -- everything that stayed exactly the same (the clip, GAE, the epoch loop, the value loss). That invariance is the whole reason PPO travels so well across problem types.
That continuous-action version is your bridge to the harder environments coming up -- the ones where an agent doesn't just react to the world but starts trying to model it, or where several agents have to learn in each other's company. We've now got the core policy-optimization engine built and understood. Next we start pointing it at bigger, stranger worlds ;-)