Xiaohei's Blog

Preface#

Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space. Due to length constraints, the Chapter 12 Mubu notes are there. If you find any logical mistakes, please let me know in the comments—thank you!


If REINFORCE feels like “it can learn but it’s very shaky,” PPO often feels like “finally this looks like an engineering algorithm.”

PPO addresses a very practical problem:

On-policy methods need fresh data for every update, so sample collection becomes a bottomless pit; off-policy methods can reuse data and are sample-efficient, but once an update pushes the policy too far from the data distribution, training can easily collapse. PPO essentially balances these two needs: it wants to squeeze more out of each batch of data, but it doesn't want the policy to take too large a step per update.

So the docs first introduce importance sampling, and then arrive at the key point of PPO: limit the gap between the new and old policies.

Importance sampling: estimate a new policy from old data#

The docs say "use importance sampling to go from on-policy to off-policy." Intuitively:

  • data is collected by the old policy $\pi_{\theta'}$;
  • but we want to optimize the new policy $\pi_\theta$;
  • use a ratio to compensate for the distribution mismatch:

$$r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}$$

The issue is: if the new and old policies differ too much, this ratio can explode or go to 0, making updates highly unstable.
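To make the ratio and its failure mode concrete, here is a minimal PyTorch sketch (the function name and the toy numbers are mine, not from the docs); the ratio is computed in log space from log-probabilities stored at rollout time:

```python
import torch

def importance_ratio(new_logp: torch.Tensor, old_logp: torch.Tensor) -> torch.Tensor:
    """Per-step importance weight r_t(theta) = pi_theta(a_t|s_t) / pi_theta'(a_t|s_t).

    Working in log space and exponentiating the difference is numerically safer
    than dividing two tiny probabilities directly.
    """
    return torch.exp(new_logp - old_logp)

# Toy illustration of the instability: when the new policy drifts away from the
# old one, the ratio collapses toward 0 (or blows up), and the gradient scale with it.
old_logp = torch.tensor([-1.0, -1.0, -1.0])
new_logp = torch.tensor([-1.1, -0.9, -4.0])   # last action became far less likely
print(importance_ratio(new_logp, old_logp))   # roughly [0.90, 1.11, 0.05]
```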

PPO’s core: don’t update too aggressively#

The docs describe PPO's goal: avoid $p_\theta(a|s)$ and $p_{\theta'}(a|s)$ being too different.

TRPO enforces a KL-divergence constraint (but is complex to implement); PPO puts the “constraint” into the objective function, making optimization look like ordinary gradient methods.
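For reference, PPO also has a penalty variant that writes the KL term directly into the objective instead of clipping the ratio (this is the "KL" option mentioned in the summary at the end); one standard form, with $\beta$ adjusted to keep the measured KL near a target, is:

$$\mathcal{L}^{\text{KLPEN}}(\theta) = \mathbb{E}\big[r_t(\theta)\,A_t\big] - \beta\,\mathbb{E}\big[\mathrm{KL}\big(\pi_{\theta'}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\big)\big]$$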

The most common PPO objective is the clipped objective (the docs don’t expand the full formula, but the idea is the same):

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}\big[\min\big(r_t(\theta)\,A_t,\ \text{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\big]$$

Here $A_t$ is the advantage (usually from a critic or GAE).
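As a hedged PyTorch sketch of the clipped objective (all names are my own; the maximization objective is negated so a standard optimizer can minimize it):

```python
import torch

def ppo_clip_loss(new_logp: torch.Tensor,
                  old_logp: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate, returned with a minus sign so it can be minimized.

    new_logp / old_logp: log pi(a_t|s_t) under the current and the rollout policy.
    advantages: A_t estimates (e.g., from GAE), often normalized per batch.
    """
    ratio = torch.exp(new_logp - old_logp)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min() keeps the more pessimistic term, so once the ratio leaves
    # [1 - eps, 1 + eps] in the profitable direction, the extra gain is cut off.
    return -torch.min(unclipped, clipped).mean()
```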

PPO practical points (I think more important than formulas)#

1) PPO is still on-policy#

The docs emphasize: even though importance sampling is used, PPO usually only uses data from the previous policy iteration. So the behavior policy and target policy are very close and can be treated as effectively the same policy.

This is a constraint you must follow when coding:

  • the rollout buffer cannot accumulate data across many iterations the way a DQN replay buffer does;
  • you can reuse the batch for at most K epochs of mini-batch optimization, but it is always the same batch of rollouts (see the sketch after this list).
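A rough sketch of that update pattern; every name here (`collect_rollout`, `minibatches`, `policy`, `optimizer`, and the `ppo_clip_loss` from the sketch above) is a placeholder for illustration, not a specific library's API:

```python
def ppo_training_loop(policy, optimizer, collect_rollout, minibatches,
                      num_iterations: int, K: int = 4):
    """On-policy reuse pattern: collect, optimize for K epochs, then discard."""
    for _ in range(num_iterations):
        # Fresh trajectories from the CURRENT policy; log-probs recorded at
        # collection time become the frozen "old_logp" used in the ratio.
        batch = collect_rollout(policy)

        for _ in range(K):                    # reuse the SAME batch a few epochs...
            for mb in minibatches(batch):
                new_logp = policy.log_prob(mb["obs"], mb["actions"])
                loss = ppo_clip_loss(new_logp, mb["old_logp"], mb["advantages"])
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # ...then the batch is discarded: nothing is replayed across iterations,
        # unlike a DQN-style replay buffer.
```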

2) Advantage estimation determines stability#

PPO itself is just an "update constraint"; what really determines the learning signal quality is $A_t$.

Common engineering practice:

  • use a critic to estimate $V(s)$
  • use GAE($\lambda$) to compute advantages (see the sketch below)

[Image placeholder: GAE computation flowchart (TD residual decayed accumulation)]
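A minimal NumPy sketch of the GAE($\lambda$) backward pass (assuming one rollout of length T, a critic value array of length T+1 including the bootstrap value, and done flags; the names are mine):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma: float = 0.99, lam: float = 0.95):
    """Generalized Advantage Estimation via a decayed sum of TD residuals.

    rewards, dones : arrays of length T from one rollout
    values         : array of length T + 1 (V(s_0..T); last entry is the bootstrap)
    Recursion: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
               A_t     = delta_t + gamma * lam * A_{t+1}
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]          # cut the recursion at episode ends
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```

The critic's regression targets are then typically just `advantages + values[:-1]`.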

3) Two knobs people tune the most#

  • clip epsilon (e.g., 0.1 ~ 0.3)
  • entropy bonus coefficient (encourages exploration and avoids premature collapse); the sketch below shows how both knobs enter the total loss
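A hedged sketch of how those two knobs usually enter the total actor-critic loss (the coefficient values and all names here are illustrative, not prescriptions):

```python
import torch

def ppo_total_loss(new_logp, old_logp, advantages, returns, value_preds, entropy,
                   clip_eps: float = 0.2, vf_coef: float = 0.5, ent_coef: float = 0.01):
    """Clipped policy loss + weighted value loss - entropy bonus.

    clip_eps: smaller means more conservative policy updates (0.1 ~ 0.3 is typical).
    ent_coef: larger pushes harder toward exploration and delays collapse onto a
              near-deterministic policy.
    """
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = ((returns - value_preds) ** 2).mean()   # critic regression loss
    entropy_bonus = entropy.mean()                       # mean entropy of the action distribution
    return policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus
```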

Chapter summary: PPO’s contribution is “controllability”#

In my view, PPO’s most important contribution is not how “new” it is, but that it makes policy-gradient updates controllable:

  • you can constrain update magnitude via KL / clip;
  • you can reuse the same batch of data for multiple epochs to improve sample utilization;
  • it naturally fits the actor-critic architecture.

Next chapter we’ll return to the value-based camp: DQN. You’ll find it also does “stability engineering,” but with completely different tools: experience replay + target networks.

RL Notes (5): PPO
https://xiaohei-blog.vercel.app/en/blog/rl-learning-5
Author: 红鼻子小黑
Published on May 6, 2025