

Preface#
Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space; due to length constraints, the full notes for Chapter 12 live there. If you find any logical mistakes, please let me know in the comments—thank you!
Start#
Once DQN starts running, you’ll quickly run into an annoying kind of “stability pain”:
- reward goes up a bit and then drops back down;
- Q values keep getting larger, but performance doesn’t improve;
- training is very slow, like spinning in place.
When you look at papers and codebases, you’ll see a bunch of DQN variants: DDQN, Dueling, PER, NoisyNet…
I strongly recommend not treating them as "terms to memorize," but doing what the docs do: focus on the problem each one solves—every trick targets a specific pain point.
1) Double DQN: mitigate Q overestimation#
The docs point out the first issue directly: Q values are often overestimated.
The reason is intuitive: you use the same network to both select actions and estimate values, so noise gets amplified by the max operation.
Double DQN does this:
- use the online network to select the action: $a^* = \arg\max_{a} Q_{\text{online}}(s', a)$
- use the target network to evaluate it: $y = r + \gamma\, Q_{\text{target}}(s', a^*)$
This separates “select” and “evaluate,” and overestimation is noticeably reduced.
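To make the select/evaluate split concrete, here is a minimal PyTorch sketch of the Double DQN target computation (the function and variable names, such as double_dqn_target, online_net, and target_net, are placeholders of mine, not from the original post):

```python
import torch

# Minimal sketch of the Double DQN target, assuming online_net and target_net
# map a batch of states to per-action Q values (all names are placeholders).
def double_dqn_target(online_net, target_net, reward, next_state, done, gamma=0.99):
    with torch.no_grad():
        # The online network *selects* the greedy action...
        next_action = online_net(next_state).argmax(dim=1, keepdim=True)
        # ...and the target network *evaluates* that action.
        next_q = target_net(next_state).gather(1, next_action).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q
```

Compared with vanilla DQN, the only change is which network picks `next_action`; everything else in the training loop stays the same.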
2) Dueling DQN: spend representational power where it matters#
The docs describe the Dueling structure: instead of outputting $Q(s, a)$ directly, the network splits into two streams:
- $V(s)$: how valuable the state itself is
- $A(s, a)$: how much better an action is relative to others in that state
Then combine them into $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$.
Intuition: in many states, action differences are small while state quality differences are large; Dueling learns “state value” more efficiently.
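As an illustration, here is a minimal PyTorch sketch of a Dueling head (the class name DuelingQNet and the layer sizes are placeholders; this is one common way to wire the two streams, not the only one):

```python
import torch
import torch.nn as nn

# Minimal Dueling head: shared features, then separate V(s) and A(s, a) streams.
class DuelingQNet(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                 # V(s)
        self.advantage = nn.Linear(hidden, num_actions)   # A(s, a)

    def forward(self, state):
        h = self.feature(state)
        v = self.value(h)        # shape (B, 1)
        a = self.advantage(h)    # shape (B, num_actions)
        # Subtract the mean advantage so V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```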
3) Prioritized Experience Replay (PER): spend your learning budget on the most useful samples#
A uniform replay buffer samples evenly, but many samples carry little information.
PER’s core idea:
- samples with large TD error are more worth learning from
- learning more from them makes convergence faster
Of course it introduces bias, so importance-sampling weights are typically used to correct it.
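Here is a toy sketch of proportional prioritization with importance-sampling correction (a real implementation would use a sum-tree for efficiency; alpha and beta follow the usual PER hyperparameter names, and the class name SimplePER is a placeholder):

```python
import numpy as np

# Toy proportional PER: priority ~ |TD error|^alpha, with IS-weight correction.
class SimplePER:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        self.data.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)
        if len(self.data) > self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size, beta=0.4):
        probs = np.array(self.priorities) / sum(self.priorities)
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias from non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + 1e-6) ** self.alpha
```

In the training loop you would multiply each sample's loss by its importance-sampling weight, then call `update_priorities` with the new TD errors after the gradient step.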
4) NoisyNet: finer exploration than ε-greedy#
The docs mention noisy networks: add noise in parameter space to improve exploration.
Pros:
- exploration becomes state-dependent
- no need to manually design an ε decay schedule
Intuition: instead of “occasionally randomizing actions,” make the policy itself slightly stochastic.
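For intuition, here is a minimal sketch of a factorized-Gaussian noisy linear layer, the building block that NoisyNet swaps in for ordinary `nn.Linear` layers (the class name NoisyLinear and the sigma_init default loosely follow the NoisyNet paper; the rest is a simplified illustration):

```python
import math
import torch
import torch.nn as nn

# Noisy linear layer: weights are mu + sigma * noise, so exploration lives in
# parameter space and the amount of noise per weight is learned.
class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)
        nn.init.constant_(self.weight_sigma, sigma_init / math.sqrt(in_features))
        nn.init.constant_(self.bias_sigma, sigma_init / math.sqrt(in_features))

    @staticmethod
    def _scaled_noise(size):
        # Factorized noise transform: f(x) = sign(x) * sqrt(|x|)
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        if self.training:
            # Fresh noise each forward pass; the stochastic policy comes from here.
            eps_in = self._scaled_noise(self.in_features)
            eps_out = self._scaled_noise(self.out_features)
            weight = self.weight_mu + self.weight_sigma * eps_out.ger(eps_in)
            bias = self.bias_mu + self.bias_sigma * eps_out
        else:
            weight, bias = self.weight_mu, self.bias_mu
        return nn.functional.linear(x, weight, bias)
```

Replacing the last one or two `nn.Linear` layers of the Q-network with this is enough to drop ε-greedy entirely.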
Chapter summary: suggested order of tricks#
If you want to add these tricks gradually in a project, I recommend a rhythm of “plug the big leaks first, then chase the ceiling”: Double DQN is almost a must-have because the improvement on overestimation is often immediate; Dueling usually requires small changes and gives relatively stable gains, making it a good second step; PER can yield big gains but is more sensitive, so introduce it after the base version is stable; NoisyNet is more like an exploration upgrade, and is especially valuable on tasks where exploration is very hard.
Next chapter we’ll face a tougher problem: continuous actions. You’ll find DQN’s “max over actions” is almost unusable in continuous action spaces—so you either approximate it via sampling/optimization, or switch routes entirely (Actor-Critic).