

Preface#
Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space; due to length constraints, the Mubu notes for Chapter 12 live there. If you spot any logical mistakes, please let me know in the comments. Thank you!
Start#
Tabular Q-learning is beautiful, but it has a fatal premise: $Q(s,a)$ must fit in a table.
Once the state is an image, a continuous vector, or the combinatorics explode, tabular methods break down. DQN’s idea is simple:
Use a neural network $Q(s,a;\theta)$ to approximate $Q(s,a)$.
But when you actually implement it, you’ll find that directly applying the supervised-learning recipe becomes very unstable. The docs mention two key engineering components of DQN:
- target network
- experience replay
These two are, almost by themselves, the reason DQN runs at all.
What DQN is: deep Q-learning#
The definition in the docs is standard: DQN is Q-learning based on deep learning, combining value-function approximation with neural networks.
We’re still learning $Q(s,a)$, and we still choose actions greedily or via ε-greedy:
$$a_t = \arg\max_a Q(s_t, a; \theta)$$
The difference is that $Q$ is no longer a table, but a network $Q(s,a;\theta)$.
Why it becomes unstable: bootstrapping + non-i.i.d. samples + moving targets#
DQN’s training target is usually the TD target:
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$$
If you use the same network to simultaneously:
- predict $Q(s_t, a_t; \theta)$
- generate the target $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$
then the target will move as you train (a moving target), and training can easily diverge.
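To make the moving-target problem concrete, here is a minimal sketch (all numbers illustrative) of how a TD target is built from the same network’s own predictions:

```python
import numpy as np

gamma = 0.99  # discount factor (illustrative value)
r = 1.0       # reward observed after taking action a in state s

# Q-values for the next state s', produced by the SAME network we are
# training. Every gradient step changes these numbers, so the regression
# target y below shifts under our feet as training proceeds.
q_next = np.array([1.0, 2.5, 0.3])

y = r + gamma * q_next.max()  # TD target: r + gamma * max_a' Q(s', a')
```

Contrast this with supervised learning, where the label for a given input is fixed for the whole run.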
Target Network#
The docs mention using a target network as the first stability patch:
- online network: $Q(s,a;\theta)$
- target network: $Q(s,a;\theta^-)$ (parameters copied with a delay)
Common practice:
- every $N$ steps, copy $\theta$ into $\theta^-$
- or do soft updates $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$ (smoother)
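Both schemes are a couple of lines. A sketch with the parameters represented as plain numpy arrays (the values and $\tau$ are illustrative):

```python
import numpy as np

def hard_update(theta):
    """Hard update: every N steps, overwrite theta^- with a copy of theta."""
    return theta.copy()

def soft_update(theta_target, theta, tau=0.005):
    """Soft (Polyak) update: theta^- <- tau*theta + (1 - tau)*theta^-."""
    return tau * theta + (1 - tau) * theta_target

theta = np.array([0.2, -0.5, 1.0])    # online parameters
theta_target = np.zeros_like(theta)   # target parameters

# each call only drifts theta^- a small step toward theta
theta_target = soft_update(theta_target, theta, tau=0.1)
```

The soft variant trades the periodic "jump" of a hard copy for a target that is always slightly stale but never changes abruptly.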
Experience Replay (Replay Buffer)#
The docs mention experience replay. It addresses the fact that samples are temporally correlated; direct online updates can bias gradients.
The replay buffer helps in two ways:
On one hand, randomly sampling mini-batches breaks temporal correlation, so gradients look more like “approximately i.i.d.” supervised learning. On the other hand, the same experiences get reused multiple times, improving sample efficiency, which is especially critical when environment interaction is expensive.
Engineering experience:
- a buffer that’s too small overfits to recent experience
- a buffer that’s too large becomes “too offline,” and learning slows down
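A replay buffer is little more than a ring buffer plus uniform sampling. A minimal sketch (the interface names here are my own, not from the source):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation of the trajectory
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(list, zip(*batch))
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```

The `maxlen` eviction is exactly the size trade-off above: it decides how far into the past the agent keeps learning from.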
DQN training loop (pseudo-code)#
You can view it as a loop of three things: sample, store, learn.
- choose actions from $Q(s,a;\theta)$ using ε-greedy
- interact with the environment to get a transition $(s_t, a_t, r_t, s_{t+1})$
- store $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer
- sample a batch from the buffer, construct TD targets, and regress with MSE
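The steps above can be sketched as one self-contained loop. To keep it runnable without a deep-learning framework, the “network” below is a linear Q-function over one-hot states (which degenerates to a table) and the environment is a toy 4-state chain; both are illustrative placeholders, and a real implementation would also mask the bootstrap term on terminal transitions with a `done` flag:

```python
import random
from collections import deque

import numpy as np

n_states, n_actions = 4, 2
gamma, alpha, eps = 0.99, 0.1, 0.3   # illustrative hyperparameters
rng = np.random.default_rng(0)

W = np.zeros((n_states, n_actions))  # online "network" theta (one-hot linear)
W_target = W.copy()                  # target network theta^-
buffer = deque(maxlen=1000)          # replay buffer

def env_step(s, a):
    """Toy chain: action 1 moves right; reward 1 on reaching the last state."""
    s_next = min(s + a, n_states - 1)
    return s_next, 1.0 if s_next == n_states - 1 else 0.0

s = 0
for t in range(2000):
    # 1) epsilon-greedy action from the online network
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(W[s].argmax())
    # 2) interact, 3) store the transition
    s_next, r = env_step(s, a)
    buffer.append((s, a, r, s_next))
    s = 0 if s_next == n_states - 1 else s_next  # reset at the terminal state
    # 4) sample a batch, build TD targets from theta^-, regress toward them
    if len(buffer) >= 32:
        for bs, ba, br, bs_next in random.sample(buffer, 32):
            y = br + gamma * W_target[bs_next].max()  # TD target
            W[bs, ba] += alpha * (y - W[bs, ba])      # gradient step on MSE
    # 5) hard target-network update every 100 steps
    if t % 100 == 0:
        W_target = W.copy()
```

After training, the greedy policy should walk right toward the rewarding state; the structure (sample, store, learn, sync target) is the part that carries over to a real DQN.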
Chapter summary: DQN is the beginning of “stability engineering”#
From this chapter on, you’ll notice that many innovations in deep RL are essentially addressing the same problem: training instability.
Next chapter we continue down the DQN line: Double / Dueling / PER / NoisyNet… each variant targets a specific pain point.