

RL Practice (2): DQN and Improvements
From Q-table to deep reinforcement learning
Preface#
In Part 1, we settled on an "episode training skeleton" and used Q-Learning/Sarsa to validate that a lot of engineering structure can be reused across algorithms.
This post enters deep reinforcement learning: DQN (Deep Q-Network). Its essence is straightforward:
- Use a neural network to approximate $Q(s, a)$, replacing the old Q-table
- To make training stable and improve sample efficiency, introduce two key pieces of equipment:
  - Replay Buffer (Experience Replay)
  - Target Network
Below, I’ll connect them along the “evolution path” I personally follow in experiments: first get the most basic DQN running, then replace modules one by one, and finally obtain a more stable, less headache-inducing version (Double / Dueling / Noisy / PER). You’ll find these improvements don’t completely change DQN; most of them just add a small shim at a key step: make targets more reliable, make exploration more natural, or make replay “remember what matters”.
My real understanding of DQN started from a failure: the code ran, the loss decreased, but reward didn’t move at all; sometimes reward would rise a bit and then suddenly collapse. Later I realized DQN’s difficulty is never “can you write backprop”, but “is your training signal actually stable”. If the target network isn’t updated properly, replay sampling doesn’t break correlations, or reward scale doesn’t match learning rate, the gradient you get is like groping in noise—drifting further off the path.
So this post tries to explain it in engineering language: why DQN needs Replay, why it needs Target, what the update target is actually doing, and what iteration order you can follow to gradually refine a “runnable DQN” into a “stable DQN”.
DQN: What Changed Compared to Q-Learning?#
If Q-Learning is “walking while filling a Q table”, then DQN is “I don’t fill the table; I train a function to fit the table”. To make this function (a neural network) train stably, I usually remember DQN’s changes as three things:
- Replace the table with a network: $Q(s, a; \theta)$, input is state, output is Q-values for all actions.
- Replay Buffer: store interaction data; during updates randomly sample a batch to improve sample efficiency and break correlations.
- Policy network + target network: update the policy network $Q(s, a; \theta)$ online, and use the target network $Q(s, a; \theta^-)$ to compute targets; periodically copy parameters $\theta \to \theta^-$ to stabilize training.
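To make the first change concrete, here is a minimal sketch of such a network in PyTorch (the class name and layer sizes are illustrative, not from a specific codebase):

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Minimal MLP: state vector in, one Q-value per action out."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```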
A Reusable DQN Agent Interface#
Same as Part 1:
- sample(state): for training (usually epsilon-greedy)
- predict(state): for evaluation (argmax)
- update(): sample a batch from the Replay Buffer to update the network
Note: starting from DQN, update() often no longer takes a single transition. Instead:
- push(transition) into replay
- then, at suitable times (buffer is large enough, interval reached), call update()
I strongly recommend separating “push interaction data” and “update if conditions are met”. Because later, whether you add PER (sampling changes), Noisy (exploration changes), or Double (target algorithm changes), the main training loop doesn’t need major changes; modifications stay inside the buffer or update logic.
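A sketch of what that separation looks like in the main loop, assuming a gymnasium-style env and the interfaces above (min_buffer_size and update_interval are illustrative names):

```python
def train_loop(env, agent, buffer, max_steps: int,
               min_buffer_size: int = 1000, update_interval: int = 4):
    """Skeleton: interaction always pushes; updates fire only when conditions hold."""
    state, _ = env.reset()
    for step in range(max_steps):
        action = agent.sample(state)  # exploratory action (epsilon-greedy / noisy)
        next_state, reward, terminated, truncated, _ = env.step(action)
        buffer.push((state, action, reward, next_state, terminated))
        # Update only after the buffer warms up, and only every few steps.
        if len(buffer) >= min_buffer_size and step % update_interval == 0:
            agent.update()
        state = next_state
        if terminated or truncated:
            state, _ = env.reset()
```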
Replay Buffer: push + sample Is Enough#
I usually compress Experience Replay into two methods:
- push: store transitions in order; when full, drop the oldest
- sample: randomly sample a batch
A minimal structure is typically:
```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int):
        # deque(maxlen=...) drops the oldest transition automatically once full
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        # real code will unzip + convert to tensors here
        return batch

    def __len__(self):
        return len(self.buffer)
```
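The "unzip + convert to tensors" step the comment points at usually looks something like this (a sketch assuming each transition is a (state, action, reward, next_state, done) tuple):

```python
import numpy as np
import torch

def to_tensors(batch):
    # Transpose a list of transitions into per-field sequences, then tensors.
    states, actions, rewards, next_states, dones = zip(*batch)
    return (
        torch.as_tensor(np.array(states), dtype=torch.float32),
        torch.as_tensor(actions, dtype=torch.int64),
        torch.as_tensor(rewards, dtype=torch.float32),
        torch.as_tensor(np.array(next_states), dtype=torch.float32),
        torch.as_tensor(dones, dtype=torch.float32),  # 1.0 where terminated
    )
```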
DQN Update: Where Does the Loss Come From?#
When I implemented DQN for the first time, the most confusing part wasn’t backprop, but “how exactly should the target be computed”. Once you understand it, the update becomes mechanical: compute a target from a batch, then make the network outputs match that target.
In the most basic DQN, the loss is the mean squared error between the "expected value" $y$ (the target) and the "actual value" $Q(s, a; \theta)$.
target (Expected Value)#
Basic DQN commonly uses:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

And you must handle terminal states: if terminated==True, there is no next state, so set $y = r$.
loss (Mean Squared Error)#
Then as usual: define an optimizer, call loss.backward(), then optimizer.step().
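Putting the target and the loss together, a minimal update step could look like this sketch (policy_net, target_net, and the batch layout follow the to_tensors sketch above; all names are my own):

```python
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma: float = 0.99):
    states, actions, rewards, next_states, dones = batch  # tensors, see to_tensors
    # Q(s, a; theta): pick out the Q-value of the action actually taken.
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y = r + gamma * max_a' Q(s', a'; theta^-), with no bootstrap at terminals.
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```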
Dueling DQN: Split the Q Network into V + A#
In some environments (especially when states are complex, but action effects aren't obvious in some phases), I find it hard to learn "which action is better", while "how valuable this state is overall" matters more. Dueling's intuition is to learn these separately: learn the state value $V(s)$ first, then learn the relative advantage $A(s, a)$.
- Value: estimate state value $V(s)$
- Advantage: estimate relative advantage $A(s, a)$ for each action
Then combine them into $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$, where subtracting the mean advantage keeps the V/A decomposition identifiable.
In engineering practice, you only need to care about two things:
- The network forward output changes from “directly output Q” to “output V and A, then compose Q”
- Everything else (Replay, target, update) is basically unchanged
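A sketch of that forward change (layer names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.value = nn.Linear(hidden_dim, 1)               # V(s)
        self.advantage = nn.Linear(hidden_dim, action_dim)  # A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.feature(state)
        v, a = self.value(h), self.advantage(h)
        # Q = V + A - mean(A): the mean subtraction keeps V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```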
Double DQN: Mitigate Overestimation#
A classic issue of basic DQN is overestimation: because we use the same network (or the same estimation process) to both “choose the max action” and “evaluate the value of that max action”, it’s easy to treat noise as truth.
The key change of Double DQN is:
- Use the policy network to choose the action (argmax)
- Use the target network to evaluate the value of that action
Engineering translation: only change how the target is computed; everything else stays.
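In code, only the target computation changes; a sketch reusing the names from the update above:

```python
import torch

def double_dqn_target(policy_net, target_net, rewards, next_states, dones,
                      gamma: float = 0.99):
    with torch.no_grad():
        # 1) Choose the greedy action with the policy network...
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        # 2) ...but evaluate that action with the target network.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1.0 - dones)
```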
Noisy DQN: Use Learnable Noise for Exploration#
Noisy DQN focuses on the “model definition”:
- Introduce mu/sigma parameters into Linear layers
- Inject noise on each forward pass (during training) and support resetting the noise
Its advantage is that in many tasks, it’s less troublesome than manually tuning epsilon.
You can understand it as:
Instead of injecting randomness from an external policy (epsilon-greedy), let the network learn when it should be more “jittery”.
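For reference, a sketch of a noisy linear layer with independent Gaussian noise (sigma_init and the initialization scheme here are assumptions, not tuned values):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights are mu + sigma * eps, with mu/sigma learned
    and eps resampled via reset_noise(). A sketch of the independent-noise variant."""
    def __init__(self, in_features: int, out_features: int, sigma_init: float = 0.017):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))
        # eps buffers are noise, not parameters: resampled, never trained.
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)
        self.reset_noise()

    def reset_noise(self):
        self.weight_eps.normal_()
        self.bias_eps.normal_()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:  # noisy weights during training...
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:              # ...mean weights during evaluation
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)
```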
PER-DQN: Prioritized Experience Replay (SumTree)#
I usually break PER’s engineering essentials into two parts:
- SumTree: manage sample priorities in $O(\log n)$ and sample by priority
- Importance Sampling: use weights to correct the bias introduced by non-uniform sampling
In implementation, you can split into three classes:
- SumTree: maintain the binary tree and priority updates
- ReplayTree: a replay buffer built on SumTree (push, sample, batch_update)
- PERDQNAgent: after update, write TD-error back to replay to update priorities
If you want a “minimal usable” PER first, my advice is: don’t overcomplicate it from the start. Get SumTree right and get the priority write-back logic in batch_update right, and you’ll already see clear gains. Then add importance-sampling weights and beta annealing later for a smoother path.
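Here is a minimal SumTree sketch (assuming capacity is a power of two so the tree stays complete; a real buffer also stores the transition alongside each leaf):

```python
import numpy as np

class SumTree:
    """Array-backed binary tree: leaves hold priorities, internal nodes hold sums."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes + leaves
        self.write = 0                          # next leaf slot (circular)

    def update(self, leaf_idx: int, priority: float):
        tree_idx = leaf_idx + self.capacity - 1
        change = priority - self.tree[tree_idx]
        self.tree[tree_idx] = priority
        while tree_idx > 0:  # propagate the change up to the root: O(log n)
            tree_idx = (tree_idx - 1) // 2
            self.tree[tree_idx] += change

    def add(self, priority: float):
        self.update(self.write, priority)
        self.write = (self.write + 1) % self.capacity

    def get(self, s: float) -> int:
        """s in [0, total) picks a leaf with probability proportional to its priority."""
        idx = 0
        while idx < self.capacity - 1:  # descend until a leaf is reached
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx - (self.capacity - 1)  # tree index -> leaf index

    @property
    def total(self) -> float:
        return self.tree[0]
```

Sampling a batch then just draws batch_size values uniformly from [0, total) and calls get on each; batch_update later writes the new TD-error-based priorities back through update.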
Summary#
This post connects the “program structure” of the DQN family:
- DQN’s stability comes from Replay + Target
- Double/Dueling change the target computation or the network structure
- Noisy changes how exploration is implemented
- PER changes the sampling distribution of replay
In the next post I'll move to policy gradients and Actor-Critic: REINFORCE, PPO, A2C. You'll find their core interface can still be reused; only the optimized object inside update() changes from $Q(s, a)$ to the policy $\pi(a|s)$.
My Usual DQN Hyperparameter/Debugging Checklist (In Order)#
If your DQN “doesn’t converge / is unstable / Q-values explode”, I usually check in this order, which rarely leads you astray:
- Check reward scale first, then set learning rate: if reward is in $[-1, 1]$, you can use a larger learning rate; if rewards are in the tens or hundreds, consider reward clipping or lowering the learning rate by two orders of magnitude.
- Print Q-value ranges: every N episodes, print q_mean/q_max (batch-based); a minimal logging sketch follows the list. If Q-values jump from tens to tens of thousands, it's usually a target computation / terminal handling / learning rate issue.
- Ensure replay is "large enough + random sampling": updating from a buffer that is too small makes training very unstable; I usually set a min_buffer_size and, before it's reached, only push and don't update.
- Don't update the target network too aggressively: syncing too often makes the target useless; syncing too slowly makes learning sluggish. A common starting point is syncing every 500~2000 env steps (varies by environment).
- Don't decay epsilon too fast: if reward fluctuates early and quickly stalls, suspect insufficient exploration first; it's better to decay slowly than to be "confidently random".
- Batch size, gamma, and update frequency should match: too small a batch makes the gradient noisy; too large a gamma without enough episode length/steps can prevent learning.
- Run the base version before fancy add-ons: Double/Dueling/Noisy/PER are improvements, but if you haven't confirmed "target correct + done handling correct + replay normal", adding them usually makes things messier.
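For the Q-value check, the logging can be as small as this (a sketch; step, log_interval, and q_values are whatever names your update loop already has):

```python
# Illustrative: drop into the update step to watch the Q-value scale over training.
if step % log_interval == 0:
    print(f"step={step}  q_mean={q_values.mean().item():.2f}  "
          f"q_max={q_values.max().item():.2f}")
```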