Xiaohei's Blog

Preface#

Once you get to continuous control (continuous action space), you’ll find that the DQN approach of “outputting Q values for discrete actions” is no longer that convenient.

The biggest feeling I had the first time I worked on continuous-control tasks was: actions are no longer “choose A or B”, but “output a continuous vector”. At that point, using DQN to discretize actions is often neither elegant nor efficient. A smoother route is Actor-Critic: the Actor directly produces continuous actions, and the Critic evaluates whether this action is worthwhile under the current state.

My own “crash scene” back then was also typical: I discretized actions into 11 bins, 21 bins—it looked fine-grained, but the policy still behaved like it was jittering its leg. Either the bins weren’t fine enough so control stayed coarse, or the discretization got too fine, which blew up the action dimension and made learning both slow and unstable. Worse, reward curve noise got mixed with environment randomness, and you couldn’t tell whether it was an exploration issue or a value-estimation issue. So I switched fully to Actor-Critic: actions come straight out of the network; exploration is either done via noise (DDPG/TD3) or via entropy regularization (SAC). The whole picture becomes much clearer.

In this post, following my own practice order, I’ll connect the three most commonly used off-policy continuous-control algorithms: DDPG is the most basic deterministic Actor-Critic; TD3 mainly fixes DDPG’s tendency to overestimate and to oscillate; SAC introduces “entropy” directly into the objective, making exploration more natural and training more stable.

I’ll organize it by “what modules you need to implement when writing code”.

DDPG: The Minimal Engineering Structure of Deterministic Actor-Critic#

DDPG is very "engineering-minded": the Actor outputs an action a, and the Critic evaluates the Q value of (s, a). Here is one key point I think you must remember: the Critic's input is the concatenation of state + action, not state alone.

Network Structure#

  • Actor (policy network): input state, output continuous action; the output layer often uses tanh to constrain actions to [-1, 1] (then map to the environment's action range)
  • Critic (value network): input the concatenated vector [state, action], output Q(s, a)

When implementing, I also pay attention to a small detail that’s easy to overlook: don’t initialize the output layers of Actor/Critic too “wild”. If initial action magnitudes are huge, or the Critic’s initial Q values are abnormally large, training often diverges within the first few thousand steps. Many implementations initialize the last layer weights in a smaller range for exactly this reason.
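To make this concrete, here's a minimal PyTorch sketch of the two networks. The hidden size of 256 and the 3e-3 init range for the last layers are just common defaults I'm assuming, not the only valid choice:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: state -> action in [-1, 1] (rescale to the env range outside)."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.out = nn.Linear(hidden_dim, action_dim)
        # keep the output layer small so initial actions stay near zero
        nn.init.uniform_(self.out.weight, -3e-3, 3e-3)
        nn.init.uniform_(self.out.bias, -3e-3, 3e-3)

    def forward(self, state):
        return torch.tanh(self.out(self.net(state)))

class Critic(nn.Module):
    """Q(s, a): the input is the concatenation of state and action."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.out = nn.Linear(hidden_dim, 1)
        # same idea: avoid abnormally large initial Q values
        nn.init.uniform_(self.out.weight, -3e-3, 3e-3)
        nn.init.uniform_(self.out.bias, -3e-3, 3e-3)

    def forward(self, state, action):
        return self.out(self.net(torch.cat([state, action], dim=-1)))
```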

Replay Buffer#

Similar to DQN (push + sample), but the action in the transition is a continuous vector.
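A minimal buffer sketch, assuming transitions are stored as plain NumPy arrays; the only real difference from the DQN version is that action is a vector instead of an integer index:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # action is a continuous vector (np.ndarray), not a discrete index
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```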

Exploration Noise#

DDPG is a deterministic policy, so exploration usually relies on adding noise to actions:

  • action = actor(state) + noise

My habit is: add noise during training to encourage exploration; turn noise off completely during testing, and only evaluate the learned deterministic policy’s return. Otherwise you’ll get a very “mystical” phenomenon: the test behavior still looks random, and you can’t tell whether the model has actually learned anything.
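A sketch of that split, assuming the Actor class from above and actions already living in [-1, 1]; the noise scale of 0.1 is just a common starting point, not a prescription:

```python
import numpy as np
import torch

def sample(actor, state, noise_std=0.1):
    """Training-time interaction: deterministic action + Gaussian noise, clipped to [-1, 1]."""
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    action = action + np.random.normal(0.0, noise_std, size=action.shape)
    return np.clip(action, -1.0, 1.0)

def predict(actor, state):
    """Test-time evaluation: no noise, just the deterministic policy."""
    with torch.no_grad():
        return actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
```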

update(): Critic first, then Actor, then soft-update targets#

In DDPG, I usually hardcode a fixed order for update() (a minimal sketch follows the list):

  1. Sample a batch of transitions from replay
  2. Compute the Critic target via the Bellman equation
  3. Update the Critic (minimize TD error)
  4. Update the Actor (maximize the Critic's Q value for the current Actor's actions)
  5. Update each target network’s parameters (soft update)
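Here's what that order might look like in code, assuming the batch has already been converted to torch tensors with rewards and dones shaped (batch_size, 1); the gamma and tau defaults are just common values:

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    states, actions, rewards, next_states, dones = batch

    # 1-2. Critic target via the Bellman equation (no gradients through the targets)
    with torch.no_grad():
        next_actions = actor_target(next_states)
        target_q = rewards + gamma * (1 - dones) * critic_target(next_states, next_actions)

    # 3. Update the Critic: minimize TD error
    critic_loss = F.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # 4. Update the Actor: maximize Q(s, actor(s)) by minimizing its negative
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # 5. Soft-update both target networks
    with torch.no_grad():
        for target, online in ((actor_target, actor), (critic_target, critic)):
            for tp, p in zip(target.parameters(), online.parameters()):
                tp.mul_(1 - tau).add_(tau * p)
```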

TD3: Twin Delayed DDPG (Fixing Overestimation and Instability)#

After getting DDPG running on more complex continuous-control tasks, very common pain points are: the return curve looks good for a while and then suddenly collapses; or the Critic’s Q values are overly optimistic, which misleads the Actor. TD3 was basically introduced to solve these issues. It brings the Double Q-learning idea of “be conservative by taking the smaller estimate” into DDPG.

From an engineering perspective, TD3’s most noticeable differences are mainly three things:

  1. Twin Critics: two critics Q1 and Q2; when computing targets, take the smaller one to reduce overestimation
  2. Delayed Actor Updates: update the Critic more frequently, and update the Actor less often (delayed policy update)
  3. (Many implementations also add target policy smoothing: add a small noise to the target action)

In practice, as long as you implement “twin critics” correctly and “delayed actor updates” correctly, you’ll see stability improve significantly. Other tricks (like target policy smoothing) can be added later as icing on the cake.
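For reference, a sketch of how the TD3 target computation typically looks with those changes; policy_noise = 0.2, noise_clip = 0.5, and the policy_delay idea in the trailing comment are common defaults I'm assuming, not prescribed values:

```python
import torch

def td3_targets(batch, actor_target, critic1_target, critic2_target,
                gamma=0.99, policy_noise=0.2, noise_clip=0.5):
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        # target policy smoothing: small clipped noise on the target action
        noise = (torch.randn_like(actions) * policy_noise).clamp(-noise_clip, noise_clip)
        next_actions = (actor_target(next_states) + noise).clamp(-1.0, 1.0)
        # twin critics: take the smaller estimate to reduce overestimation
        q1 = critic1_target(next_states, next_actions)
        q2 = critic2_target(next_states, next_actions)
        target_q = rewards + gamma * (1 - dones) * torch.min(q1, q2)
    return target_q

# Delayed actor updates: only touch the actor (and the soft updates) every few critic steps,
# e.g. `if step % policy_delay == 0: ...`, with policy_delay = 2 as a common choice.
```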

SAC: Soft Actor-Critic (Entropy Regularization Makes Exploration “Softer”)#

SAC is my favorite in continuous control, because it handles “exploration” more naturally: instead of forcing randomness by external noise, it directly encourages the policy to maintain some stochasticity in the objective (maximize entropy). In many cases this makes training more stable and less likely to get stuck in bad local optima.

Engineering-wise, SAC still boils down to three blocks: networks, replay, update. A common split is ValueNet / QNet / PolicyNet (some implementations omit the ValueNet and use twin Q networks with target Q networks instead, but the main idea stays the same).

You can choose either network decomposition; what matters is: in the update target you must correctly incorporate the entropy regularization term, and when updating the policy you must be able to get the action’s log_prob.
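As a sketch of what "incorporate the entropy term in the target" and "need log_prob for the policy update" mean, here's a twin-Q-style version (no separate ValueNet). It assumes a policy object with an evaluate() method returning (action, log_prob), which is exactly the method discussed in the subsection after next:

```python
import torch

def sac_critic_target(batch, policy, q1_target, q2_target, alpha, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        next_actions, next_log_prob = policy.evaluate(next_states)
        q_next = torch.min(q1_target(next_states, next_actions),
                           q2_target(next_states, next_actions))
        # the entropy term enters the target as -alpha * log_prob
        return rewards + gamma * (1 - dones) * (q_next - alpha * next_log_prob)

def sac_policy_loss(states, policy, q1, q2, alpha):
    actions, log_prob = policy.evaluate(states)  # reparameterized sample + its log_prob
    q = torch.min(q1(states, actions), q2(states, actions))
    # maximize Q + entropy  <=>  minimize alpha * log_prob - Q
    return (alpha * log_prob - q).mean()
```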

SAC Intuition#

If you understand PPO as “action probability distribution + on-policy constraints”, then SAC is more like:

In off-policy Q learning, encourage the policy to stay sufficiently stochastic (maximize entropy) so exploration is more stable.

Engineering Difference: Action Sampling in PolicyNet#

When implementing SAC, I explicitly split action-related functions in the policy network into two types:

  • evaluate() for training: returns the sampled action and its log_prob (the entropy term needs it)
  • get_action() for interaction/testing: given a state, outputs an action (at test time you usually use a more deterministic variant, like taking the mean or turning off randomness)

You can think of them as the continuous-action version of sample/predict in the earlier posts: training needs probability information; testing only needs the behavior itself.
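Here's a sketch of such a policy network, assuming tanh-squashed Gaussian actions; the log-std clamp range and the 1e-6 constant in the squashing correction are common conventions, not requirements:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        h = self.net(state)
        return self.mean(h), self.log_std(h).clamp(-20, 2)

    def evaluate(self, state):
        """Training: reparameterized sample + log_prob (needed for the entropy term)."""
        mean, log_std = self(state)
        dist = Normal(mean, log_std.exp())
        u = dist.rsample()                      # reparameterization trick
        action = torch.tanh(u)                  # squash into [-1, 1]
        # correct the log_prob for the tanh squashing
        log_prob = dist.log_prob(u) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(dim=-1, keepdim=True)

    @torch.no_grad()
    def get_action(self, state, deterministic=True):
        """Interaction/testing: take the mean (or a sample) and squash."""
        mean, log_std = self(state)
        u = mean if deterministic else Normal(mean, log_std.exp()).sample()
        return torch.tanh(u)
```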

Summary#

Compressing these three algorithms into one engineering statement:

DDPG/TD3/SAC are all "Actor produces actions, Critic evaluates (s, a), Replay enables off-policy"; the differences are mainly in how targets are computed in update(), whether twin critics are used, whether actor updates are delayed, and whether entropy regularization is introduced.

If you later want to implement these algorithms under a unified framework, I recommend keeping the same interface (sketched after the list):

  • sample(state): training interaction (DDPG/TD3 add noise, SAC samples from the distribution)
  • predict(state): testing (no noise / mean action)
  • update(): sample from replay and update networks
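One possible shape for that shared interface, just an abstract-base-class sketch whose names mirror the list above:

```python
from abc import ABC, abstractmethod

class OffPolicyAgent(ABC):
    """Shared interface so DDPG/TD3/SAC can be dropped into the same training loop."""

    @abstractmethod
    def sample(self, state):
        """Training interaction: DDPG/TD3 add noise, SAC samples from the distribution."""

    @abstractmethod
    def predict(self, state):
        """Testing: no noise / mean action."""

    @abstractmethod
    def update(self):
        """Sample a batch from replay and update the networks."""
```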

Later I may continue this series into a “general engineering template repo”: unified logging, unified evaluation, unified save/load, and configuration management.

The Pitfalls I Most Often Hit in Continuous Control & Debugging Order#

Continuous-control training can “blow up” faster than discrete-action settings, especially for deterministic methods like DDPG/TD3—a slight scale mismatch can lead to immediate divergence. I usually debug in the following order:

  1. Is action scaling correct? Actor outputs are often tanh in [-1, 1], while the environment action space may have a different range. If this mapping is wrong, training behaves very strangely (as if it can never learn a certain action direction); a small rescaling helper is sketched after this list.
  2. Noise: not too big, not too small: too big and actions fly everywhere, filling the buffer with bad data; too small and there’s no exploration. I suggest printing the noise itself and checking its magnitude relative to the action range.
  3. First watch whether Q values explode: periodically print mean/max of Q1/Q2; if Q values spike, suspect learning rate, reward scale, done handling, and target computation.
  4. Be conservative with target soft-update (tau): tau too large makes the target chase the online network and hurts stability; tau too small makes learning very slow. Start with conservative settings to get it running, then fine-tune.
  5. TD3 twin-critic min target must be correct: this is the core of reducing overestimation. If you mistakenly take max, or use the wrong critic for target, it can immediately degrade/diverge.
  6. For SAC, focus on entropy / alpha: if entropy weight is off, you’ll see “actions too random” or “become deterministic too early”. If your implementation supports automatic alpha tuning, it’s usually less hassle.
  7. Validate the closed loop with short horizons first: shorten episode length, simplify rewards, or even disable some randomness, confirm updates don’t blow up, then gradually restore real settings.
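The rescaling helper mentioned in item 1 might look like this, assuming the actor outputs live in [-1, 1] and the environment exposes action_space.low/high (Gym-style Box spaces):

```python
import numpy as np

def rescale_action(a, low, high):
    """Map an action from the actor's tanh range [-1, 1] to the env's [low, high]."""
    return low + 0.5 * (a + 1.0) * (high - low)

# e.g. env_action = rescale_action(actor_output, env.action_space.low, env.action_space.high)
```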
RL Practice (4): Continuous Control (DDPG/TD3/SAC)
https://xiaohei-blog.vercel.app/en/blog/rl-algorithm-4
Author 红鼻子小黑
Published at May 1, 2025