Xiaohei's Blog

Preface#

When I run reinforcement learning experiments, the most painful part is not deriving equations, but this: the same algorithm in a different environment often forces you to rewrite a lot of training scaffolding. Later I simply summarized the common patterns into a “reusable algorithm skeleton”, and then filled in the differences of each algorithm (which are almost always concentrated in sample and update).

More specifically, most of the pitfalls I’ve hit came from “engineering details being ignored”: sometimes the formula isn’t wrong, but your training loop didn’t handle terminated/truncated properly, epsilon decayed too fast and converged prematurely, or you didn’t cap the max steps per episode and your training statistics became incomparable. In the end, you can only stare at a messy reward curve, with no idea whether the problem is the environment or the code.

So in this post I will deliberately make the “program skeleton” more rigid: what functions you should have, how to arrange train/eval loops, what to print at each step, so that when you write the second and third algorithms you won’t have to build scaffolding from scratch.

This is the first article in the series, with a very clear goal:

  • Provide a reusable RL project code structure (train / test / env / configs)
  • Implement the classic Q-Learning (off-policy)
  • And Sarsa (on-policy), which differs from it in one key place

In later posts, I’ll continue by splitting into “value-function family (DQN family)” and “policy gradient / Actor-Critic family”:

  • Part 2: DQN / Double DQN / Dueling DQN / Noisy DQN / PER-DQN
  • Part 3: Policy Gradient (REINFORCE) / PPO / A2C
  • Part 4: DDPG / TD3 / SAC (the three classics for continuous control)

A General RL Code Skeleton#

Whether it’s tabular methods (Q-table) or deep RL (DQN and variants), I like to abstract an agent into four core operations:

  • sample(state): sample an action during training (with exploration)
  • predict(state): output an action during evaluation (no exploration, exploitation)
  • update(transition): update the policy using interaction data
  • save()/load(): save and load (optional, but strongly recommended)

The most important reminder is: across algorithms, the implementations of sample and update vary a lot; other parts (training loop, logging, save/load) are usually similar.
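
To make this concrete, here is a minimal sketch of the interface I keep reusing throughout the series. The BaseAgent name and the exact signatures are just my own convention, not from any library:

from abc import ABC, abstractmethod


class BaseAgent(ABC):
	"""Minimal agent interface: across algorithms, only sample/update change much."""

	@abstractmethod
	def sample(self, state):
		"""Pick an action during training (with exploration)."""

	@abstractmethod
	def predict(self, state):
		"""Pick an action during evaluation (pure exploitation)."""

	@abstractmethod
	def update(self, *transition):
		"""Update the policy from one piece of interaction data."""

	def save(self, path: str):
		raise NotImplementedError  # optional, but strongly recommended

	def load(self, path: str):
		raise NotImplementedError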

Training Loop (Define Training)#

An “episode-based” training loop is basically the following sequence:

  1. Episode start: state, info = env.reset()
  2. Set a per-episode max step limit max_steps (helps convergence and avoids never-ending episodes)
  3. Interact in a loop until termination:
    • action = agent.sample(state) (exploration policy)
    • next_state, reward, terminated, truncated, info = env.step(action)
    • Build transition (optionally store in memory)
    • agent.update(transition)
    • state = next_state
    • If terminated or truncated, end the episode

Here is the template as deliberately rigid code (I recommend copying it as a project template):

def train(env, agent, num_episodes: int, max_steps: int):
	for ep in range(num_episodes):
		state, info = env.reset()
		ep_return = 0.0

		for t in range(max_steps):
			action = agent.sample(state)
			next_state, reward, terminated, truncated, info = env.step(action)

			agent.update(state, action, reward, next_state, terminated)

			ep_return += reward
			state = next_state
			if terminated or truncated:
				break

		print(f"episode={ep} return={ep_return:.1f} eps={agent.epsilon:.3f}")

Evaluation Loop (Define Testing)#

Evaluation looks very similar to training, but two things must change:

  1. No updates: evaluation is only for measuring performance
  2. No sampling: use predict for pure exploitation

def evaluate(env, agent, num_episodes: int, max_steps: int):
	returns = []
	for ep in range(num_episodes):
		state, info = env.reset()
		ep_return = 0.0
		for t in range(max_steps):
			action = agent.predict(state)
			next_state, reward, terminated, truncated, info = env.step(action)
			ep_return += reward
			state = next_state
			if terminated or truncated:
				break
		returns.append(ep_return)
		print(f"[eval] episode={ep} return={ep_return:.1f}")
	return sum(returns) / len(returns)

Environment: Gym Is Enough (For Custom Envs, Only Care About reset/step)#

Most of the time I won’t build my own environment: Gym/Gymnasium is enough.

If you must write a custom environment, the key is aligning these two interfaces:

  • reset(): start an episode and return the initial state (plus an info dict in the Gymnasium API)
  • step(action): take an action and return (next_state, reward, terminated, truncated, info)
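
For reference, here is a minimal sketch of that contract. The CoinFlipEnv below is a made-up toy (it is not registered with Gym); it only exists to show the shapes of reset() and step():

import random


class CoinFlipEnv:
	"""Hypothetical toy env: guess a coin flip each step, capped at max_steps."""

	def __init__(self, max_steps: int = 10):
		self.max_steps = max_steps
		self.t = 0

	def reset(self):
		self.t = 0
		state, info = 0, {}
		return state, info

	def step(self, action):
		self.t += 1
		reward = 1.0 if action == random.randint(0, 1) else 0.0  # guessed the coin?
		next_state = self.t
		terminated = False                    # no natural end in this toy task
		truncated = self.t >= self.max_steps  # cut off by the step limit
		return next_state, reward, terminated, truncated, {}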

Q-Learning: From “Skeleton” to a Runnable Algorithm#

The Three Core Functions of Q-Learning#

In tabular Q-Learning we maintain a table $Q(s,a)$. Plugging it into the skeleton, you’ll find:

  • predict(state): choose $\arg\max_a Q(s,a)$
  • sample(state): add exploration on top of predict (epsilon-greedy / UCB, etc.)
  • update(...): the Bellman (TD) update
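
Written out, the TD update behind update(...) (and implemented verbatim in the code below) is:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

where $\alpha$ is the learning rate (lr in the config) and $\gamma$ is the discount factor.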

epsilon-greedy (The Most Common Exploration Strategy)#

Training is usually: explore more early and gradually converge later, i.e., decay $\epsilon$ from large to small.

Typically we set a triple:

  • epsilon_start: initial exploration rate (often 0.95)
  • epsilon_end: minimum exploration rate (often 0.01; keep a bit of exploration)
  • epsilon_decay: decay speed (too fast causes premature convergence/overfitting; too slow delays convergence)

A simple decay rule:

$\epsilon \leftarrow \max(\epsilon_{end}, \epsilon \cdot \epsilon_{decay})$
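
To get a feel for what “too fast” means: with epsilon_start = 0.95, epsilon_end = 0.01, and a per-step decay of 0.995, $\epsilon$ reaches its floor after roughly $\ln(0.01/0.95)/\ln(0.995) \approx 900$ steps, which may be only a handful of episodes in some environments. It is worth checking this number against how long your agent actually needs to see enough of the state space.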

A “Directly Reusable” Q-Learning Class#

The code below is my minimal implementation (discrete state/action). It’s written as a class to keep an interface consistent with deep algorithms like DQN.

import json
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QLearningConfig:
	gamma: float = 0.99
	lr: float = 0.1
	epsilon_start: float = 0.95
	epsilon_end: float = 0.01
	epsilon_decay: float = 0.995


class QLearningAgent:
	def __init__(self, n_actions: int, cfg: QLearningConfig):
		self.n_actions = n_actions
		self.cfg = cfg

		self.Q = defaultdict(lambda: [0.0 for _ in range(n_actions)])
		self.epsilon = cfg.epsilon_start

	def sample(self, state):
		# Exploration + exploitation
		if random.random() < self.epsilon:
			return random.randrange(self.n_actions)
		return self.predict(state)

	def predict(self, state):
		q = self.Q[state]
		return int(max(range(self.n_actions), key=lambda a: q[a]))

	def update(self, state, action, reward, next_state, terminated: bool):
		q_sa = self.Q[state][action]
		next_q_max = 0.0 if terminated else max(self.Q[next_state])
		target = reward + self.cfg.gamma * next_q_max
		self.Q[state][action] = q_sa + self.cfg.lr * (target - q_sa)

		# decay epsilon once per step (or per episode, both ok; step is finer-grained)
		self.epsilon = max(self.cfg.epsilon_end, self.epsilon * self.cfg.epsilon_decay)

	def save(self, path: str):
		with open(path, 'w', encoding='utf-8') as f:
			# JSON object keys must be strings, so stringify states when saving
			json.dump({str(k): v for k, v in self.Q.items()}, f, ensure_ascii=False)

	def load(self, path: str):
		with open(path, 'r', encoding='utf-8') as f:
			obj = json.load(f)
		self.Q = defaultdict(lambda: [0.0 for _ in range(self.n_actions)])
		for k, v in obj.items():
			# JSON stringified the keys; restore ints for discrete-state envs
			self.Q[int(k) if k.lstrip('-').isdigit() else k] = v
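
To tie the pieces together, here is a minimal usage sketch that plugs this agent into the train/evaluate templates above. It assumes Gymnasium is installed and uses FrozenLake-v1, whose integer observations can serve directly as Q-table keys; the episode counts are placeholders:

import gymnasium as gym

# Discrete states/actions, so the Q-table keys are plain ints
env = gym.make("FrozenLake-v1", is_slippery=False)
agent = QLearningAgent(n_actions=int(env.action_space.n), cfg=QLearningConfig())

train(env, agent, num_episodes=2000, max_steps=100)
avg_return = evaluate(env, agent, num_episodes=20, max_steps=100)
print(f"average eval return: {avg_return:.2f}")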

Sarsa: Only One update Different From Q-Learning#

The core difference between Sarsa and Q-Learning in one sentence:

  • Sarsa: update using the actually executed next action (on-policy)
  • Q-Learning: update using the assumed optimal next action (off-policy)

Written as update targets (only look at the difference):

  • Q-Learning: $r + \gamma \max_{a'} Q(s', a')$
  • Sarsa: $r + \gamma Q(s', a_{next})$, where $a_{next}$ is the action sampled by the current policy at $s'$

Minimal Sarsa Implementation (Only Show update Difference)#

Sarsa’s code structure is almost the same as Q-Learning; the key difference is that update needs next_action:

class SarsaAgent(QLearningAgent):
	def update(self, state, action, reward, next_state, next_action, terminated: bool):
		q_sa = self.Q[state][action]
		next_q = 0.0 if terminated else self.Q[next_state][next_action]
		target = reward + self.cfg.gamma * next_q
		self.Q[state][action] = q_sa + self.cfg.lr * (target - q_sa)
		self.epsilon = max(self.cfg.epsilon_end, self.epsilon * self.cfg.epsilon_decay)

Accordingly, the training loop changes slightly: sample the first action once before the loop, then inside the loop sample next_action at next_state before updating, and carry it over as the action actually executed in the next step.

state, info = env.reset()
action = agent.sample(state)  # sample the first action once, before the loop
for t in range(max_steps):
	next_state, reward, terminated, truncated, info = env.step(action)
	next_action = agent.sample(next_state)  # the action that will actually be executed next
	agent.update(state, action, reward, next_state, next_action, terminated)
	state, action = next_state, next_action
	if terminated or truncated:
		break

Summary#

If you only remember one thing, it should be this:

  • The training/evaluation loops can be fixed as templates; you mainly change sample/predict/update

In the next post, I’ll upgrade this template to a “deep RL” version: introduce a Replay Buffer and target networks, and then explain DQN and several common improvements (Double / Dueling / Noisy / PER).

My Usual Hyperparameter/Debugging Order (Very Practical)#

Finally, I’ll leave a short “troubleshooting workflow” that I use most often. The annoying part of reinforcement learning is that, unlike supervised learning, you can’t just glance at a curve and spot overfitting. It’s more like a stew: a problem at any step can show up as “reward doesn’t increase”. I usually locate issues in this order:

  1. Make sure the environment runs end-to-end: can a random policy finish an episode? Are terminated/truncated reasonable? What is the rough reward scale? (A short sanity-check sketch follows this list.)
  2. Make logs specific enough: per-episode return, epsilon per step/episode, and whether the episode ends early. For tabular methods, I also print the scale of max(Q[state]) to see if it blows up.
  3. Don’t decay epsilon too fast: my most common mistake is setting epsilon_decay too aggressive, causing “confidently doing nonsense” after only dozens of episodes. If reward fluctuates early and then quickly gets stuck, suspect insufficient exploration first.
  4. Match gamma and max_steps: when gamma is large, the effective return horizon is longer; if max_steps is short, many environments become “impossible to learn”. The opposite also holds: too long max_steps makes training slower with higher variance.
  5. Verify on a minimal environment first: FrozenLake / Taxi help quickly validate whether your training loop is correct. Confirm the skeleton first; moving to a complex environment later saves a lot of time.
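
For step 1, the sanity check can be as small as this sketch (assuming Gymnasium; swap in whatever environment you are debugging):

import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, info = env.reset()
ep_return, steps = 0.0, 0
while True:
	# random policy: if even this can't finish an episode, fix the env first
	action = env.action_space.sample()
	state, reward, terminated, truncated, info = env.step(action)
	ep_return += reward
	steps += 1
	if terminated or truncated:
		break
print(f"random policy: steps={steps} return={ep_return} terminated={terminated}")
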
RL Practice (1): Framework + Q-Learning & Sarsa
https://xiaohei-blog.vercel.app/en/blog/rl-algorithm-1
Author 红鼻子小黑
Published at May 1, 2025