

RL Notes (3): From MC and TD(0) to Sarsa / Q-learning
MC vs TD(0) basics, then Sarsa vs Q-learning; on-policy vs off-policy control intuition.
Preface#
Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space. Due to length constraints, the Mubu notes for Chapter 12 are kept there. If you find any logical mistakes, please let me know in the comments—thank you!
Start#
If you only learn RL by reading formulas, it’s easy to “understand it, but not be able to write it.” I’ve always felt tabular methods are the best training ground:
Their advantage is that you can pull your attention out of all the neural-network noise: no optimizer tricks, no distraction from normalization or gradient explosion. Every update you write is almost literally “translating the Bellman equation into code.” And concepts like on-policy/off-policy become obvious because the tabular world is transparent.
In this chapter, we follow the docs’ flow: first do model-free prediction (MC / TD), then transition to model-free control (Sarsa / Q-learning). At the end, I’ll add a few practical tips on “how not to crash in engineering with tabular methods.”
The premise of tabular methods: a lookup table can hold your world#
The docs put it plainly: the simplest policy representation is a lookup table, so the core resource cost of a tabular method is the size of that table, determined by:
- the number of states |S|
- the number of actions |A|
Once |S| × |A| becomes too large to fit in memory, or you can't enumerate the states at all (e.g., continuous states), you need deep methods.
But for getting started and validating intuition, tabular methods are still unbeatable.
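To make "a lookup table can hold your world" concrete, here is a tiny sketch; the sizes below are made up purely for illustration.

```python
import numpy as np

# Hypothetical sizes, just to make the table concrete.
num_states, num_actions = 500, 6
Q = np.zeros((num_states, num_actions))   # the entire "value representation" is this array

# Rough memory footprint: float64 entries, 8 bytes each.
print(Q.nbytes)   # 500 * 6 * 8 = 24000 bytes
```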
Model-free prediction: MC vs TD#
Monte Carlo policy evaluation (MC)#
The key characteristic of MC: wait until an episode finishes, then update V(s) toward the actual return G observed from that state (in practice, a running average of returns over visits).
Pros: the sampled return is an unbiased estimate of the true value. Cons: high variance, slow updates, and you must wait for the episode to end.
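To make this concrete, here is a minimal every-visit MC prediction sketch. It assumes a toy environment interface where env.reset() returns a state and env.step(action) returns (next_state, reward, done), plus a policy(state) function; these names are assumptions for illustration, not any particular library's API.

```python
from collections import defaultdict

def mc_evaluate(env, policy, num_episodes=1000, gamma=0.99):
    """Every-visit MC prediction: V(s) is a running average of observed returns."""
    V = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(num_episodes):
        # 1) Roll out a full episode first -- no updates happen until it ends.
        episode = []                                  # (state, reward) pairs
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            episode.append((state, reward))
            state = next_state
        # 2) Walk backwards to accumulate returns, then update the running mean.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            counts[state] += 1
            V[state] += (G - V[state]) / counts[state]
    return V
```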
One-step temporal difference TD(0)#
The docs give the TD target r + γ V(s′), i.e., the update V(s) ← V(s) + α [r + γ V(s′) − V(s)].
Its temperament is: “I won’t wait for the ending—I’ll update now using the next-step estimate.”
Pros: online updates, efficient. Cons: bootstrapping introduces bias, and it’s more sensitive to initialization and learning rate.
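A matching TD(0) sketch under the same toy environment interface as the MC example above:

```python
from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0) prediction: nudge V(s) toward the one-step target r + γ·V(s') after every step."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])   # update immediately, no waiting for the end
            state = next_state
    return V
```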
Model-free control: from V(s) to Q(s,a)#
To do control (find the optimal policy), V(s) alone is not enough because you need to compare actions.
The docs emphasize using Q(s, a) to judge "which action in which state yields the highest reward."
In the tabular world, Q(s, a) is simply a "state × action" table.
Exploration: why ε-greedy#
The docs mention an important assumption: to guarantee convergence of policy iteration, you usually need “exploring starts” or sufficient exploration.
The most common engineering choice is ε-greedy:
- with probability ε, pick a random action (explore)
- with probability 1 − ε, pick the current best action (exploit)
A practical suggestion:
- start with a larger ε (e.g., 0.8 or 1.0)
- then gradually decay it to a small but non-zero value (e.g., 0.05 or 0.1); see the sketch after this list
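A minimal ε-greedy sketch, assuming the Q-table is a plain dict keyed by (state, action); the helper names and the decay schedule are made up for illustration.

```python
import random

def epsilon_greedy(Q, state, num_actions, epsilon):
    """With probability ε explore; otherwise take argmax_a Q(state, a)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                          # explore
    return max(range(num_actions), key=lambda a: Q[(state, a)])       # exploit

def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponential decay toward a non-zero floor, e.g. epsilon = decayed_epsilon(i) each episode."""
    return max(end, start * decay ** episode)
```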
Sarsa: a typical on-policy control method#
The name Sarsa comes from the 5-tuple (s, a, r, s′, a′) used in the update: Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)].
It updates using “the action that will actually be executed next,” so it’s on-policy.
Intuition:
- if you execute ε-greedy, then the update also accounts for the “occasionally silly random actions”
- so it is more "cautious," and can be safer in risky environments (the classic example is cliff walking, where Sarsa learns to keep a margin away from the edge)
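A Sarsa sketch under the same assumptions as before (toy env interface, dict-based Q-table, and the epsilon_greedy helper from the previous sketch):

```python
from collections import defaultdict

def sarsa(env, num_actions, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy control: bootstrap with the action a' that will actually be executed next."""
    Q = defaultdict(float)                                             # keyed by (state, action)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, num_actions, epsilon)        # helper from the sketch above
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, num_actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action                    # the sampled a' is what we execute
    return Q
```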
Q-learning: a typical off-policy control method#
The docs explain the two policies in off-policy well:
- behavior policy: explores and collects data (can be ε-greedy)
- target policy: learns the optimal policy (usually greedy)
Q-learning uses the greedy bootstrap max_{a′} Q(s′, a′) in the update, Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)], corresponding to "I assume the future will always take the optimal action."
Intuition:
- it can explore more boldly because the learning target is greedily optimal
- but it is also more prone to overestimation (even more obvious in deep versions like DQN)
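A Q-learning sketch under the same assumptions; compared with Sarsa, essentially only the bootstrap target changes:

```python
from collections import defaultdict

def q_learning(env, num_actions, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy control: behave with ε-greedy, but bootstrap with the greedy max_a' Q(s', a')."""
    Q = defaultdict(float)                                             # keyed by (state, action)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, num_actions, epsilon)    # behavior policy
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in range(num_actions))
            target = reward + (0.0 if done else gamma * best_next)     # greedy target policy
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```

Data collection still uses ε-greedy; only the learning target is greedy, which is exactly the behavior/target split described above.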
Chapter summary: tabular methods are “interpretable RL”#
The biggest value of tabular methods is not solving the most complex problems, but making RL’s key questions very transparent:
- are you exploring?
- are you updating with the actual action (on-policy) or the optimal action (off-policy)?
- are your learning rate α and discount factor γ reasonable?
Starting next chapter, we’ll switch to another main line: policy gradients. When the action space becomes continuous, or when we want to directly learn a stochastic policy distribution, PG can feel more natural than a Q table.