

Preface#
Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space; due to length limits, the Mubu notes for Chapter 12 live there as well. If you spot any logical mistakes, please let me know in the comments. Thank you!
Start#
I think sparse rewards are one of the most “maddening” scenarios in RL:
- you train for millions of steps and the reward is still 0;
- you don’t even know whether your algorithm is wrong, or the environment is just too hard;
- you want to debug, but there’s no signal.
The docs give a very practical idea in this chapter:
Either design the reward to be denser (reward shaping), so the agent can see progress earlier; or introduce intrinsic rewards (curiosity-driven rewards), so that when external feedback is near zero, “novelty” keeps the learning signal alive.
I’ll explain it in that order, and I’ll focus on breaking down the ICM structure because it’s a prototype for many intrinsic-motivation methods.
1) Reward Shaping#
The docs frame reward design as a way of guiding the agent, which is accurate: you add denser feedback to the environment so the agent can tell whether it is making progress.
Common examples:
- robot locomotion: besides rewarding “not falling,” also reward forward distance
- navigation: reward each step’s reduction in distance to the goal (sketched in code below)
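As a concrete sketch of the navigation example: a gymnasium-style wrapper that adds a dense bonus for getting closer to the goal. The `agent_pos` and `goal_pos` attributes are hypothetical stand-ins for whatever your environment actually exposes.

```python
import numpy as np
import gymnasium as gym

class DistanceShapingWrapper(gym.Wrapper):
    """Adds a dense shaping bonus: positive reward for reducing distance to the goal.

    Assumes the wrapped env exposes hypothetical `agent_pos` / `goal_pos`
    attributes on its unwrapped instance -- adapt to your environment.
    """
    def __init__(self, env, scale=1.0):
        super().__init__(env)
        self.scale = scale
        self._prev_dist = None

    def _dist(self):
        u = self.env.unwrapped
        return float(np.linalg.norm(np.asarray(u.agent_pos) - np.asarray(u.goal_pos)))

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._prev_dist = self._dist()
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        dist = self._dist()
        # Dense signal: how much closer did this step bring us to the goal?
        reward += self.scale * (self._prev_dist - dist)
        self._prev_dist = dist
        return obs, reward, terminated, truncated, info
```

One nice property: with no discounting, this bonus is a difference of potentials (negative distance), i.e. potential-based shaping in the sense of Ng et al. (1999), which is known not to change the optimal policy.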
2) Curiosity-driven reward and ICM#
The docs mention ICM (Intrinsic Curiosity Module): give the agent an extra “curiosity” reward function on top of the external reward.
The core intuition of ICM is beautiful:
If the next state is hard to predict, it means you reached something “novel,” so you get reward.
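One common way to make that intuition precise (the exact form and scaling vary across papers): let $f$ be a learned dynamics model, and pay the agent its prediction error,

$$
r^i_t = \frac{1}{2}\,\bigl\lVert f(s_t, a_t) - s_{t+1} \bigr\rVert_2^2 .
$$

Transitions the model already predicts well earn almost nothing; genuinely new situations earn a lot.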
Inputs and outputs of ICM#
The structure in the docs: input the transition $(s_t, a_t, s_{t+1})$, output the intrinsic reward $r^i_t$.
And the total reward used for training is:

$$
r_t = r^e_t + \eta\, r^i_t
$$

where $r^e_t$ is the external (environment) reward, $r^i_t$ is the intrinsic reward, and $\eta$ is the intrinsic reward weight.
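A minimal sketch of that interface in PyTorch (hypothetical names; assumes discrete actions passed as one-hot vectors). The forward model is trained on collected transitions to minimize the same squared error that we hand back as the intrinsic reward:

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts s_{t+1} from (s_t, a_t); its prediction error is the curiosity signal."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, state_dim))

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1))

def total_reward(r_ext, state, action_onehot, next_state, model, eta=0.01):
    """r_t = r^e_t + eta * r^i_t, with r^i_t the squared prediction error."""
    with torch.no_grad():
        r_int = 0.5 * (model(state, action_onehot) - next_state).pow(2).sum(dim=-1)
    return r_ext + eta * r_int
    # (elsewhere, the model itself is trained to minimize that same error
    #  on replayed transitions)
```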
Add a feature extractor: filter meaningless noise#
The docs also point out a key issue: curiosity alone is not enough. The agent may get addicted to “noise” (like TV static), i.e. something it can never learn to predict, so the prediction error, and therefore the reward, stays high forever.
The solution: add a feature extractor that maps states into a more meaningful feature space, and compute the prediction error there. If the features only keep what the agent’s actions can influence, unpredictable noise gets filtered out.
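A sketch of that idea (again PyTorch, hypothetical names, discrete actions): the encoder $\phi$ is trained by an inverse model that predicts the action from $(\phi(s_t), \phi(s_{t+1}))$, so $\phi$ only keeps action-relevant features; the forward model’s error, computed in this feature space, becomes the curiosity reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Minimal sketch of an Intrinsic Curiosity Module.

    encoder phi   -- maps raw states to features
    inverse model -- predicts a_t from (phi(s_t), phi(s_{t+1})); trains phi to keep
                     only action-relevant features, filtering uncontrollable noise
    forward model -- predicts phi(s_{t+1}) from (phi(s_t), a_t); its error is r^i_t
    """
    def __init__(self, state_dim, action_dim, feat_dim=64):
        super().__init__()
        self.action_dim = action_dim
        self.encoder = nn.Sequential(nn.Linear(state_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))
        self.inverse_model = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                           nn.Linear(128, action_dim))
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + action_dim, 128), nn.ReLU(),
                                           nn.Linear(128, feat_dim))

    def losses_and_reward(self, s, a, s_next):
        """s, s_next: float state batches; a: LongTensor of action indices."""
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_onehot = F.one_hot(a, num_classes=self.action_dim).float()
        # Inverse loss shapes the features: which action caused this transition?
        logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inv_loss = F.cross_entropy(logits, a)
        # Forward loss in feature space; the encoder is detached here (a common
        # variant) so only the inverse model trains phi -- noise stays out.
        phi_pred = self.forward_model(torch.cat([phi.detach(), a_onehot], dim=-1))
        fwd_err = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=-1)
        return inv_loss, fwd_err.mean(), fwd_err.detach()  # last one is r^i_t per sample
```

TV static maximizes raw-pixel prediction error but carries no information about the agent’s actions, so the inverse model has no reason to encode it, and the feature-space error stays low.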
Chapter summary: if there’s no signal, create one first#
Sparse reward problems are essentially “too little learning signal.” Reward shaping and curiosity are two ways to create signal:
- shaping is more direct, but depends more on human design
- curiosity is more general, but it can more easily go off track (the noisy-TV trap above)
Next chapter we’ll talk about imitation learning: when you don’t even want to design rewards, or rewards are hard to define, use expert demonstrations to directly teach the agent what to do.