
Extended Data Fig. 2: Learning the distribution of returns improves performance of deep RL agents across multiple domains.

From: A distributional code for value in dopamine-based reinforcement learning


a, DQN and distributional TD agents share an identical nonlinear network architecture. b, c, After training a classical or a distributional DQN agent on MsPacman, we freeze the agent and then train a separate linear decoder to reconstruct frames from the agent’s final-layer representation. Reconstructions are shown for each agent. The distributional model’s representation supports substantially better reconstruction. d, At a single frame of MsPacman (not shown), the agent’s value predictions together represent a probability distribution over future rewards. Reward predictions of individual RPE channels are shown as tick marks, ranging from pessimistic (blue) to optimistic (red), and a kernel density estimate is shown in black. e, Atari-57 experiments, with single runs of prioritized experience replay (ref. 40) and double DQN (ref. 41) agents shown for reference. The benefit of distributional learning exceeds that of these other popular innovations. f, g, The performance pay-off of distributional RL can be seen across a wide diversity of tasks. Here we give another example, a humanoid motor-control task in the MuJoCo physics simulator. A prioritized experience replay agent (ref. 14) is shown for reference. Traces show individual runs; averages are shown in bold.
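
The decoding analysis in panels b and c can be illustrated with a minimal sketch: fit a linear (here, closed-form ridge) decoder from a frozen agent’s final-layer features to the corresponding frames, then reconstruct. The arrays, dimensions and regularization strength below are illustrative placeholders, not the paper’s data or exact method.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: final-layer activations of a frozen agent and the
# flattened greyscale frames they were computed from.
n_frames, n_features = 1000, 512          # e.g. the size of DQN's final hidden layer
frame_shape = (84, 84)                    # standard Atari preprocessing resolution
features = rng.standard_normal((n_frames, n_features))
frames = rng.random((n_frames, frame_shape[0] * frame_shape[1]))

# Ridge-regression decoder: W = (X^T X + lam I)^{-1} X^T Y, with a bias column.
lam = 1e-2
X = np.hstack([features, np.ones((n_frames, 1))])
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ frames)

# Reconstructed frames and mean reconstruction error.
reconstructions = (X @ W).reshape(n_frames, *frame_shape)
mse = np.mean((X @ W - frames) ** 2)
print(f"mean reconstruction error: {mse:.4f}")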
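Panel d summarizes a set of per-channel reward predictions with a kernel density estimate. The sketch below shows one way to produce such a summary, assuming the channel predictions are available as a simple array; the values used here are made up for illustration.

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical reward predictions of individual channels,
# ordered from pessimistic to optimistic (the tick marks in panel d).
channel_predictions = np.array([0.1, 0.3, 0.4, 0.6, 0.9, 1.4, 2.0, 3.1])

# Kernel density estimate over the predictions (the black curve in panel d).
kde = gaussian_kde(channel_predictions)
grid = np.linspace(channel_predictions.min() - 1.0, channel_predictions.max() + 1.0, 200)
density = kde(grid)

print(grid[np.argmax(density)])  # mode of the implied return distribution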
