Extended Data Fig. 8: Different methods for calculating reward expectation produce similar results.

From: Dissociable dopamine dynamics for learning and motivation


Left column, average firing rate of dopamine cells around Side-in, broken down by terciles of reward expectation, based either on recent reward rate (top; same as Fig. 5a), the number of rewards in the previous ten trials, the state value (V) of an actor-critic model, or the state value (Q_left + Q_right) of a Q-learning model. The actor-critic and Q-learning models were both trial-based, rather than evolving continuously in time. The actor-critic model estimated the overall probability of receiving a reward on each trial, V, using the update rule V′ = V + α·RPE, in which RPE = actual reward [1 or 0] − V. The Q-learning model kept separate estimates of the probabilities of receiving rewards for left and right choices (Q_left and Q_right) and updated Q for the chosen action only, using Q′ = Q + α·RPE, in which RPE = actual reward [1 or 0] − Q. The learning-rate parameter α was determined for each session by the best fit of V or (Q_left + Q_right), respectively, to latencies. The subsequent columns show correlations between reward expectation and dopamine cell firing after Side-in, measured as peak firing rate (top; within 250 ms after rewarded Side-in), minimum firing rate (middle; within 2 s after unrewarded Side-in) or pause duration (bottom; maximum inter-spike interval within 2 s after unrewarded Side-in). For all histograms, light blue indicates cells with significant correlations (P < 0.01) before multiple-comparisons correction; dark blue indicates cells that remained significant after correction. Positive RPE coding is strong and consistent; negative RPE coding is less so.
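For readers who want a concrete picture of the two trial-based update rules, the Python sketch below implements them in minimal form. It is not the authors' code: the function names are illustrative, and a fixed learning rate α is assumed here, whereas in the analysis α was fitted per session to latencies.

```python
import numpy as np

# Minimal sketch (assumed names, not the authors' code) of the two
# trial-based value learners described in the caption.

def actor_critic_value(rewards, alpha=0.1):
    """Single state value V across trials: V' = V + alpha * (r - V)."""
    V = 0.0
    values = []
    for r in rewards:            # r is 1 (rewarded) or 0 (unrewarded)
        values.append(V)         # expectation held going into this trial
        V += alpha * (r - V)     # RPE = actual reward - V
    return np.array(values)

def q_learning_value(choices, rewards, alpha=0.1):
    """Separate Q_left and Q_right; only the chosen action's Q is updated."""
    Q = {"left": 0.0, "right": 0.0}
    values = []
    for c, r in zip(choices, rewards):
        values.append(Q["left"] + Q["right"])  # state value used in the figure
        Q[c] += alpha * (r - Q[c])             # RPE = actual reward - Q_chosen
    return np.array(values)

# Example: three trials (left, left, right) with outcomes 1, 0, 1
v = actor_critic_value([1, 0, 1])
q = q_learning_value(["left", "left", "right"], [1, 0, 1])
```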
