
n-step Bootstrapping

n-step TD Prediction

Recall that our one-step return used for $\operatorname{TD}(0)$ was:

$$G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$$

We can generalize this to the $n$-step case as follows:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$$

for all $n, t$ such that $n \geq 1$ and $0 \leq t \leq T-n$. All $n$-step returns can be considered approximations to the full return, truncated after $n$ steps and then corrected for the remaining missing terms by $V_{t+n-1}(S_{t+n})$. If the $n$-step return extends to or beyond termination, then all the missing terms are taken as zero.

Pseudocode
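Below is a minimal tabular Python sketch of $n$-step TD prediction as described above. It assumes a hypothetical environment interface (`env.reset()` returns an integer state, `env.step(action)` returns `(next_state, reward, done)`) and a `policy(state)` function; these names and the tabular setup are illustrative, not part of the original pseudocode.

```python
import numpy as np

def n_step_td_prediction(env, policy, n, alpha=0.1, gamma=0.99,
                         num_states=100, num_episodes=500):
    """n-step TD prediction for estimating V ~ v_pi (tabular sketch).

    Assumes `env.reset() -> state`, `env.step(a) -> (next_state, reward, done)`,
    and that states are integers indexing into V (illustrative interface).
    """
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = [env.reset()]
        rewards = [0.0]          # placeholder so rewards[i] is R_i, as in the text
        T = float('inf')
        t = 0
        while True:
            if t < T:
                action = policy(states[t])
                next_state, reward, done = env.step(action)
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1      # time whose state estimate is being updated
            if tau >= 0:
                # G_{tau:tau+n}: discounted rewards from tau+1 to min(tau+n, T) ...
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:  # ... plus the bootstrap term gamma^n V(S_{tau+n})
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

The dummy `rewards[0]` entry just keeps list indices aligned with the book's convention that the first reward of the episode is $R_1$; if the $n$-step window reaches termination, the bootstrap term is skipped, matching the text.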

The $n$-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond $R_{t+n}$. An important property of $n$-step returns is that their expectation is guaranteed to be a better estimate of $v_\pi$ than $V_{t+n-1}$ is, in a worst-state sense. That is, the worst error of the expected $n$-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:

$$\max_s \big| \mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s) \big| \leq \gamma^n \max_s \big| V_{t+n-1}(s) - v_\pi(s) \big|$$

for all $n \geq 1$. This is called the error reduction property of $n$-step returns. Because of the error reduction property, one can show formally that all $n$-step TD methods converge to the correct predictions under appropriate technical conditions. The $n$-step TD methods thus form a family of sound methods, with one-step TD methods and Monte Carlo methods as extreme members.
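To see where this bound comes from (a sketch, treating $V_{t+n-1}$ as fixed and using the $n$-step Bellman expansion $v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=1}^{n}\gamma^{k-1}R_{t+k} + \gamma^n v_\pi(S_{t+n}) \mid S_t = s\right]$): the first $n$ reward terms cancel in expectation, leaving only the bootstrap error,

$$
\begin{aligned}
\mathbb{E}_\pi\!\left[G_{t:t+n} \mid S_t = s\right] - v_\pi(s)
  &= \mathbb{E}_\pi\!\left[\textstyle\sum_{k=1}^{n} \gamma^{k-1} R_{t+k} + \gamma^n V_{t+n-1}(S_{t+n}) \,\middle|\, S_t = s\right]
   - \mathbb{E}_\pi\!\left[\textstyle\sum_{k=1}^{n} \gamma^{k-1} R_{t+k} + \gamma^n v_\pi(S_{t+n}) \,\middle|\, S_t = s\right] \\
  &= \gamma^n\, \mathbb{E}_\pi\!\left[V_{t+n-1}(S_{t+n}) - v_\pi(S_{t+n}) \,\middle|\, S_t = s\right],
\end{aligned}
$$

whose absolute value is at most $\gamma^n \max_{s'} |V_{t+n-1}(s') - v_\pi(s')|$ for every $s$.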

n-step Sarsa

The main idea is to simply switch states for actions (state-action pairs) and then use an $\varepsilon$-greedy policy. The backup diagrams for $n$-step Sarsa (shown in Figure 7.3), like those of $n$-step TD (Figure 7.1), are strings of alternating states and actions, except that the Sarsa ones all start and end with an action rather than with a state. We redefine the $n$-step returns (update targets) in terms of estimated action values:

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \quad n \geq 1,\ 0 \leq t < T-n$$

with $G_{t:t+n} \doteq G_t$ if $t+n \geq T$. The natural algorithm is then

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \big[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \big], \quad 0 \leq t < T, \tag{7.5}$$

while the values of all other state-action pairs remain unchanged: $Q_{t+n}(s, a) = Q_{t+n-1}(s, a)$ for all $s, a$ such that $s \neq S_t$ or $a \neq A_t$. This is the algorithm we call $n$-step Sarsa. Pseudocode is shown in the box below, and an example of why it can speed up learning compared to one-step methods is given in Figure 7.4.

Pseudocode
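Here is a tabular sketch of $n$-step Sarsa using the target in (7.5), with an $\varepsilon$-greedy policy over `Q`. The `env` interface is the same hypothetical one as in the prediction sketch above; all names are illustrative.

```python
import numpy as np

def epsilon_greedy(Q, state, num_actions, epsilon, rng):
    """Random action with probability epsilon, else greedy w.r.t. Q."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(Q[state]))

def n_step_sarsa(env, n, num_states, num_actions, alpha=0.1, gamma=0.99,
                 epsilon=0.1, num_episodes=500, seed=0):
    """Tabular n-step Sarsa sketch (update target of Eq. 7.5)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        state = env.reset()
        states = [state]
        actions = [epsilon_greedy(Q, state, num_actions, epsilon, rng)]
        rewards = [0.0]                     # placeholder so rewards[i] is R_i
        T, t = float('inf'), 0
        while True:
            if t < T:
                next_state, reward, done = env.step(actions[t])
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(epsilon_greedy(Q, next_state, num_actions, epsilon, rng))
            tau = t - n + 1                 # time whose estimate is being updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:             # bootstrap with Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```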

n-step Off-policy Learning

In $n$-step methods, returns are constructed over $n$ steps, so we are interested in the relative probability of just those $n$ actions. For example, to make a simple off-policy version of $n$-step TD, the update for time $t$ (actually made at time $t+n$) can simply be weighted by $\rho_{t:t+n-1}$:

$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \big[ G_{t:t+n} - V_{t+n-1}(S_t) \big], \quad 0 \leq t < T,$$

where $\rho_{t:t+n-1}$, called the importance sampling ratio, is the relative probability under the two policies of taking the $n$ actions from $A_t$ to $A_{t+n-1}$ (cf. Eq. 5.3):

$$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

Pseudocode
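A sketch of the off-policy $n$-step TD update above, weighting each update by $\rho_{\tau:\tau+n-1}$. It assumes hypothetical arrays `pi_probs[s]` and `b_probs[s]` giving the action probabilities of the target policy $\pi$ and the behavior policy $b$ in state `s`; the `env` interface and all names are illustrative.

```python
import numpy as np

def off_policy_n_step_td(env, pi_probs, b_probs, n, num_states, num_actions,
                         alpha=0.1, gamma=0.99, num_episodes=500, seed=0):
    """Off-policy n-step TD prediction sketch (importance-sampling weighted).

    `pi_probs[s]` and `b_probs[s]` are length-`num_actions` probability vectors
    for the target policy pi and behavior policy b (assumed interface).
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        state = env.reset()
        states, actions, rewards = [state], [], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                action = rng.choice(num_actions, p=b_probs[states[t]])  # act with b
                actions.append(action)
                next_state, reward, done = env.step(action)
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1
            if tau >= 0:
                # rho_{tau:tau+n-1}: product of pi/b over the n taken actions
                rho = 1.0
                for k in range(tau, min(tau + n - 1, T - 1) + 1):
                    rho *= pi_probs[states[k]][actions[k]] / b_probs[states[k]][actions[k]]
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * rho * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```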

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

In this tree-backup update, the target includes all the rewards plus the estimated values of the dangling action nodes hanging off the sides, at all levels. It is an update from the entire tree of estimated action values. For each state node in the tree-backup diagram, the estimated values of the non-selected actions are weighted by their probabilities of being selected under the target policy $\pi$. The value of the selected action does not contribute at all at this stage; instead, its probability of being selected weights the reward received on transitioning to the next state and the non-selected actions at that next state, which in turn are weighted by their own probabilities of being selected, as described previously. Formally, the one-step return is as follows:

$$G_{t:t+1} \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a)$$

for $t < T-1$, and the two-step tree-backup return is

$$
\begin{aligned}
G_{t:t+2} &\doteq R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) \\
&\quad + \gamma\, \pi(A_{t+1} \mid S_{t+1}) \left( R_{t+2} + \gamma \sum_a \pi(a \mid S_{t+2})\, Q_{t+1}(S_{t+2}, a) \right) \\
&= R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+2},
\end{aligned}
$$

for $t < T-2$. The latter form suggests the general recursive definition of the tree-backup $n$-step return:

$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n},$$

for $t < T-1$, $n \geq 2$. This target is then used with the usual action-value update rule from $n$-step Sarsa:

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \big[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \big],$$

for $0 \leq t < T$, while the values of all other state-action pairs remain unchanged: $Q_{t+n}(s, a) = Q_{t+n-1}(s, a)$ for all $s, a$ such that $s \neq S_t$ or $a \neq A_t$.

Pseudocode
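The following is a sketch of $n$-step tree backup built by unrolling the recursive return above. For concreteness it assumes an $\varepsilon$-greedy behavior policy and a greedy target policy over `Q`; any target distribution $\pi(\cdot \mid s)$ would work, and the `env` interface and all names are illustrative.

```python
import numpy as np

def n_step_tree_backup(env, n, num_states, num_actions, alpha=0.1, gamma=0.99,
                       epsilon=0.1, num_episodes=500, seed=0):
    """Tabular n-step tree-backup sketch.

    Behavior policy: epsilon-greedy over Q.  Target policy: greedy over Q
    (an assumption for this sketch; any distribution pi(.|s) would do).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))

    def target_probs(state):
        # Greedy target policy pi(.|state); replace with any distribution.
        probs = np.zeros(num_actions)
        probs[int(np.argmax(Q[state]))] = 1.0
        return probs

    def behavior_action(state):
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state = env.reset()
        states, actions, rewards = [state], [behavior_action(state)], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                next_state, reward, done = env.step(actions[t])
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(behavior_action(next_state))
            tau = t - n + 1
            if tau >= 0:
                # Build the target G_{tau:tau+n} by unrolling the recursion,
                # starting from the deepest step of the n-step window.
                if t + 1 >= T:
                    G = rewards[T]
                else:
                    s_last = states[t + 1]
                    G = rewards[t + 1] + gamma * float(np.dot(target_probs(s_last), Q[s_last]))
                for k in range(min(t, T - 1), tau, -1):
                    pi_k = target_probs(states[k])
                    # expected value of the non-selected ("dangling") actions at S_k
                    leaves = float(np.dot(pi_k, Q[states[k]])) - pi_k[actions[k]] * Q[states[k], actions[k]]
                    G = rewards[k] + gamma * leaves + gamma * pi_k[actions[k]] * G
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```

Because every action node is weighted by its target-policy probability rather than by an importance sampling ratio, this update learns off-policy without importance sampling, as the section title indicates.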