
Temporal-Difference

State Value

$$\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t) \Big[ \overbrace{v_t(s_t) - [\underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar{v}_t}]}^{\text{TD error } \delta_t} \Big]$$
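
A minimal tabular sketch of this update (the function name `td0_update` and integer-indexed states are illustrative assumptions, not from the source):

```python
import numpy as np

def td0_update(v, s, r, s_next, alpha, gamma):
    """One TD(0) update of a tabular state-value estimate v."""
    td_target = r + gamma * v[s_next]   # TD target: r_{t+1} + γ v_t(s_{t+1})
    td_error = v[s] - td_target         # TD error δ_t
    v[s] -= alpha * td_error            # move v(s_t) toward the target
    return v

# Example: 5 states, one observed transition s=0 -> s'=1 with reward 1.0
v = np.zeros(5)
v = td0_update(v, s=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
```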

Action Value

Sarsa

On-policy TD control:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \big[ q_t(s_t, a_t) - [r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})] \big]$$
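
A sketch of one Sarsa step, assuming a tabular `q` indexed as `q[s, a]` (names are illustrative); `a_next` is sampled from the same policy being improved, which is what makes it on-policy:

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa update; a_next comes from the current policy itself."""
    target = r + gamma * q[s_next, a_next]
    q[s, a] -= alpha * (q[s, a] - target)
    return q
```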

Expected Sarsa

Reduces the variance of Sarsa by taking an expectation over the next action:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big[ r_{t+1} + \gamma \sum_{a \in \mathcal{A}} \pi_t(a \mid s_{t+1}) \, q_t(s_{t+1}, a) \big] \Big]$$
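
A sketch under the same tabular assumptions, where `pi_next` is a vector with `pi_next[a]` equal to $\pi_t(a \mid s_{t+1})$; averaging over the policy removes the sampling noise of $a_{t+1}$:

```python
import numpy as np

def expected_sarsa_update(q, s, a, r, s_next, pi_next, alpha, gamma):
    """Replace the sampled q(s', a') with its expectation under π_t(·|s')."""
    expected_q = float(np.dot(pi_next, q[s_next]))  # Σ_a π(a|s') q(s', a)
    q[s, a] -= alpha * (q[s, a] - (r + gamma * expected_q))
    return q
```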

$n$-Step Sarsa

Uses the $n$-step return as the TD target:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \big[ q_t(s_t, a_t) - [r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})] \big]$$
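
A sketch of the $n$-step target, assuming `rewards` holds the collected $[r_{t+1}, \dots, r_{t+n}]$ (the helper name is illustrative):

```python
def n_step_sarsa_update(q, s, a, rewards, s_n, a_n, alpha, gamma):
    """n-step target: r_{t+1} + γ r_{t+2} + ... + γ^n q(s_{t+n}, a_{t+n})."""
    n = len(rewards)
    target = sum(gamma**k * r for k, r in enumerate(rewards))
    target += gamma**n * q[s_n, a_n]
    q[s, a] -= alpha * (q[s, a] - target)
    return q
```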

Q-Learning

Off-policy TD control:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \big[ q_t(s_t, a_t) - [r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a)] \big]$$
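
A sketch of one Q-learning step under the same tabular assumptions; the max over actions means the target no longer depends on which action the behavior policy actually takes next:

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the greedy action, max_a q(s', a)."""
    target = r + gamma * float(np.max(q[s_next]))
    q[s, a] -= alpha * (q[s, a] - target)
    return q
```
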
Off-Policy

Off-policy control: the behavior policy $\neq$ the target policy.
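
A sketch of the two roles, assuming an $\varepsilon$-greedy behavior policy and a greedy target policy over a tabular `q` (both function names are illustrative):

```python
import numpy as np

def behavior_policy(q, s, eps=0.1):
    """Generates experience; explores with probability eps."""
    if np.random.rand() < eps:
        return np.random.randint(q.shape[1])
    return int(np.argmax(q[s]))

def target_policy(q, s):
    """The policy being evaluated and improved; greedy w.r.t. q."""
    return int(np.argmax(q[s]))
```

Q-learning learns about `target_policy` while acting with `behavior_policy`; in Sarsa the two coincide.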