# Temporal-Difference

## State Value

TD(0) update for the state value:

$$
\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t) \Big[ \overbrace{v_t(s_t) - [\underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar{v}_t}]}^{\text{TD error } \delta_t} \Big]
$$

## Action Value

### Sarsa

On-policy TD control:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \big[ q_t(s_t, a_t) - [r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})] \big]
$$

### Expected Sarsa

Reduces the variance of Sarsa by replacing the sampled next-action value with its expectation under the current policy:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big[ r_{t+1} + \gamma \sum_{a \in \mathcal{A}} \pi_t(a \mid s_{t+1}) q_t(s_{t+1}, a) \big] \Big]
$$

### $n$-Step Sarsa

Uses the $n$-step return as the TD target:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \big[ q_t(s_t, a_t) - [r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})] \big]
$$

### Q-Learning

Off-policy TD control:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a) \big] \Big]
$$

## Off-Policy

Off-policy control: the behavior policy that generates the experience is not the same as the target policy being improved, i.e. behavior policy $\neq$ target policy.
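The one-step methods above differ only in how the TD target is formed. Below is a minimal tabular sketch of the TD(0), Sarsa, Expected Sarsa, and Q-learning updates, assuming states and actions are indexed by integers and value estimates are stored in NumPy arrays; the function names and array layout are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def td0_update(v, s, r, s_next, alpha, gamma):
    """TD(0): v(s) <- v(s) - alpha * [v(s) - (r + gamma * v(s'))]."""
    td_target = r + gamma * v[s_next]
    td_error = v[s] - td_target
    v[s] -= alpha * td_error
    return td_error

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy Sarsa: the TD target uses the action actually taken at s'."""
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] -= alpha * (q[s, a] - td_target)

def expected_sarsa_update(q, s, a, r, s_next, pi_next, alpha, gamma):
    """Expected Sarsa: the TD target averages q(s', .) under pi(.|s'),
    removing the sampling noise of a_next and lowering variance."""
    td_target = r + gamma * np.dot(pi_next, q[s_next])
    q[s, a] -= alpha * (q[s, a] - td_target)

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy Q-learning: the TD target bootstraps from the greedy action at s'."""
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] -= alpha * (q[s, a] - td_target)
```

With these targets, Sarsa and Expected Sarsa stay on-policy (the target is built from the policy actually followed), while Q-learning bootstraps from the greedy action at $s_{t+1}$ and can therefore learn from data generated by a different behavior policy.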
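For $n$-step Sarsa, the single-step bootstrap in the target is replaced by the $n$-step return. A sketch of that target computation under the same tabular assumptions, with hypothetical function names:

```python
def n_step_sarsa_target(rewards, q, s_n, a_n, gamma):
    """n-step Sarsa target: r_{t+1} + gamma*r_{t+2} + ... + gamma^n * q(s_{t+n}, a_{t+n}).
    `rewards` holds [r_{t+1}, ..., r_{t+n}] collected while following the policy."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    g += (gamma ** len(rewards)) * q[s_n, a_n]
    return g

def n_step_sarsa_update(q, s, a, target, alpha):
    """Apply q(s_t, a_t) <- q(s_t, a_t) - alpha * [q(s_t, a_t) - target]."""
    q[s, a] -= alpha * (q[s, a] - target)
```

Setting the buffer length to 1 recovers one-step Sarsa, and larger $n$ trades bootstrap bias for more Monte Carlo-like variance.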