Bellman Equation

State Value

Reward Expectation

The expected return starting from state $s$ under policy $\pi$, where $G_t = R_{t+1} + \gamma R_{t+2} + \cdots$ is the discounted return:

$$v_\pi(s) = \mathbb{E}[G_t | S_t = s]$$
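
A minimal sketch of this definition, assuming $G_t = \sum_{k \ge 0} \gamma^k R_{t+k+1}$: the return of a single trajectory is the discounted sum of its rewards, and $v_\pi(s)$ averages this over trajectories. The reward sequence and discount below are hypothetical.

```python
# Discounted return of one sampled trajectory (hypothetical rewards).
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        # Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # 0.9**2 * 1.0 = 0.81
```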

Total Expectation

By the law of total expectation, the state value decomposes into the expected immediate reward plus the discounted value of the successor state:

$$v_\pi(s) = \sum\limits_{a \in \mathcal{A}} \pi(a | s) \left[ \sum\limits_{r \in \mathcal{R}} p(r | s, a)\, r + \gamma \sum\limits_{s' \in \mathcal{S}} p(s' | s, a)\, v_\pi(s') \right]$$

As a special case, when the reward is determined by the state transition $(s, a, s')$:

$$v_\pi(s) = \sum\limits_{a \in \mathcal{A}} \pi(a | s) \sum\limits_{s' \in \mathcal{S}} p(s' | s, a) \left[ r(s, a, s') + \gamma v_\pi(s') \right]$$
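
This form can be solved by fixed-point iteration, since the right-hand side is a $\gamma$-contraction. A sketch under that assumption, on a hypothetical 2-state, 2-action MDP (all of `P`, `R`, `pi`, and the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = p(s' | s, a), R[s, a, s'] = r(s, a, s').
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])
pi = np.array([[0.5, 0.5],   # pi(a | s), rows sum to 1
               [0.4, 0.6]])
gamma = 0.9

# Iterate v(s) <- sum_a pi(a|s) sum_s' p(s'|s,a) [r(s,a,s') + gamma v(s')]
v = np.zeros(2)
for _ in range(1000):
    v = np.einsum('sa,sap,sap->s', pi, P, R + gamma * v)
print(v)
```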

Action Value

Q-Function

$$\mathbb{E}[G_t | S_t = s] = \sum\limits_{a \in \mathcal{A}} \pi(a | s)\, \mathbb{E}[G_t | S_t = s, A_t = a]$$

which gives

$$v_\pi(s) = \sum\limits_{a \in \mathcal{A}} \pi(a | s)\, q_\pi(s, a)$$

$$q_\pi(s, a) = \sum\limits_{r \in \mathcal{R}} p(r | s, a)\, r + \gamma \sum\limits_{s' \in \mathcal{S}} p(s' | s, a)\, v_\pi(s')$$
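
Continuing the hypothetical MDP sketch above, $q_\pi$ follows from $v_\pi$ in one step, and the identity just derived doubles as a consistency check:

```python
# q(s, a) = E[r | s, a] + gamma * sum_s' p(s'|s,a) v(s')
q = np.einsum('sap,sap->sa', P, R) + gamma * np.einsum('sap,p->sa', P, v)
# Consistency check: v(s) = sum_a pi(a|s) q(s, a)
assert np.allclose(v, np.einsum('sa,sa->s', pi, q))
```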

Matrix-Vector Form

Let $\boldsymbol{P}_\pi$ be the state transition matrix under policy $\pi$, with $[\boldsymbol{P}_\pi]_{ss'} = \sum\limits_{a \in \mathcal{A}} \pi(a | s)\, p(s' | s, a)$, and let $\boldsymbol{r}_\pi$ stack the expected immediate rewards:

$$\boldsymbol{v}_\pi = \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v}_\pi$$

Existence and uniqueness of the solution: $\boldsymbol{P}_\pi$ is a stochastic matrix with spectral radius $1$, so for $\gamma < 1$ the matrix $\boldsymbol{I} - \gamma \boldsymbol{P}_\pi$ is invertible:

$$(\boldsymbol{I} - \gamma \boldsymbol{P}_\pi) \boldsymbol{v}_\pi = \boldsymbol{r}_\pi$$

$$\boldsymbol{v}_\pi = (\boldsymbol{I} - \gamma \boldsymbol{P}_\pi)^{-1} \boldsymbol{r}_\pi$$
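
Reusing `P`, `R`, `pi`, and `gamma` from the hypothetical MDP above, the closed form can be checked numerically; `np.linalg.solve` avoids forming the inverse explicitly:

```python
# [P_pi]_{ss'} = sum_a pi(a|s) p(s'|s,a)
P_pi = np.einsum('sa,sap->sp', pi, P)
# [r_pi]_s = sum_a pi(a|s) sum_s' p(s'|s,a) r(s,a,s')
r_pi = np.einsum('sa,sap,sap->s', pi, P, R)
# Solve (I - gamma P_pi) v = r_pi; invertible since gamma < 1.
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
assert np.allclose(v_closed, v)  # agrees with the fixed-point iterate
```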