
Bellman Optimality Equation

Policy

The optimal policy $\pi^*$ is defined by:

$$v_{\pi^*}(s) \geq v_\pi(s), \quad \forall s \in \mathcal{S}, \ \forall \pi$$

State Value

$$
\begin{aligned}
v^*(s) &= \max_\pi v_\pi(s) \\
&= \max_\pi \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q^*(s, a) \\
&= \max_\pi \sum_{a \in \mathcal{A}} \pi(a \mid s) \left[ \sum_{r \in \mathcal{R}} p(r \mid s, a)\, r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v^*(s') \right]
\end{aligned}
$$
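Because $\sum_{a} \pi(a \mid s)\, q^*(s, a)$ is a convex combination of the values $q^*(s, a)$, the maximum over $\pi$ is attained by a deterministic greedy policy, so the Bellman optimality equation (BOE) reduces to an elementwise maximization over actions. A minimal NumPy sketch of one such maximization step (the 3-state, 2-action MDP and all array names here are hypothetical, not from the text):

```python
import numpy as np

gamma = 0.9
n_s, n_a = 3, 2                                   # hypothetical toy MDP
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s'] = p(s'|s, a)
r = rng.standard_normal((n_s, n_a))               # r[s, a] = sum_r p(r|s, a) * r
v = np.zeros(n_s)                                 # current estimate of v*

# RHS of the BOE: the max over pi collapses to a max over actions.
q = r + gamma * P @ v                             # q[s, a]
v_new = q.max(axis=1)                             # v_new[s] = max_a q[s, a]
```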

Action Value

$$q^*(s, a) = \sum_{r \in \mathcal{R}} p(r \mid s, a)\, r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v^*(s')$$
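Combining the two equations above: once $v^*$ is known, $q^*$ follows directly, and by the convex-combination argument the optimal policy is greedy with respect to $q^*$:

$$v^*(s) = \max_{a \in \mathcal{A}} q^*(s, a), \qquad \pi^*(a \mid s) = \begin{cases} 1, & a = \arg\max_{a'} q^*(s, a') \\ 0, & \text{otherwise} \end{cases}$$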

Matrix-Vector Form

$$
\begin{aligned}
\boldsymbol{v}^* &= \max_\pi \left( \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v}^* \right) \\
&= \boldsymbol{r}_{\pi^*} + \gamma \boldsymbol{P}_{\pi^*} \boldsymbol{v}^*
\end{aligned}
$$

where $\pi^* = \arg\max_\pi \left( \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v}^* \right)$.
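Since $\gamma < 1$, the map $f(\boldsymbol{v}) = \max_\pi(\boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v})$ is a contraction, so iterating it converges to the unique fixed point $\boldsymbol{v}^*$; this is exactly value iteration. A self-contained sketch on the same hypothetical toy MDP as above:

```python
import numpy as np

gamma = 0.9
n_s, n_a = 3, 2                                   # hypothetical toy MDP
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s'] = p(s'|s, a)
r = rng.standard_normal((n_s, n_a))               # expected immediate reward r[s, a]

v = np.zeros(n_s)
while True:
    # v_{k+1} = max_pi (r_pi + gamma * P_pi @ v_k), elementwise over states
    v_new = (r + gamma * P @ v).max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:          # contraction => geometric convergence
        break
    v = v_new

q_star = r + gamma * P @ v                        # q*(s, a)
pi_star = q_star.argmax(axis=1)                   # greedy pi*(s) = argmax_a q*(s, a)
```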