Average state value:
$$\bar{v}_\pi = \sum_{s\in\mathcal{S}} d_\pi(s)\, v_\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\Big]$$

$$\nabla_\theta \bar{v}_\pi = \sum_{s\in\mathcal{S}} d_\pi(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a\mid s,\theta)\, q_\pi(s,a)$$
Average reward:
$$\bar{r}_\pi = \sum_{s\in\mathcal{S}} d_\pi(s) \sum_{a\in\mathcal{A}} \pi(a\mid s)\, r(s,a) = \lim_{n\to\infty} \frac{1}{n}\, \mathbb{E}\Big[\sum_{k=1}^{n} R_{t+k}\Big] = (1-\gamma)\, \bar{v}_\pi$$

$$\nabla_\theta \bar{r}_\pi = \sum_{s\in\mathcal{S}} d_\pi(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a\mid s,\theta)\, q_\pi(s,a)$$
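As a sanity check of the relation $\bar{r}_\pi = (1-\gamma)\,\bar{v}_\pi$, here is a minimal numerical sketch on a made-up two-state MDP (the transition matrix and rewards below are illustrative assumptions, not from the source):

```python
import numpy as np

# Sanity check of  r̄_π = (1 - γ) v̄_π  on a made-up two-state MDP.
# P_pi and r_pi (transition matrix and expected one-step rewards under a
# fixed policy π) are arbitrary illustrative numbers.
gamma = 0.9
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])          # P_pi[s, s'] = P(s' | s) under π
r_pi = np.array([1.0, -0.5])           # r_pi[s] = Σ_a π(a|s) r(s, a)

# v_π solves the Bellman equation  v = r + γ P v
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# d_π: stationary distribution, i.e. the left eigenvector of P_pi for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
d_pi /= d_pi.sum()

v_bar = d_pi @ v_pi                    # average state value  v̄_π
r_bar = d_pi @ r_pi                    # average reward       r̄_π
assert np.isclose((1 - gamma) * v_bar, r_bar)
```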
Use gradient ascent to maximize the long-term reward (learn from rewards and mistakes):

$$\theta^\ast = \arg\max_\theta \bar{R}_\theta = \arg\max_\theta \sum_{\tau} P(\tau \mid \theta)\, R(\tau)$$
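The Monte Carlo update below follows from the standard log-derivative trick $\nabla_\theta P = P\, \nabla_\theta \ln P$, which rewrites the gradient of this objective as an expectation that can be sampled from rollouts:

$$\nabla_\theta \bar{R}_\theta = \sum_{\tau} \nabla_\theta P(\tau \mid \theta)\, R(\tau) = \sum_{\tau} P(\tau \mid \theta)\, \nabla_\theta \ln P(\tau \mid \theta)\, R(\tau) = \mathbb{E}_{\tau \sim P(\cdot \mid \theta)}\big[\nabla_\theta \ln P(\tau \mid \theta)\, R(\tau)\big]$$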
Policy gradient by Monte Carlo:
$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha\, \mathbb{E}_{S\sim d,\, A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid S,\theta)\, q_\pi(S,A)\big] \\
&\approx \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t\mid s_t,\theta_t)\, q_\pi(s_t,a_t) \\
&= \theta_t + \alpha\, \underbrace{\frac{q_\pi(s_t,a_t)}{\pi(a_t\mid s_t,\theta_t)}}_{\beta_t}\, \nabla_\theta \pi(a_t\mid s_t,\theta_t)
\end{aligned}
$$

where the expectation is approximated by a single sampled $(s_t, a_t)$. The coefficient $\beta_t$ balances two effects (implementation sketch after the list):
- $\beta_t \propto q_\pi(s_t, a_t)$: exploitation (actions with larger estimated value get a larger probability increase)
- $\beta_t \propto \frac{1}{\pi(a_t \mid s_t, \theta_t)}$: exploration (actions currently chosen with low probability receive larger updates)
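A minimal REINFORCE-style sketch of this sampled update, assuming a tabular softmax policy; the gym-style interface `env.reset() -> s`, `env.step(a) -> (s_next, r, done)` and the hyperparameters are assumptions, not from the source:

```python
import numpy as np

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99):
    """One episode of Monte Carlo policy gradient (REINFORCE).

    theta[s, a] holds the logits of a tabular softmax policy π(a|s, θ).
    `env` is assumed gym-style: reset() -> s, step(a) -> (s_next, r, done).
    """
    def softmax(logits):
        p = np.exp(logits - logits.max())
        return p / p.sum()

    # 1) Roll out one episode under the current policy.
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax(theta[s])
        a = np.random.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # 2) Walk backwards, estimating q_π(s_t, a_t) by the return G_t,
    #    then apply  θ ← θ + α ∇_θ ln π(a_t|s_t, θ) G_t.
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        s_t, a_t = states[t], actions[t]
        grad_ln_pi = -softmax(theta[s_t])   # ∇ ln π w.r.t. the logits theta[s_t]
        grad_ln_pi[a_t] += 1.0              # ... equals  e_{a_t} - π(·|s_t)
        theta[s_t] += alpha * g * grad_ln_pi
    return theta
```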
Variance reduction:
$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t\mid s_t,\theta_t)\,\big(q_\pi(s_t,a_t) - b(s_t)\big)$$

with the common choice $b(s_t) = v_\pi(s_t)$:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t\mid s_t,\theta_t)\,\big(q_\pi(s_t,a_t) - v_\pi(s_t)\big)$$

Subtracting a state-dependent baseline does not bias the gradient, since $\mathbb{E}_{A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid s,\theta)\big]\, b(s) = \nabla_\theta \Big(\sum_a \pi(a\mid s,\theta)\Big)\, b(s) = 0$.
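A sketch of the corresponding change to the update step in the REINFORCE code above, assuming a hypothetical per-state table `V` that tracks a running average of returns as the baseline $b(s)$:

```python
import numpy as np

def update_with_baseline(theta, V, s_t, a_t, g, alpha=0.01, beta=0.1):
    """Replace the return G_t by the advantage estimate  G_t - V[s_t].

    V is a hypothetical per-state table approximating v_π(s), learned here as
    a running average of observed returns (an assumption, not from the source).
    """
    probs = np.exp(theta[s_t] - theta[s_t].max())
    probs /= probs.sum()
    grad_ln_pi = -probs
    grad_ln_pi[a_t] += 1.0                 # ∇ ln π(a_t|s_t) for softmax logits
    delta = g - V[s_t]                     # advantage estimate  q̂(s_t,a_t) - b(s_t)
    V[s_t] += beta * delta                 # move the baseline toward observed returns
    theta[s_t] += alpha * delta * grad_ln_pi
    return theta, V
```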