
Value Iteration and Policy Iteration

| Method | Policy evaluation | Policy improvement | Characteristics |
| --- | --- | --- | --- |
| Value iteration | Single-step update | Implicit improvement | Computationally efficient |
| Policy iteration | Exact solution | Greedy improvement | Guaranteed improvement each round |
| Truncated policy iteration | Finite-step iteration | Greedy improvement | Flexible trade-off |

Value Iteration

A single evaluation iteration per step $k$ ($j = 1$):

  1. Policy update: $\pi_{k+1} = \arg\max_{\pi}(r_\pi + \gamma P_\pi v_k)$
  2. Value update: $v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$
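
A minimal NumPy sketch of this procedure, assuming a tabular MDP stored as a transition tensor `P` with shape `(S, A, S)` and an expected-reward matrix `r` with shape `(S, A)`; these names and the data layout are illustrative assumptions, not from the original notes:

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Value iteration: one Bellman-optimality sweep per outer step k.

    P : (S, A, S) array, P[s, a, s'] = p(s' | s, a)  -- assumed layout
    r : (S, A) array of expected rewards
    """
    S, A = r.shape
    v = np.zeros(S)                      # v_0
    while True:
        # q_k(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s,a) * v_k(s')
        q = r + gamma * (P @ v)          # shape (S, A)
        # Taking the max over actions is equivalent to the greedy policy
        # update followed by the value update with pi_{k+1}.
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    pi = q.argmax(axis=1)                # greedy policy from the final q
    return v, pi
```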

Policy Iteration

Evaluation is iterated to convergence at each step $k$ ($j \to \infty$ iterations):

  1. Policy evaluation: $v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \dots$
  2. Policy improvement: $\pi_{k+1} = \arg\max_{\pi}(r_\pi + \gamma P_\pi v_{\pi_k})$
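
Under the same assumed layout, a sketch of policy iteration; the inner loop is exactly the evaluation recursion above, run until it converges to tolerance rather than for a fixed number of sweeps:

```python
def policy_iteration(P, r, gamma=0.9, eval_tol=1e-10):
    """Policy iteration: evaluate exactly (to tolerance), then improve greedily."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)          # arbitrary initial policy pi_0
    v = np.zeros(S)
    while True:
        P_pi = P[np.arange(S), pi]       # (S, S): transition rows chosen by pi
        r_pi = r[np.arange(S), pi]       # (S,):  rewards chosen by pi
        # Policy evaluation: v^{(j+1)} = r_pi + gamma * P_pi v^{(j)}, j -> infinity
        while True:
            v_next = r_pi + gamma * (P_pi @ v)
            done = np.max(np.abs(v_next - v)) < eval_tol
            v = v_next
            if done:
                break
        # Policy improvement: pi_{k+1} = argmax_pi (r_pi + gamma * P_pi v_{pi_k})
        q = r + gamma * (P @ v)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):   # policy stable => optimal
            return v, pi
        pi = pi_new
```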

Truncated Policy Iteration

Generalized Policy Iteration (GPI):

  • Do not solve policy evaluation exactly; iterate only a finite number $j$ of steps
  • Value iteration ($j = 1$) and policy iteration ($j \to \infty$) are special cases
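
A sketch of the truncated variant under the same assumptions; the only change from policy iteration is that evaluation runs a fixed number of sweeps (the parameter name `j_trunc` is mine). `j_trunc = 1` recovers value iteration, and a large `j_trunc` approaches policy iteration:

```python
def truncated_policy_iteration(P, r, gamma=0.9, j_trunc=5, n_outer=200):
    """Generalized/truncated policy iteration: finite-j evaluation sweeps."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)
    v = np.zeros(S)
    for _ in range(n_outer):
        P_pi = P[np.arange(S), pi]
        r_pi = r[np.arange(S), pi]
        for _ in range(j_trunc):         # truncated evaluation: j = 0 .. j_trunc-1
            v = r_pi + gamma * (P_pi @ v)
        q = r + gamma * (P @ v)          # greedy improvement
        pi = q.argmax(axis=1)
    return v, pi
```

Intermediate values of `j_trunc` trade the cost of each evaluation against the number of outer improvement rounds needed, which is the flexible balance noted in the comparison table above.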
