Monte Carlo simulation: estimation based on empirical means (Law of Large Numbers):
$v(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)$
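The empirical-mean estimate can be sketched as below. This is a toy illustration, not part of the notes: `sample_return` is a hypothetical stand-in for running one episode from state $s$ and computing its return, and the Gaussian return distribution with mean 1.0 is an assumption for demonstration.

```python
import random

def mc_value_estimate(sample_return, n_episodes, seed=0):
    """Estimate v(s) as the empirical mean of sampled returns G_i(s)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        # each call stands in for one episode's return G_i(s)
        total += sample_return(rng)
    return total / n_episodes

# Toy return distribution with true mean 1.0 (assumption for illustration);
# by the Law of Large Numbers the estimate approaches 1.0 as N grows.
est = mc_value_estimate(lambda rng: rng.gauss(1.0, 0.5), n_episodes=10_000)
```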
- Policy evaluation: $q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s,a)$
- Policy improvement: $\pi_{k+1}(s) = \arg\max_{\pi} \sum_a \pi(a \mid s)\, q_{\pi_k}(s,a) \;\Rightarrow\; \pi_{k+1}(a \mid s) = 1$ if $a = a_k^*(s) = \arg\max_a q_{\pi_k}(s,a)$
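The evaluation-plus-improvement step above can be sketched as follows. This is a minimal sketch under assumptions: episodes are lists of `(s, a, r)` tuples assumed to be generated under the current policy $\pi_k$, evaluation averages first-visit returns $g^{(i)}(s,a)$, and improvement picks the greedy action per state.

```python
from collections import defaultdict

def mc_policy_iteration_step(episodes, gamma=0.99):
    """One round of Monte Carlo policy evaluation + greedy improvement.

    Evaluation:  q(s,a) ~ empirical mean of first-visit returns g^(i)(s,a).
    Improvement: pi_{k+1}(s) = argmax_a q(s,a).
    """
    returns = defaultdict(list)
    for episode in episodes:
        # accumulate returns backward: G_t = r_t + gamma * G_{t+1}
        g = 0.0
        step_returns = []
        for (s, a, r) in reversed(episode):
            g = r + gamma * g
            step_returns.append((s, a, g))
        # record only the first visit of each (s, a) in forward order
        seen = set()
        for (s, a, g) in reversed(step_returns):
            if (s, a) not in seen:
                seen.add((s, a))
                returns[(s, a)].append(g)
    q = {sa: sum(gs) / len(gs) for sa, gs in returns.items()}
    # greedy improvement: deterministic policy picking the best-valued action
    policy = {}
    for (s, a) in q:
        if s not in policy or q[(s, a)] > q[(s, policy[s])]:
            policy[s] = a
    return q, policy
```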
Updating the estimates backward uses the data more efficiently and avoids repeated computation,
and every state-action pair (s, a) is guaranteed a chance to serve as an episode's starting point:
- Episode generation
- Policy evaluation and policy improvement: start from step $T-1$ of the episode and sweep backward
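The backward sweep can be sketched as below. The function name and the convention that `rewards[t]` is the reward following step $t$ are assumptions; the point is that one backward pass with $G_t = r_t + \gamma G_{t+1}$ computes every return in $O(T)$, whereas recomputing each $G_t$ forward from scratch costs $O(T^2)$.

```python
def backward_returns(rewards, gamma=0.99):
    """Compute all returns G_t of one episode in a single backward pass.

    Starting from step T-1, each return reuses the next one:
    G_t = rewards[t] + gamma * G_{t+1}, with G_T = 0.
    """
    g = 0.0
    returns = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```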
$\epsilon$-greedy policy, balancing exploitation and exploration:
$\pi(a \mid s) = \begin{cases} 1 - \frac{\epsilon}{|\mathcal{A}(s)|}\bigl(|\mathcal{A}(s)|-1\bigr), & a = \arg\max_{a'} Q(s,a') \\ \frac{\epsilon}{|\mathcal{A}(s)|}, & \text{otherwise} \end{cases}$
- $\epsilon = 0$: greedy policy → pure exploitation
- $\epsilon = 1$: uniform distribution over $\mathcal{A}(s)$ → pure exploration
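The case analysis above can be sketched as below. A minimal sketch with assumed names: `q_values` indexed by action, and probabilities assigned per the piecewise definition.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values.

    Greedy action gets 1 - (eps/|A|) * (|A|-1); each other action gets eps/|A|.
    """
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] = 1 - epsilon / n * (n - 1)
    return probs
```

With `epsilon=0` this reduces to the greedy policy (all mass on the argmax); with `epsilon=1` every action gets probability $1/|\mathcal{A}(s)|$, i.e. the uniform distribution.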