Interaction model between the agent and the environment:
- State $s \in \mathcal{S}$
- Action $a \in \mathcal{A}$
- Policy $\pi(a \mid s)$
State space:
$$\mathcal{S} = \{s_i\}_{i=1}^{n}$$
State transition probability:
$$P(s' \mid s, a)$$
Action space:
$$\mathcal{A}(s_i) = \{a_i\}_{i=1}^{n}$$
$\pi: \mathcal{S} \to \Delta(\mathcal{A})$:
- Deterministic policy: $\pi(s) = a$
- Stochastic policy: $\pi(a \mid s) = P(a_t = a \mid s_t = s)$
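The two policy types can be sketched directly in code. The toy states, actions, and probabilities below are illustrative assumptions, not from the notes:

```python
import random

# Hypothetical toy state and action spaces (illustrative only).
STATES = ["s1", "s2"]
ACTIONS = ["left", "right"]

# Deterministic policy pi(s) = a: a fixed mapping from states to actions.
det_policy = {"s1": "left", "s2": "right"}

# Stochastic policy pi(a|s) = P(a_t = a | s_t = s): a distribution over
# actions for each state; each row sums to 1.
stoch_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, s):
    """Draw an action a ~ pi(. | s) from a stochastic policy."""
    dist = policy[s]
    return random.choices(list(dist), weights=list(dist.values()))[0]
```

A deterministic policy is just the special case where one action gets probability 1.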
Reward signal:
$$r_t = R(s_t, a_t, s_{t+1})$$
Reward probability:
$$P(r \mid s, a)$$
A state-action-reward chain:
$$s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9$$
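The chain above can be replayed as a list of transitions and its rewards summed; the state and action labels are taken from the example:

```python
# The example chain as (state, action, reward, next_state) tuples.
chain = [
    ("s1", "a2", 0, "s2"),
    ("s2", "a3", 0, "s5"),
    ("s5", "a3", 0, "s8"),
    ("s8", "a2", 1, "s9"),
]

# Total (undiscounted) reward: only the final step s8 -> s9 pays out.
total_reward = sum(r for _, _, r, _ in chain)  # 1
```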
Cumulative reward (return) of a trajectory:
- Undiscounted return: $G_t = \sum_{k=0}^{\infty} r_{t+k+1}$
- Discounted return: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
:::[Discount Rate]
$\gamma \in [0, 1)$:
- $\gamma \to 0$: emphasizes immediate rewards
- $\gamma \to 1$: emphasizes long-term return
:::
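For a finite reward sequence the discounted return is a direct sum; the helper below is a sketch, applied to the rewards of the example chain:

```python
def discounted_return(rewards, gamma):
    """Finite-horizon version of G_t = sum_{k>=0} gamma^k * r_{t+k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Rewards along the example chain: three zero-reward steps, then r = 1.
rewards = [0, 0, 0, 1]

# With gamma = 0.9 the delayed reward is weighted by 0.9**3; with
# gamma -> 0 only the immediate reward matters, so the return is 0 here.
g = discounted_return(rewards, 0.9)
```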
An MDP is defined by the five-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
Markov property:
$$p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t)$$
$$p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(r_{t+1} \mid s_t, a_t)$$
Probability distribution over trajectories:
$$P(\tau \mid \pi) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
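For tabular distributions this factorization can be evaluated term by term; the dict-based two-state MDP below is a hypothetical example, not from the notes:

```python
def trajectory_prob(p_init, policy, transition, traj):
    """P(tau | pi) = p(s_1) * prod_t pi(a_t|s_t) * p(s_{t+1}|s_t, a_t)."""
    prob = p_init[traj[0][0]]  # p(s_1)
    for s, a, s_next in traj:
        prob *= policy[s][a] * transition[(s, a)][s_next]
    return prob

# Hypothetical tabular MDP (names and numbers are illustrative).
p_init = {"s1": 1.0}
policy = {"s1": {"a1": 0.5, "a2": 0.5}}
transition = {("s1", "a1"): {"s2": 0.9, "s1": 0.1}}

# p(s1) * pi(a1|s1) * p(s2|s1,a1) = 1.0 * 0.5 * 0.9
prob = trajectory_prob(p_init, policy, transition, [("s1", "a1", "s2")])
```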