# Markov Decision Process

The interaction model between an agent and its environment:

- State: $s \in \mathcal{S}$
- Action: $a \in \mathcal{A}$
- Policy: $\pi(a | s)$
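A minimal sketch of this interaction loop, assuming a hypothetical three-state chain environment and a uniform random policy (all names below are made up for illustration):

```python
import random

class ChainEnv:
    """Hypothetical chain MDP: states 0..2; action 0 = stay, 1 = move right."""

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        s_next = min(self.s + a, 2)          # deterministic transition for simplicity
        r = 1.0 if s_next == 2 else 0.0      # reward signal r_t
        done = s_next == 2
        self.s = s_next
        return s_next, r, done

def random_policy(s):
    return random.choice([0, 1])             # pi(a | s): uniform over both actions

env = ChainEnv()
s, done = env.reset(), False
while not done:
    a = random_policy(s)                     # agent samples an action from its policy
    s, r, done = env.step(a)                 # environment returns next state and reward
    print(f"a={a}, s'={s}, r={r}")
```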

## State

### Space

State space:

$$\mathcal{S} = \{s_i\}_{i=1}^{n}$$

### Transition

State transition probability:

$$P(s' | s, a)$$
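For a finite MDP the transition model can be stored as a lookup table; a sketch with hypothetical states, actions, and probabilities:

```python
import random

# P(s' | s, a) as a table: (state, action) -> {next_state: probability}
P = {
    ("s1", "a1"): {"s1": 0.7, "s2": 0.3},
    ("s1", "a2"): {"s2": 1.0},
    ("s2", "a1"): {"s1": 0.4, "s2": 0.6},
}

def sample_next_state(s, a):
    next_states, probs = zip(*P[(s, a)].items())
    return random.choices(next_states, weights=probs, k=1)[0]

# Each conditional distribution must sum to 1
assert abs(sum(P[("s1", "a1")].values()) - 1.0) < 1e-9
print(sample_next_state("s1", "a1"))
```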

## Action

Action space:

$$\mathcal{A}(s_i) = \{a_i\}_{i=1}^{n}$$

## Policy

$\pi: \mathcal{S} \to \Delta(\mathcal{A})$:

- Deterministic policy: $\pi(s) = a$
- Stochastic policy: $\pi(a | s) = P(a_t = a | s_t = s)$
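A sketch of both policy types over hypothetical states and actions: a deterministic policy is a plain lookup table, while a stochastic policy stores a distribution per state and is sampled from:

```python
import random

# Deterministic policy pi(s) = a: one fixed action per state
pi_det = {"s1": "a2", "s2": "a1"}

# Stochastic policy pi(a | s): a distribution over actions for each state
pi_stoch = {
    "s1": {"a1": 0.8, "a2": 0.2},
    "s2": {"a1": 0.5, "a2": 0.5},
}

def act(s):
    actions, probs = zip(*pi_stoch[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(pi_det["s1"])   # always a2
print(act("s1"))      # a1 with probability 0.8, a2 with probability 0.2
```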

## Reward

### Signal

Reward signal:

$$r_t = \mathcal{R}(s_t, a_t, s_{t+1})$$

### Probability

Reward probability:

$$P(r | s, a)$$

## Trajectory

A state-action-reward chain:

$$s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9$$
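A rollout sketch that samples such a chain from hypothetical tabular models of $\pi(a | s)$, $P(s' | s, a)$, and $\mathcal{R}(s, a, s')$ (all states, actions, and probabilities below are assumptions):

```python
import random

pi = {"s1": {"a1": 0.5, "a2": 0.5}, "s2": {"a1": 1.0}}             # pi(a | s)
P  = {("s1", "a1"): {"s1": 1.0},
      ("s1", "a2"): {"s2": 1.0},
      ("s2", "a1"): {"s3": 1.0}}                                   # P(s' | s, a)
R  = {("s2", "a1", "s3"): 1.0}                                     # r = R(s, a, s'); default 0

def sample(dist):
    keys, probs = zip(*dist.items())
    return random.choices(keys, weights=probs, k=1)[0]

s, trajectory = "s1", []
while s != "s3":                               # "s3" is the terminal state here
    a = sample(pi[s])
    s_next = sample(P[(s, a)])
    r = R.get((s, a, s_next), 0.0)
    trajectory.append((s, a, r))               # one link of the state-action-reward chain
    s = s_next
print(trajectory)  # e.g. [('s1', 'a2', 0.0), ('s2', 'a1', 1.0)]
```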

## Return

The cumulative reward along a trajectory:

- Undiscounted return: $G_t = \sum\limits_{k=0}^{\infty} r_{t+k+1}$
- Discounted return: $G_t = \sum\limits_{k=0}^{\infty} \gamma^k r_{t+k+1}$

:::[Discount Rate]

$\gamma \in [0, 1)$:

- $\gamma \to 0$: emphasizes immediate rewards
- $\gamma \to 1$: emphasizes long-term returns

:::
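A sketch of computing the discounted return over a finite reward sequence by folding from the end, via the recursion $G_t = r_{t+1} + \gamma G_{t+1}$ (the reward sequence is hypothetical, matching the chain above where only the final step pays off):

```python
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):    # fold backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.0))   # 0.0   -> only the immediate reward counts
print(discounted_return(rewards, gamma=0.9))   # 0.729 -> 0.9**3 * 1.0
print(discounted_return(rewards, gamma=1.0))   # 1.0   -> undiscounted (finite horizon only)
```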

## MDP

### Definition

Defined as the 5-tuple $(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$

### Property

Markov property:

$$p(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = p(s_{t+1} | s_t, a_t)$$

$$p(r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = p(r_{t+1} | s_t, a_t)$$

### Distribution

Probability distribution over trajectories:

$$P(\tau | \pi) = p(s_1) \prod_{t=1}^{T} \pi(a_t | s_t) \, p(s_{t+1} | s_t, a_t)$$
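A sketch of evaluating this product for one fixed trajectory under hypothetical tabular models (all names and probabilities below are assumptions):

```python
p0 = {"s1": 1.0}                                             # initial distribution p(s1)
pi = {"s1": {"a2": 0.5}, "s2": {"a1": 1.0}}                  # pi(a | s)
P  = {("s1", "a2"): {"s2": 0.8}, ("s2", "a1"): {"s3": 1.0}}  # P(s' | s, a)

def trajectory_prob(states, actions):
    prob = p0[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= pi[s][a] * P[(s, a)][s_next]   # one factor pi(a_t|s_t) * P(s_{t+1}|s_t,a_t)
    return prob

print(trajectory_prob(["s1", "s2", "s3"], ["a2", "a1"]))  # 1.0 * (0.5*0.8) * (1.0*1.0) = 0.4
```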