Navigating the RLHF Landscape: From Policy Gradients to PPO, GAE, and DPO for LLM Alignment
The difference between on-policy and off-policy: on-policy methods learn from rollouts generated by the current policy being optimized, while off-policy methods can reuse data produced by a different or older policy.
Key components in RL
Actor (Policy Model), Critic (Value Model), Reward Model (or a rule-based reward), Reference Model
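As a concrete reference, here is a minimal sketch of how these four models might be instantiated, assuming PyTorch and Hugging Face `transformers`. The `RLHFModels` container, the `load_models` helper, and the use of a one-label sequence-classification head as a stand-in for the value and reward heads are illustrative assumptions, not any particular library's API.

```python
# A minimal sketch of the four models used in PPO-style RLHF.
# Model/helper names and the value-head choice are illustrative assumptions.
from dataclasses import dataclass

import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification


@dataclass
class RLHFModels:
    actor: torch.nn.Module          # policy model, updated by PPO
    critic: torch.nn.Module         # value model, estimates V(s)
    reward_model: torch.nn.Module   # scores responses (or replace with a rule-based reward)
    reference: torch.nn.Module      # frozen copy of the initial policy, used for the KL penalty


def load_models(policy_name: str, reward_name: str) -> RLHFModels:
    actor = AutoModelForCausalLM.from_pretrained(policy_name)
    reference = AutoModelForCausalLM.from_pretrained(policy_name)
    reference.requires_grad_(False)  # the reference model is never updated
    # In practice the critic is usually the policy backbone plus a scalar value head;
    # a sequence-classification head with one label is a simple stand-in here.
    critic = AutoModelForSequenceClassification.from_pretrained(policy_name, num_labels=1)
    reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name, num_labels=1)
    reward_model.requires_grad_(False)
    return RLHFModels(actor, critic, reward_model, reference)
```

During PPO training only the actor and critic receive gradient updates; the reward and reference models stay frozen.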

Variable definitions

| $\pi_{\theta}$ | $a_t$ | $s_t$ | $r(s_t,a_t)$, also written $r_t$ | $\tau=\{s_t,a_t,r_t\}^T_{t=0}$ | $\gamma$ |
|---|---|---|---|---|---|
| policy parameterized by weights $\theta$ | action at time $t$ | state at time $t$ | single-step reward | one rollout trajectory | discount factor |

| $V^\pi(s)$ | $Q^\pi(s,a)$ | $R(\tau)=\sum_{t=0}^{\infty} \gamma^t r_t$ | $J(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\!\bigl[R(\tau)\bigr]$ | $\nabla_\theta J(\pi_{\theta})$ | $\mathcal{D}$ |
|---|---|---|---|---|---|
| state-value function | action-value function | discounted total return of trajectory $\tau$ | expected return over trajectories, the objective to maximize | policy gradient | dataset |
Definition of the policy gradient (the second line uses the log-derivative trick; the third approximates the expectation with a Monte-Carlo average over a batch of trajectories $\mathcal{D}$):
$\nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta} \, \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ R(\tau) \right] \newline = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \cdot R(\tau) \right] \newline \approx \frac{1}{|\mathcal{D}|} \sum_{\tau\in\mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t) \cdot R(\tau)$
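To make the Monte-Carlo estimate concrete, below is a minimal sketch of the vanilla REINFORCE loss whose gradient matches the last line above, assuming PyTorch; `policy_gradient_loss`, the `policy(states) -> logits` interface, and the trajectory dictionary keys are illustrative assumptions.

```python
# A minimal sketch of the vanilla (REINFORCE) policy-gradient estimate:
# loss = -(1/|D|) * sum_tau sum_t log pi_theta(a_t | s_t) * R(tau).
import torch


def policy_gradient_loss(policy, trajectories, gamma: float = 1.0) -> torch.Tensor:
    """trajectories: list of dicts with 'states' [T, ...], 'actions' [T] (int64), 'rewards' [T] (float)."""
    losses = []
    for traj in trajectories:
        logits = policy(traj["states"])                      # [T, n_actions]
        log_probs = torch.log_softmax(logits, dim=-1)        # log pi_theta(. | s_t)
        chosen = log_probs.gather(1, traj["actions"].unsqueeze(1)).squeeze(1)  # [T]
        # Discounted return of the whole trajectory, R(tau) = sum_t gamma^t r_t.
        discounts = gamma ** torch.arange(len(traj["rewards"]), dtype=torch.float32)
        R_tau = (discounts * traj["rewards"]).sum()
        # Every timestep in the trajectory is weighted by the same R(tau).
        losses.append(-(chosen * R_tau).sum())
    # Minimizing this loss performs gradient ascent on the estimate of J(theta).
    return torch.stack(losses).mean()
```

Note that every log-probability in a trajectory is weighted by the same full return $R(\tau)$, which is exactly the high-variance issue the next refinement addresses.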
Since the action $a_t$ cannot affect rewards earned before time $t$, $R(\tau)$ can be replaced by the reward-to-go $G_t$, which reduces variance without changing the expected gradient:
$\nabla_{\theta} J(\pi_\theta)\newline\approx\frac{1}{|\mathcal D|}\sum_{\tau\in\mathcal D} \sum_{t=0}^{T} \nabla_{\theta}\log\pi_\theta(a_t \mid s_t)\;\cdot \underbrace{\sum_{k=0}^{T-t}\gamma^{\,k}\,r_{t+k}}_{G_t}$
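A minimal sketch of computing $G_t$ and using it as the per-timestep weight, again assuming PyTorch; the function names are illustrative assumptions.

```python
# Reward-to-go weighting G_t = sum_k gamma^k r_{t+k}, replacing R(tau) in the estimator above.
import torch


def reward_to_go(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """rewards: [T] per-step rewards; returns [T] discounted reward-to-go G_t."""
    G = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # backward recursion: G_t = r_t + gamma * G_{t+1}
        G[t] = running
    return G


def reward_to_go_pg_loss(log_probs: torch.Tensor, rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """log_probs: [T] values of log pi_theta(a_t | s_t) for one trajectory."""
    G = reward_to_go(rewards, gamma)
    # Each log-prob is now weighted only by rewards from its own timestep onward.
    return -(log_probs * G).sum()
```

Because $G_t$ is built by the backward recursion $G_t = r_t + \gamma G_{t+1}$, it costs $O(T)$ per trajectory rather than recomputing the discounted sum at every timestep.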