Navigating the RLHF Landscape: From Policy Gradients to PPO, GAE, and DPO for LLM Alignment

https://huggingface.co/blog/NormalUhr/rlhf-pipeline

Basic Concepts

Algorithm evolution: policy gradient → rewards-to-go → advantage → GAE

1 Policy Gradient

The policy gradient is defined as:

$\nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta} \, \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ R(\tau) \right] \newline = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \cdot R(\tau) \right] \newline \approx \frac{1}{|\mathcal{D}|} \sum_{\tau\in\mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)\;\cdot R(\tau)$
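
A minimal sketch of this Monte-Carlo estimator in PyTorch, assuming a toy categorical policy; the network, the fake trajectory sampler, and the random rewards are placeholders for illustration, not code from the referenced blog:

```python
# REINFORCE-style policy-gradient sketch: each trajectory contributes
# sum_t grad log pi_theta(a_t|s_t) * R(tau), averaged over the batch D.
import torch
import torch.nn as nn

torch.manual_seed(0)

n_states, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(n_states, 32), nn.Tanh(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def sample_trajectory(T=16):
    """Roll out a placeholder trajectory: random states, actions drawn from the policy."""
    states = torch.randn(T, n_states)
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    log_probs = dist.log_prob(actions)   # log pi_theta(a_t | s_t)
    rewards = torch.randn(T)             # placeholder reward signal
    return log_probs, rewards

# Monte-Carlo estimate over |D| = 4 trajectories; the minus sign turns
# gradient ascent on J into gradient descent on the loss.
batch = [sample_trajectory() for _ in range(4)]
loss = -torch.stack([lp.sum() * r.sum() for lp, r in batch]).mean()

optimizer.zero_grad()
loss.backward()   # autograd produces the grad-log-prob terms
optimizer.step()
```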

2 Reducing Variance: Rewards-to-Go

$\nabla_{\theta} J(\pi_\theta)\newline\approx\frac{1}{|\mathcal D|}\sum_{\tau\in\mathcal D} \sum_{t=0}^{T} \nabla_{\theta}\log\pi_\theta(a_t \mid s_t)\;\cdot \underbrace{\sum_{k=0}^{T-t}\gamma^{\,k}\,r_{t+k}}_{G_t}$
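
A short sketch of the rewards-to-go term $G_t$ computed with a reverse-order scan; the `rewards_to_go` helper and the sample reward values are illustrative assumptions, not part of the original post:

```python
# G_t = sum_{k=0}^{T-t} gamma^k * r_{t+k}, built backwards so each step
# reuses the already-discounted suffix.
import torch

def rewards_to_go(rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    G = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(rewards_to_go(rewards))
# Each log pi_theta(a_t|s_t) is then weighted by G_t instead of the full-trajectory R(tau):
#   loss = -(log_probs * rewards_to_go(rewards)).sum()
```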

3 Reducing Variance: Introducing the Advantage