Tip

Group Relative Policy Optimization (GRPO) was first introduced in DeepSeekMath (February 2024) but received much wider recognition after DeepSeek R1’s success.

In contrast to methods like PPO, GRPO foregoes the critic model, which is typically the same size as the policy model, and instead estimates the baseline from group scores. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{\mathrm{old}}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)\right]
\frac{1}{G}\sum_{i=1}^{G}\left[\min\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_i\right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]\right]
$$
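As a concrete reading of the clipped-surrogate part of this objective, here is a minimal sketch in PyTorch. It assumes per-output log-probabilities (already summed over tokens) under the current and old policies; the function name, tensor shapes, and default $\varepsilon$ are illustrative choices, not from the paper. The $\beta$-weighted KL penalty, discussed next, would be subtracted from this term.

```python
import torch

def clipped_surrogate(logp, logp_old, advantages, eps=0.2):
    """Clipped-surrogate part of the GRPO objective for one group (sketch).

    logp, logp_old: log pi_theta(o_i|q) and log pi_theta_old(o_i|q),
    one scalar per sampled output in the group (shape [G]).
    advantages: group-relative advantages A_i (shape [G]).
    """
    # Importance ratio pi_theta(o_i|q) / pi_theta_old(o_i|q), computed in log space.
    ratio = torch.exp(logp - logp_old)
    # min(r * A_i, clip(r, 1 - eps, 1 + eps) * A_i), averaged over the group (the 1/G sum).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.minimum(unclipped, clipped).mean()
```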

In the DeepSeek series, the KL divergence is approximated by the following unbiased estimator (Schulman, 2020):

$$
\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] = \frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log\frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1,
$$

which is guaranteed to be non-negative, since it has the form $x - \log x - 1 \ge 0$ for $x > 0$.
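A minimal sketch of this estimator, again assuming per-output log-probabilities under $\pi_\theta$ and $\pi_{\mathrm{ref}}$; the names are illustrative:

```python
import torch

def kl_estimate(logp, logp_ref):
    """Unbiased, non-negative KL estimator: r - log(r) - 1 with
    r = pi_ref(o_i|q) / pi_theta(o_i|q), computed from log-probabilities."""
    log_ratio = logp_ref - logp  # log(pi_ref / pi_theta), one value per output
    return torch.exp(log_ratio) - log_ratio - 1.0
```

Under these assumptions, the per-group objective above would be assembled as `clipped_surrogate(logp, logp_old, advantages) - beta * kl_estimate(logp, logp_ref).mean()` and maximized with respect to $\theta$.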

$A_i$ is the advantage, derived from the rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \ldots, r_G\})}{\operatorname{std}(\{r_1, r_2, \ldots, r_G\})}.
$$
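A minimal sketch of this group-relative normalization, given one scalar reward per sampled output; the small epsilon in the denominator is a common implementation guard against zero-variance groups, not part of the formula:

```python
import torch

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: A_i = (r_i - mean(r)) / std(r)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of G = 4 sampled outputs scored with a binary reward.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_advantages(rewards))  # outputs above the group mean get positive advantage
```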