Tag: post-training
All the articles with the tag "post-training".
- GRPO and Dr.GRPO: The Math, the Biases, and the Fix
An end-to-end derivation of Group Relative Policy Optimization (GRPO) from DeepSeekMath and the Dr.GRPO correction from Liu et al. Covers the full objective, the gradient, the two biases (length and question difficulty), the unbiased fix, and the practical recipe behind R1-Zero–style training.