Posts
All the articles I've posted.
-
From GRPO to GSPO: Group-Based Policy Optimization for LLMs
A complete walkthrough of Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO) — the policy-gradient algorithms behind DeepSeek-R1 and Qwen3. Full math, the failure mode that motivated GSPO, the MoE story, and a side-by-side comparison.
-
GRPO and Dr.GRPO: The Math, the Biases, and the Fix
An end-to-end derivation of Group Relative Policy Optimization (GRPO) from DeepSeekMath and the Dr.GRPO correction from Liu et al. Covers the full objective, the gradient, the two biases (length and question difficulty), the unbiased fix, and the practical recipe behind R1-Zero–style training.
-
Training Composer 2: How Cursor Builds a Coding Agent Model
A structured walkthrough of Sasha Rush's Training Composer 2 workshop: why Cursor chose Kimi K2.5, how continued pretraining and long-horizon RL fit together, what CursorBench measures, and where Composer is headed.
-
Leetcode Problems
Leetcode grinding.