Reasoning-model RL post-training has, in two years, converged on a tiny family of policy-gradient algorithms. PPO sat at the top in the InstructGPT / ChatGPT era. GRPO (Group Relative Policy Optimization), introduced in DeepSeekMath (arXiv:2402.03300) and famously used to train DeepSeek-R1 (arXiv:2501.12948), replaced PPO as the de-facto default — it dropped the value network, simplified the loss, and made long-form reasoning RL practical. Then in 2025 the Qwen team published GSPO (Group Sequence Policy Optimization) (arXiv:2507.18071) used to train the Qwen3 series, arguing that GRPO’s token-level importance ratio is fundamentally wrong for sequence-level reward and is the reason large MoE runs keep blowing up.
This post is a complete mathematical walkthrough of both algorithms, the failure mode that motivated the move from GRPO to GSPO, and what changes (and what doesn’t) when you swap one for the other.
Table of contents
- Preliminaries: importance sampling and PPO
- Part 1 — Group Relative Policy Optimization (GRPO)
- Part 2 — Why GRPO breaks (especially on MoE)
- Part 3 — Group Sequence Policy Optimization (GSPO)
- Part 4 — GRPO vs GSPO, side by side
- Part 5 — When to use which
- Part 6 — A worked example
- Part 7 — Subtleties and gotchas
- Closing
Preliminaries: importance sampling and PPO
To make the GRPO / GSPO story self-contained, it helps to ground both in the policy gradient + importance sampling lineage they descend from.
Policy gradient with off-policy data
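We want to maximize the expected reward of a stochastic policy $\pi_\theta$:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right].$$
The score-function gradient is $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\right]$. In LLM RL the trajectory is the response $y = (y_1, \dots, y_T)$ generated for a prompt $x$, and the per-step decomposition reads
$$\nabla_\theta J(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\left[\sum_{t=1}^{T} \hat{A}_t\, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\right],$$
where $\hat{A}_t$ is some advantage estimate at token $t$.
Running this strictly on-policy (generate a batch, take one gradient step, throw it away, regenerate) is wasteful: LLM rollouts are the most expensive part of the pipeline. So we collect data with a behaviour policy $\pi_{\theta_{\text{old}}}$ (usually a frozen snapshot of $\pi_\theta$ from a few steps ago) and reuse it for several gradient updates. The bias correction is importance sampling:
$$\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[f(y)\right] = \mathbb{E}_{y \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\text{old}}}(y \mid x)}\, f(y)\right].$$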
The IS-weighted estimate is unbiased in expectation but high-variance whenever the two distributions disagree. The whole design space of PPO-style methods is about taming that variance without giving up too much of the unbiasedness.
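The unbiased-but-noisy behaviour can be checked exactly on a toy distribution. The sketch below enumerates a three-outcome distribution instead of sampling; the probabilities and rewards are illustrative assumptions, not values from any paper.

```python
# Exact importance-sampling check on a 3-outcome toy distribution.
# All probabilities and rewards below are illustrative assumptions.
p_new = [0.5, 0.3, 0.2]   # target policy we want the expectation under
p_old = [0.2, 0.3, 0.5]   # behaviour policy that generated the data
reward = [1.0, 0.0, 2.0]

# Expectation of the reward under the target policy (on-policy).
on_policy = sum(p * r for p, r in zip(p_new, reward))

# Expectation of (ratio * reward) under the behaviour policy (off-policy).
weighted = [pn / po * r for pn, po, r in zip(p_new, p_old, reward)]
off_policy = sum(po * w for po, w in zip(p_old, weighted))

# Variance of a single-sample estimate under each sampling scheme.
var_on = sum(p * (r - on_policy) ** 2 for p, r in zip(p_new, reward))
var_off = sum(po * (w - off_policy) ** 2 for po, w in zip(p_old, weighted))

print("means:", on_policy, off_policy)
print("single-sample variances:", var_on, var_off)
```

The two means agree, while the single-sample variance of the reweighted estimate is larger in this example because the behaviour policy rarely samples the outcome the target policy prefers.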
PPO’s clipped surrogate
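PPO (Schulman et al., 2017) defines a per-token IS ratio
$$w_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}$$
and optimizes the clipped surrogate
$$J_{\text{PPO}}(\theta) = \mathbb{E}_{x,\, y \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{T}\sum_{t=1}^{T} \min\left(w_t(\theta)\, \hat{A}_t,\ \text{clip}\left(w_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_t\right)\right].$$
The clip caps the optimization signal from any token whose ratio drifts too far from 1, which is PPO’s variance-control mechanism. $\hat{A}_t$ is typically Generalized Advantage Estimation (GAE) computed with a learned value function $V_\phi$:
$$\hat{A}_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t),$$
where $s_t = (x, y_{<t})$ is the state before token $t$ and $r_t$ is the per-token reward.
In LLM RL, the reward is almost always sparse: zero on every token except the last, where the reward model (or a verifier) produces a scalar. So $\hat{A}_t$ is driven almost entirely by the bootstrapped value estimates, and the value network is doing the heavy lifting.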
That value network is exactly what GRPO removes.
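To make the dependence on the critic concrete, here is a minimal GAE sketch for one finished response with a single terminal reward. The function name `gae`, the two hand-written critics, and the choice $\gamma = 1$, $\lambda = 0.95$ are assumptions for illustration.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finished response.

    rewards[t] is the reward received after token t, values[t] is the
    critic's estimate before token t. The value after the final token is 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse reward: only the last token is scored.
rewards = [0.0, 0.0, 0.0, 1.0]

# A critic that already predicts the outcome: every advantage vanishes.
adv_perfect = gae(rewards, values=[1.0, 1.0, 1.0, 1.0])

# A critic that predicts nothing: the terminal reward decays backwards.
adv_blind = gae(rewards, values=[0.0, 0.0, 0.0, 0.0])

print(adv_perfect)
print(adv_blind)
```

With the same rewards, the two critics produce completely different advantages, which is the sense in which the value network determines the learning signal.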
Part 1 — Group Relative Policy Optimization (GRPO)
GRPO was introduced as a tweak to PPO to make math RL tractable: instead of training a value function whose target is a tiny reward signal at the end of long reasoning traces, generate a group of completions for the same prompt and use their reward statistics to define a baseline. The value function disappears; only the policy and a frozen reference model remain.
The objective
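For each prompt $x$, sample a group of $G$ responses $\{y_1, \dots, y_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$. Score each with a reward function $r(x, y_i)$. In DeepSeekMath, $r$ is a learned reward model; in DeepSeek-R1, $r$ is a rule-based verifier that returns 1 if the boxed answer matches the ground truth and 0 otherwise.
Define the per-token IS ratio exactly as PPO:
$$w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}.$$
Compute a group-normalized advantage shared by every token in response $y_i$:
$$\hat{A}_{i,t} = \hat{A}_i = \frac{r(x, y_i) - \text{mean}\left(\{r(x, y_j)\}_{j=1}^{G}\right)}{\text{std}\left(\{r(x, y_j)\}_{j=1}^{G}\right)}.$$
Then maximize
$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{x,\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \left(\min\left(w_{i,t}(\theta)\, \hat{A}_i,\ \text{clip}\left(w_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_i\right) - \beta\, D_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\right)\right].$$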
Read off the design choices:
- No value network. The baseline is the group mean reward. The advantage is the same for every token in a response — no per-token credit assignment.
- Per-token IS ratio. Same as PPO. Each token has its own clip.
- KL is an explicit loss term, not added to the reward. The reference $\pi_{\text{ref}}$ is usually the SFT model from before RL began. $\beta$ is a small constant (DeepSeekMath reports $\beta = 0.04$).
- Group-then-token averaging. The outer sum is normalized by group size $G$; the inner sum is normalized by response length $|y_i|$. So short and long responses contribute equally per response, and every token within a response contributes equally per token.
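The design choices above fit in a few lines. This is a minimal sketch of the GRPO surrogate on plain Python lists, with the KL term omitted; the function names are hypothetical, and using the population standard deviation is an assumption (implementations differ on this detail).

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards):
    """Group-normalized advantage: one scalar per response."""
    mu, sigma = mean(rewards), pstdev(rewards)  # population std is an assumption
    return [(r - mu) / sigma for r in rewards]

def clipped_term(ratio, adv, eps):
    """PPO-style clipped surrogate for a single importance ratio."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)

def grpo_surrogate(logp_new, logp_old, rewards, eps=0.2):
    """Mean over the group of the per-response mean clipped token term.

    logp_new[i][t] and logp_old[i][t] are token log-probs of response i
    under the current and old policy. The KL penalty is left out.
    """
    advs = group_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advs):
        terms = [clipped_term(math.exp(a - b), adv, eps)
                 for a, b in zip(lp_new, lp_old)]
        total += sum(terms) / len(terms)   # 1/|y_i| inner average
    return total / len(rewards)            # 1/G outer average

rewards = [1.0, 0.0, 0.0, 1.0]
print(group_advantages(rewards))
```

Note how the clip is one-sided in practice: `clipped_term(1.5, 1.0, 0.2)` is capped at 1.2, while `clipped_term(0.5, 1.0, 0.2)` stays at the unclipped 0.5 because the `min` already prefers it.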
The KL term — k3 unbiased estimator
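The KL above is estimated with the k3 estimator described by John Schulman, which is unbiased and non-negative:
$$D_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \approx \frac{\pi_{\text{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - \log \frac{\pi_{\text{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - 1.$$
Why this and not the naive plug-in $\log \frac{\pi_\theta}{\pi_{\text{ref}}}$?
- The naive estimator $-\log \rho$ for $\rho = \pi_{\text{ref}} / \pi_\theta$ is unbiased for KL but can be negative, which makes the loss interpretation noisy.
- The k3 form $(\rho - 1) - \log \rho$, where $\rho = \pi_{\text{ref}} / \pi_\theta$, uses $\log \rho - (\rho - 1)$, which is non-positive everywhere (since $\log \rho \le \rho - 1$), so its negation is non-negative.
- It typically has lower variance: the added term $\rho - 1$ has zero mean under $\pi_\theta$ (so unbiasedness is preserved) and is negatively correlated with $-\log \rho$, so it acts as a control variate.
The gradient of this estimator flows only through $\pi_\theta$ (the reference is frozen). At initialization $\pi_\theta = \pi_{\text{ref}}$, so the KL term is zero; it grows as the policy drifts.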
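These properties can be verified exactly on a small categorical distribution. In the sketch below the two distributions are illustrative assumptions, and `k1` / `k3` follow the naming in Schulman's blog post.

```python
import math

# Illustrative next-token distributions over a 3-token vocabulary (assumed).
p_theta = [0.6, 0.3, 0.1]   # current policy, the sampling distribution
p_ref = [0.4, 0.4, 0.2]     # frozen reference

def k1(rho):
    """Naive estimator: -log(rho), with rho = p_ref / p_theta."""
    return -math.log(rho)

def k3(rho):
    """k3 estimator: (rho - 1) - log(rho)."""
    return (rho - 1.0) - math.log(rho)

rhos = [pr / pt for pr, pt in zip(p_ref, p_theta)]
exact_kl = sum(pt * math.log(pt / pr) for pt, pr in zip(p_theta, p_ref))

def moments(estimator):
    """Exact mean and variance of a per-sample estimator under p_theta."""
    values = [estimator(rho) for rho in rhos]
    mu = sum(pt * v for pt, v in zip(p_theta, values))
    var = sum(pt * (v - mu) ** 2 for pt, v in zip(p_theta, values))
    return values, mu, var

k1_values, k1_mean, k1_var = moments(k1)
k3_values, k3_mean, k3_var = moments(k3)

print("exact KL:", exact_kl)
print("k1 samples:", k1_values, "mean:", k1_mean, "var:", k1_var)
print("k3 samples:", k3_values, "mean:", k3_mean, "var:", k3_var)
```

Both estimators average to the exact KL, but only the k1 samples go negative, and in this example the k3 variance is far smaller.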
Where the advantage comes from — outcome vs process
DeepSeekMath actually presents three GRPO variants:
- Outcome supervision (GRPO-O). A single scalar reward per response, normalized within the group. Every token in $y_i$ gets the same $\hat{A}_i$. This is the form used in DeepSeek-R1.
- Process supervision (GRPO-P). A process reward model emits a reward at the end of each reasoning step (an “end-of-step” token). Normalize these step rewards across the group to get $\tilde{r}_i^{\,\text{index}(j)}$, where $\text{index}(j)$ is the end-token index of step $j$, then accumulate to each token: $\hat{A}_{i,t} = \sum_{\text{index}(j) \ge t} \tilde{r}_i^{\,\text{index}(j)}$.
- Iterative GRPO. Periodically re-train the reward model on samples from the current policy and reset $\pi_{\text{ref}}$ to the latest policy snapshot. Useful when the reward model’s distribution shifts a lot during RL.
In practice the most-used form by far is the outcome-supervised one, because the rule-based verifier in math/code RL gives a clean 0/1 signal that needs no process model.
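For comparison with the outcome form, the process-supervised accumulation can be sketched as follows. The function name, the 0-based token indices, and the population standard deviation are assumptions for illustration; the rule itself is "each token gets the sum of normalized step rewards for every step that ends at or after it".

```python
from statistics import mean, pstdev

def process_advantages(step_rewards, step_ends, lengths):
    """Per-token advantages from per-step rewards.

    step_rewards[i][j]: reward for step j of response i.
    step_ends[i][j]:    0-based index of the token that ends step j.
    lengths[i]:         number of tokens in response i.
    """
    # Normalize over every step reward in the whole group.
    flat = [r for rs in step_rewards for r in rs]
    mu, sigma = mean(flat), pstdev(flat)
    advantages = []
    for rs, ends, n in zip(step_rewards, step_ends, lengths):
        normed = [(r - mu) / sigma for r in rs]
        # Token t accumulates the rewards of all steps ending at or after t.
        adv = [sum(z for z, end in zip(normed, ends) if end >= t)
               for t in range(n)]
        advantages.append(adv)
    return advantages

advs = process_advantages(
    step_rewards=[[1.0, 0.0], [0.0, 1.0]],
    step_ends=[[1, 3], [0, 2]],
    lengths=[4, 3],
)
print(advs)
```

In the first response the good first step and the bad second step cancel for early tokens, while tokens belonging only to the second step carry the negative signal.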
Why drop the value network?
Three reasons, in roughly decreasing order of importance:
- The value target is unstable. For long reasoning traces (think 8k–32k tokens), the only non-zero reward is at the very end. A learned $V_\phi$ must regress this single scalar back through thousands of tokens, with credit assignment that is essentially a hope. Errors in $V_\phi$ feed directly into GAE as bias, and that bias gets worse as the response length $T$ grows.
- It’s a second model the size of the policy. That doubles activation memory during training, and for the giant frontier models it is a serious constraint. (Yes you can shrink the value head, but in PPO-for-LLMs the value model is usually as big as or close to the size of the policy backbone.)
- You can get a low-variance baseline almost for free by sampling $G$ responses per prompt and using the group mean. If rewards within a group have variance $\sigma_r^2$, the variance of the group mean is $\sigma_r^2 / G$, so even a modest group size removes most of the baseline noise, and no learned model is needed to produce it. (Strictly, a mean that includes response $i$’s own reward is a slightly biased baseline for response $i$, with the effect shrinking as $1/G$; a leave-one-out mean avoids it.)
The tradeoff you accept is no per-token credit assignment. Every token in a winning trajectory gets the same positive advantage and every token in a losing trajectory gets the same negative advantage, including the tokens that had nothing to do with why the trajectory was good or bad. GRPO’s bet is that the clipped importance ratio per token, applied to a uniform advantage, still concentrates updates on tokens that actually moved between $\pi_{\theta_{\text{old}}}$ and $\pi_\theta$, implicitly assigning credit through the ratio’s magnitude.
One subtle point: where the average lives
Different open-source GRPO implementations (TRL, OpenRLHF, veRL, DeepSpeed-Chat) disagree on the exact normalization. Writing $\ell_{i,t}$ for the per-token term (clipped surrogate minus KL penalty), the DeepSeekMath paper writes the objective as per-token averaged within a response, then averaged across the group:
$$\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \ell_{i,t}.$$
Some implementations average across all tokens in the batch instead:
$$\frac{1}{\sum_{i=1}^{G} |y_i|}\sum_{i=1}^{G} \sum_{t=1}^{|y_i|} \ell_{i,t}.$$
The two are equivalent up to a per-response weight that depends on $|y_i|$. The first form gives long and short responses equal weight at the response level; the second gives every token equal weight, which downweights short responses. This sounds like a footnote, but it is not. With long-response RL, where some completions run to 16k+ tokens and others terminate in 200, the choice shifts which kinds of behaviour get reinforced. (DAPO, an early GRPO variant from ByteDance, advocates for the token-level form for exactly this reason.)
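A two-response example shows how far the two normalizations can diverge. The per-token values below are made up for illustration, and the function names are hypothetical.

```python
def sample_level(per_token_terms):
    """Mean within each response first, then mean across responses."""
    per_response = [sum(terms) / len(terms) for terms in per_token_terms]
    return sum(per_response) / len(per_response)

def token_level(per_token_terms):
    """Single mean over every token in the batch."""
    total_tokens = sum(len(terms) for terms in per_token_terms)
    return sum(sum(terms) for terms in per_token_terms) / total_tokens

# Illustrative batch: a short response with positive per-token terms
# and a long response with negative per-token terms.
short = [1.0] * 2
long_ = [-1.0] * 8
batch = [short, long_]

print(sample_level(batch))  # the two responses cancel
print(token_level(batch))   # the long response dominates
```

Under the sample-level form each token of the short response carries weight $1/(2 \cdot 2)$ and each token of the long one $1/(2 \cdot 8)$; under the token-level form every token carries $1/10$.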
Practical knobs
The hyperparameters that actually matter in practice:
- Group size $G$. Larger groups give better baselines but cost more rollouts per prompt. As reference points, DeepSeekMath samples 64 outputs per question and DAPO samples 16 responses per prompt.
- Clip range $\varepsilon$. Same as PPO. $\varepsilon = 0.2$ is the inherited default, but some open recipes use asymmetric clips (DAPO uses $\varepsilon_{\text{low}} = 0.2$, $\varepsilon_{\text{high}} = 0.28$) to permit slightly more exploration on the upside.
- KL coefficient $\beta$. Small; DeepSeekMath reports $0.04$. Some recipes (DAPO, for example) set $\beta$ to zero (no KL anchor), letting the policy drift freely, which is viable because rule-based rewards are not gameable in the way an RM is.
- Old-policy refresh cadence. How many gradient steps to take per rollout batch before re-sampling. More steps mean more reuse of expensive rollouts but larger IS ratio drift. The GSPO paper, for example, splits each rollout batch into four minibatches for gradient updates.
What a GRPO step looks like end-to-end
For one outer iteration:
1. Sample. Pick a batch of prompts $\{x\}$. For each prompt, sample $G$ responses from $\pi_{\theta_{\text{old}}}$.
2. Score. Compute $r(x, y_i)$ for every response. Compute group-normalized advantages $\hat{A}_i$.
3. Forward. Run $\pi_\theta$ and $\pi_{\text{ref}}$ on every $y_i$ to get per-token log-probs.
4. Loss. Compute the clipped surrogate per token, add the k3 KL penalty per token, average per the chosen normalization scheme.
5. Backward, step. Update $\theta$. Repeat steps 3–5 for a few epochs over the same sampled batch.
6. Refresh. Set $\theta_{\text{old}} \leftarrow \theta$, go to step 1.
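The loop above can be run end to end on a toy problem. The sketch below is not an LLM trainer: it assumes a single prompt, single-token "responses" over three actions, a softmax policy with hand-written gradients, no KL term, and made-up hyperparameters. Only action 2 is rewarded.

```python
import math
import random
from statistics import mean, pstdev

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grpo_toy(steps=200, group_size=8, epochs=2, eps=0.2, lr=0.5, seed=0):
    """Toy GRPO on one prompt with one-token responses (illustrative only)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        # 1. Sample a group from the old policy.
        p_old = softmax(theta)
        actions = rng.choices(range(3), weights=p_old, k=group_size)
        # 2. Score and compute group-normalized advantages.
        rewards = [1.0 if a == 2 else 0.0 for a in actions]
        sigma = pstdev(rewards)
        if sigma == 0.0:
            continue  # zero-variance group carries no learning signal
        mu = mean(rewards)
        advs = [(r - mu) / sigma for r in rewards]
        for _ in range(epochs):
            # 3. Forward pass under the current policy.
            p_new = softmax(theta)
            grad = [0.0, 0.0, 0.0]
            for a, adv in zip(actions, advs):
                ratio = p_new[a] / p_old[a]
                # 4. Clipped surrogate: no gradient once the clip is active.
                if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                    continue
                for k in range(3):
                    dlogp = (1.0 if k == a else 0.0) - p_new[k]
                    grad[k] += adv * ratio * dlogp / group_size
            # 5. Gradient ascent step on the surrogate.
            theta = [t + lr * g for t, g in zip(theta, grad)]
        # 6. The next iteration resamples from the updated policy.
    return softmax(theta)

final_probs = grpo_toy()
print(final_probs)
```

Starting from a uniform policy, the probability of the rewarded action should end up well above its initial value of one third. Once almost every group is all-correct, most groups are skipped, which previews the zero-variance issue discussed under gotchas.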
That is GRPO in full. Now the failure mode that prompted GSPO.
Part 2 — Why GRPO breaks (especially on MoE)
The Qwen GSPO paper opens with a strong claim: GRPO’s per-token IS ratio is ill-posed for sequence-level rewards, and the resulting variance is so severe that large MoE RL runs become unstable without aggressive interventions like Routing Replay. The argument is worth reconstructing carefully because it generalizes beyond MoE.
The unit-of-importance mismatch
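The reward is computed once, at the sequence level: $r(x, y)$ is a function of the whole response. The natural quantity to importance-sample is therefore the sequence likelihood ratio:
$$\frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\text{old}}}(y \mid x)} = \prod_{t=1}^{|y|} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}.$$
A single sample weighted by this ratio is an unbiased estimate of $\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[r(x, y)\right]$ in the usual IS sense. GRPO does something different. It pulls the product apart and treats each token’s ratio as if it were correcting the sampling distribution of that token alone:
$$w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}.$$
This is not an IS-correction for the sequence-level reward. From the perspective of any individual token, $y_{i,t}$ is a single sample from the next-token distribution $\pi_{\theta_{\text{old}}}(\cdot \mid x, y_{i,<t})$, so $w_{i,t}(\theta)$ is a single-sample IS weight. The Qwen paper’s framing: per-token IS with one sample per token is noise rather than correction. It does not reduce the bias of any meaningful estimator; it injects variance.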
That variance is fine if the per-token ratios stay near 1 — clipping then keeps the per-token contribution bounded and the average is roughly the on-policy gradient. The trouble is that as response length grows and as the model gets larger, more and more tokens drift far enough from 1 to be clipped, and the clipped part of the loss has zero gradient. So the effective batch size of the policy gradient shrinks as training progresses.
The MoE-specific catastrophe: expert routing drift
This is where the failure becomes loud. In an MoE model, each token activates a subset of experts chosen by the router. After a single gradient step, the router changes, so the same $y_{i,t}$ may activate a different set of experts under $\pi_\theta$ than it did under $\pi_{\theta_{\text{old}}}$. Even if the underlying expert weights are nearly identical, the conditional probability $\pi_\theta(y_{i,t} \mid x, y_{i,<t})$ can change discontinuously because it now depends on a different set of expert outputs.
Empirically the Qwen team reports that for the 48-layer Qwen3-30B-A3B-Base model (8 of 128 experts active per token), after each RL gradient update roughly 10% of the experts activated for the same rollout sample differ from those activated under the old policy. For the affected tokens the per-token IS ratio is essentially random: it reflects an expert-routing flip, not a meaningful policy change.
If you keep GRPO’s per-token formulation, the loss becomes a sum of such random ratios, and on a long response some fraction of them produce clipping or extreme gradients. Training diverges.
The Qwen team’s earlier workaround was Routing Replay: cache the router decisions made by $\pi_{\theta_{\text{old}}}$ at rollout time, and force the same expert routes during the training forward pass under $\pi_\theta$. This keeps the per-token IS ratio meaningful because the same expert weights are being compared. But:
- It costs memory (you store the entire routing trace per token).
- It costs flexibility (the router cannot improve via the RL loss in the same way the rest of the model does).
- It blocks improvements to MoE capacity (e.g. you cannot increase the number of activated experts).
- It is a workaround. It does not address the underlying mismatch.
GSPO’s framing: don’t paper over the routing flip with replay. Fix the unit of importance sampling so that single-token routing changes wash out.
A second symptom: the long-response sink
There is a more general version of the problem that does not require MoE. As response length grows:
- The product of token ratios $\prod_t w_{i,t}(\theta)$ has variance that grows roughly exponentially in $|y_i|$ when ratios drift even slightly from 1.
- Even the arithmetic mean that GRPO actually uses gets dominated by a few outlier tokens whose ratios are extreme.
- Clipping bounds the surrogate’s value but not the gradient’s variance, because the clip only fires on individual tokens — many tokens just below the clip threshold still contribute oversized updates.
The result: long-response training is fragile under GRPO. You either lower the learning rate, shrink the group, shorten the responses, or eat instabilities. None of those scale.
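The exponential growth is easy to quantify under a simplifying assumption: if per-token log-ratios were independent Gaussians with mean 0 and standard deviation `sigma`, the product of ratios would be lognormal, and so would the geometric mean of the ratios. Real log-ratios are neither independent nor Gaussian, so the sketch below only illustrates the scaling.

```python
import math

def lognormal_variance(log_var):
    """Variance of exp(Z) for Z ~ Normal(0, log_var)."""
    return (math.exp(log_var) - 1.0) * math.exp(log_var)

def product_ratio_variance(sigma, length):
    """Product of `length` token ratios: log-variance is length * sigma^2."""
    return lognormal_variance(length * sigma ** 2)

def geometric_mean_ratio_variance(sigma, length):
    """Geometric mean of the token ratios: log-variance is sigma^2 / length."""
    return lognormal_variance(sigma ** 2 / length)

sigma = 0.05  # assumed per-token log-ratio standard deviation
for length in (100, 1000, 4000):
    print(length,
          product_ratio_variance(sigma, length),
          geometric_mean_ratio_variance(sigma, length))
```

Under this assumption the raw product becomes unusable within a few thousand tokens, while the length-normalized quantity gets tighter as responses get longer.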
Part 3 — Group Sequence Policy Optimization (GSPO)
GSPO’s central change is one line:
Apply importance sampling at the sequence level, then clip and optimize at the sequence level.
Everything else — group-relative advantages, no value model, KL anchor against the reference — stays the same. The implementations are nearly identical at the data-pipeline level. Only the loss differs.
The sequence-level importance ratio
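Define
$$s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|} = \exp\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}\right).$$
Two things to notice:
- It is the sequence likelihood ratio raised to $1/|y_i|$, or equivalently the exponential of the mean log-ratio over tokens. This length-normalization is what keeps $s_i(\theta)$ at the same scale across responses of wildly different length. Without the exponent, $\pi_\theta(y_i \mid x) / \pi_{\theta_{\text{old}}}(y_i \mid x)$ would have variance growing with $|y_i|$ (the unit-of-measure stops being comparable across responses), and the single sequence-level clip range would not make sense.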
- It is computed from the same per-token log-probs as GRPO — no extra forward pass. The difference is entirely in how those log-probs are aggregated.
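A short sketch makes the length-normalization visible. It assumes every token has the same small log-ratio of 0.01, which is an artificial setup chosen so the effect of length is isolated; the function names are hypothetical.

```python
import math

def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp of the mean per-token log-ratio."""
    log_ratios = [a - b for a, b in zip(logp_new, logp_old)]
    return math.exp(sum(log_ratios) / len(log_ratios))

def raw_sequence_ratio(logp_new, logp_old):
    """Unnormalized sequence likelihood ratio, for comparison."""
    return math.exp(sum(logp_new) - sum(logp_old))

for length in (10, 1000):
    logp_old = [-1.00] * length
    logp_new = [-0.99] * length   # every token drifts by +0.01 in log-prob
    print(length,
          sequence_ratio(logp_new, logp_old),
          raw_sequence_ratio(logp_new, logp_old))
```

The normalized ratio is about 1.01 at both lengths, while the raw ratio goes from roughly 1.1 at 10 tokens to over 20,000 at 1,000 tokens, so no single clip range could serve both.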
The GSPO objective
With the sequence-level ratio in hand, the loss is the PPO clip applied once per sequence:
$$J_{\text{GSPO}}(\theta) = \mathbb{E}_{x,\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(s_i(\theta)\, \hat{A}_i,\ \text{clip}\left(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_i\right)\right].$$
That is it. The advantage $\hat{A}_i$ is the same group-normalized scalar GRPO uses; the per-token loop is gone, the per-token clip is gone. Each entire response either contributes its full surrogate term or is clipped as a sequence.
A small but important consequence: the clip range $\varepsilon$ needs to be much smaller than GRPO’s. A typical GRPO $\varepsilon$ is $0.2$, meaning a per-token ratio in $[0.8, 1.2]$. For the geometric-mean sequence ratio over hundreds of tokens, $\varepsilon = 0.2$ is enormous: you essentially never clip. The Qwen paper uses 3e-4 for the lower bound and 4e-4 for the upper, asymmetric. (These look strange the first time you see them; the right intuition is that they are bounding the average per-token log-ratio, which is what $\log s_i(\theta)$ is, so a few ten-thousandths is a lot.)
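In code, the whole objective is one clip per response. This is a minimal sketch on plain Python lists with the KL term omitted; the function name is hypothetical, the default clip bounds follow the 3e-4 / 4e-4 values quoted above, and the population standard deviation is an assumption.

```python
import math
from statistics import mean, pstdev

def gspo_surrogate(logp_new, logp_old, rewards, eps_low=3e-4, eps_high=4e-4):
    """Sequence-level clipped surrogate averaged over one group."""
    mu, sigma = mean(rewards), pstdev(rewards)  # population std is an assumption
    total = 0.0
    for lp_new, lp_old, r in zip(logp_new, logp_old, rewards):
        adv = (r - mu) / sigma
        # Sequence ratio: exp of the mean per-token log-ratio.
        mean_log_ratio = sum(a - b for a, b in zip(lp_new, lp_old)) / len(lp_new)
        s = math.exp(mean_log_ratio)
        s_clipped = min(max(s, 1.0 - eps_low), 1.0 + eps_high)
        total += min(s * adv, s_clipped * adv)
    return total / len(rewards)

# Two responses: the rewarded one has drifted by +0.01 per token on average.
logp_old = [[-1.0] * 5, [-2.0] * 3]
logp_new = [[-0.99] * 5, [-2.0] * 3]
value = gspo_surrogate(logp_new, logp_old, rewards=[1.0, 0.0])
print(value)
```

Here the first response has $s \approx 1.01$, far above the upper bound of $1.0004$, so it is clipped as a whole even though a per-token clip of 0.2 would not have noticed anything.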
Why this fixes the MoE problem
Under GSPO, the IS unit is the sequence, so a single token’s routing flip changes $\log s_i(\theta)$ by only $1/|y_i|$ of the flip’s log-ratio. Average a few thousand tokens together and individual flips average out. The mean log-ratio over the whole sequence stays close to 0 even if a fraction of tokens have switched expert routes, as long as the aggregate sequence likelihood under $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$ is close.
Concretely: if 10% of tokens have meaningfully different expert routes and each contributes a log-ratio of order $\pm\delta$ with roughly independent signs, the per-sequence mean log-ratio is on the order of $\delta\sqrt{0.1/|y_i|}$ in magnitude. That shrinks as the response gets longer, so most of the noise has averaged out within the sequence before the sequence-level clip is applied. Compare this to GRPO, where each of those flipped tokens individually lands outside the per-token clip range and can zero-gradient itself out of the loss.
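A small seeded simulation illustrates the averaging. It assumes a 2,000-token response in which every tenth token has a log-ratio of $+\delta$ or $-\delta$ with a random sign and all other tokens are unchanged; $\delta = 0.5$ and the function name are illustrative choices, not measurements.

```python
import math
import random

def routing_flip_demo(length=2000, flip_every=10, delta=0.5, eps=0.2, seed=0):
    """Return (fraction of tokens outside the token clip range,
    mean log-ratio of the whole sequence)."""
    rng = random.Random(seed)
    log_ratios = []
    for t in range(length):
        if t % flip_every == 0:
            # A "flipped" token: large log-ratio with a random sign.
            log_ratios.append(rng.choice((-1.0, 1.0)) * delta)
        else:
            log_ratios.append(0.0)
    outside = sum(
        1 for lr in log_ratios
        if not (1.0 - eps <= math.exp(lr) <= 1.0 + eps)
    )
    return outside / length, sum(log_ratios) / length

token_fraction, mean_log_ratio = routing_flip_demo()
print("tokens outside [0.8, 1.2]:", token_fraction)
print("sequence mean log-ratio:", mean_log_ratio)
```

Every flipped token sits outside the per-token range, so a tenth of the tokens are affected at the token level, while the sequence-level mean log-ratio can never exceed $\delta / 10$ in magnitude and is typically much smaller because the signs partly cancel.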
The Qwen team’s reported outcome on Qwen3-30B-A3B MoE: GSPO trains stably without Routing Replay, while GRPO requires Routing Replay to not diverge and still trains less efficiently. They were also able to remove Routing Replay from production, simplifying both the training stack and the RL infrastructure.
GSPO-token: a per-token gradient under sequence-level clipping
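There is an interesting subtlety the paper raises. GSPO’s loss is sequence-level, which means every token in a sequence shares the same scalar weight ($\hat{A}_i$ multiplied by the same clipped $s_i(\theta)$). For settings like multi-turn RL where you might want to mask certain tokens (assistant-only loss, tool-output tokens excluded, etc.), the sequence form doesn’t directly let you re-weight individual tokens.
GSPO-token addresses this with a careful rewrite. Define a per-token quantity
$$s_{i,t}(\theta) = \text{sg}\left[s_i(\theta)\right] \cdot \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\text{sg}\left[\pi_\theta(y_{i,t} \mid x, y_{i,<t})\right]},$$
where $\text{sg}[\cdot]$ is the stop-gradient operator. In the forward pass this is exactly $s_i(\theta)$ for every token in response $y_i$. In the backward pass, the gradient of $s_{i,t}(\theta)$ with respect to $\theta$ is
$$\nabla_\theta s_{i,t}(\theta) = s_i(\theta)\, \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}).$$
So the value of the surrogate equals GSPO’s sequence-level surrogate (the clipping condition is evaluated on the same value $s_i(\theta)$), but the gradient flows through each token’s own log-prob, weighted by a sequence-level scalar. Now you can apply a per-token mask $m_{i,t}$ to the loss and have the masked gradient be exactly what you want:
$$J_{\text{GSPO-token}}(\theta) = \mathbb{E}_{x,\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} m_{i,t}\, \min\left(s_{i,t}(\theta)\, \hat{A}_{i,t},\ \text{clip}\left(s_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t}\right)\right].$$
This is exactly GSPO when $\hat{A}_{i,t} = \hat{A}_i$ and $m_{i,t} = 1$ (verify: every token contributes the same surrogate value, and the sum/normalization recovers $J_{\text{GSPO}}$). It is strictly more flexible, since you can have per-token advantages (process supervision) or per-token masks (multi-turn), while preserving GSPO’s sequence-level clipping behaviour.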
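The gradient side of that equivalence can be written out in two lines. This is a sketch for the unclipped case with all masks equal to 1, writing $s_i(\theta)$ for the length-normalized sequence ratio and $s_{i,t}(\theta)$ for its stop-gradient per-token version:

$$
\begin{aligned}
\nabla_\theta \left[ s_i(\theta)\, \hat{A}_i \right]
&= s_i(\theta)\, \hat{A}_i \cdot \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}), \\
\nabla_\theta \left[ \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} s_{i,t}(\theta)\, \hat{A}_{i,t} \right]
&= \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \hat{A}_{i,t}\, s_i(\theta)\, \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}).
\end{aligned}
$$

The first line follows from $s_i(\theta) = \exp\left(\frac{1}{|y_i|}\sum_t \log w_{i,t}(\theta)\right)$ with the old policy held fixed; the second from the stop-gradient definition. Setting $\hat{A}_{i,t} = \hat{A}_i$ makes the two right-hand sides identical, and any other choice of $\hat{A}_{i,t}$ re-weights tokens individually while every token still shares the same sequence-level factor $s_i(\theta)$.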
What stays the same
It is worth being explicit about how little of the GRPO recipe changes:
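- Group sampling. Still $G$ responses per prompt from $\pi_{\theta_{\text{old}}}$.
- Advantage. Still $\hat{A}_i = \left(r(x, y_i) - \text{mean}\right) / \text{std}$, with mean and std taken over the group’s rewards.
- Reference KL. Still optional, still computed with the k3 estimator if used. The GSPO paper omits the KL term from its formulas for brevity; the structure of the KL term is unchanged.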
- No value network. Same.
- Rollout / training loop. Same — same prompts, same generations, same forward passes. Different aggregation in the loss.
The implementation diff between a GRPO trainer and a GSPO trainer is on the order of a few dozen lines.
Part 4 — GRPO vs GSPO, side by side
| Question | GRPO | GSPO |
|---|---|---|
| IS unit | Per token | Per sequence (geometric mean of per-token ratios) |
| Clip unit | Per token, range 0.2 | Per sequence, range 3e-4 |
| Advantage | Group-normalized, broadcast to every token | Group-normalized, applied once per response |
| Per-token credit | Yes (via per-token ratio magnitude) | No, by default — recoverable via GSPO-token |
| MoE routing drift | Each flipped token contributes a noisy ratio | Averages out within the sequence |
| Long responses | Variance grows; clipping fires often | Stable; length-normalization keeps $s_i(\theta)$ in a narrow range |
| Routing Replay needed for MoE | Yes (in practice) | No |
| Value network | None | None |
| Reference KL | Yes (k3 estimator) | Yes (same form) |
| Implementation diff vs PPO | Big (no critic, group baselines) | Bigger (also sequence-level loss) |
A useful one-liner: GRPO replaces the critic with a group; GSPO additionally replaces the per-token surrogate with a per-sequence one. They both throw away a piece of standard PPO machinery. GRPO throws away the value head. GSPO throws away the per-token IS unit.
Another way to see GSPO: it is the natural answer to the question “what is the largest unit you can importance-sample with a single sample?” — for sequence-level rewards, that unit is the sequence, and any attempt to split it into smaller per-token units injects noise without adding information. GRPO’s per-token surrogate is essentially borrowed from PPO, where it was designed for per-step rewards in MDPs (Atari, MuJoCo). LLM RL with end-of-sequence verifier rewards is not that kind of MDP.
Part 5 — When to use which
Some practical heuristics from open recipes:
- Dense model, short responses (under 2k tokens), 0/1 verifier reward. GRPO is fine. The per-token IS unit’s variance is manageable at short length, the implementation is mature, and you save the per-step migration cost. Most math-RL recipes (GSM8K, MATH) live in this regime.
- Dense model, long responses (8k–32k tokens, long-CoT). GRPO works but you’ll want token-level loss normalization, possibly asymmetric clip ranges (DAPO-style), and tight monitoring of clip rates. GSPO is a worth-trying-once switch — the diff is small and the stability win is real.
- MoE model, any context length. GSPO. Strongly. The routing-flip story is real; people have wasted entire H100 clusters on MoE GRPO instability. The Qwen team’s reported result — removing Routing Replay entirely — is on its own enough reason to start with GSPO for MoE RL.
- Multi-turn RL with token masking. GSPO-token. The per-token formulation in the GSPO-token variant gives you the flexibility GSPO lacks while preserving sequence-level IS.
- Process supervision available. GRPO-P or GSPO-token (with per-token advantages). The choice between them is the same as the dense-vs-MoE question above.
Part 6 — A worked example
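Consider one response of length $|y| = 4$ with the following illustrative token log-probabilities under the two policies:
| $t$ | $\log \pi_\theta$ | $\log \pi_{\theta_{\text{old}}}$ | per-token log-ratio |
|---|---|---|---|
| 1 | $-1.00$ | $-1.30$ | $+0.30$ |
| 2 | $-2.00$ | $-1.70$ | $-0.30$ |
| 3 | $-0.50$ | $-0.55$ | $+0.05$ |
| 4 | $-1.20$ | $-1.15$ | $-0.05$ |
Assume this response received a normalized advantage $\hat{A} = +1$, and clip range $\varepsilon = 0.2$ where applicable.
GRPO surrogate (no clipping for simplicity):
$$\frac{1}{4}\left(e^{0.30} + e^{-0.30} + e^{0.05} + e^{-0.05}\right)\hat{A} = \frac{1}{4}\left(1.350 + 0.741 + 1.051 + 0.951\right) \approx 1.023.$$
But notice the token at $t = 1$ has ratio $1.350$, well outside a typical clip range: it is clipped to $1 + \varepsilon = 1.2$ (with a positive advantage the $\min$ selects the lower value, because an upward-drifting ratio is what we want to bound). Token $t = 2$ has ratio $0.741$, also outside $[0.8, 1.2]$, but with a positive advantage the $\min$ keeps the unclipped term $0.741\,\hat{A}$, so its gradient survives; it would be the clipped one if the advantage were negative. The clipped per-token average becomes $\frac{1}{4}\left(1.2 + 0.741 + 1.051 + 0.951\right) \approx 0.986$. One of the four tokens contributes zero gradient, purely because of its own drift.
GSPO surrogate:
$$s(\theta) = \exp\left(\frac{1}{4}\left(0.30 - 0.30 + 0.05 - 0.05\right)\right) = e^{0} = 1, \qquad s(\theta)\,\hat{A} = 1.$$
The per-token log-ratios cancelled out. The sequence-level ratio is 1, no clipping fires, and the full sequence contributes to the loss. The fact that individual tokens drifted in either direction is irrelevant: the sequence as a whole was sampled from a policy that is still close to the current one.
Now flip the scenario: imagine $t = 1$ and $t = 2$ both happened to land on tokens whose expert routing has flipped under $\pi_\theta$. The per-token log-ratios might be $+2$ and $-2$. Under GRPO both tokens land far outside the clip range: with a positive advantage the first is clipped to zero gradient and the second has its gradient scaled down by $e^{-2} \approx 0.14$, and the roles swap for a negative advantage. Under GSPO, $\log s(\theta)$ shifts by $(2 - 2)/4 = 0$, which is completely invisible at the sequence level. That is the MoE-routing intuition in concrete numbers.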
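The arithmetic of a four-token example like this one is easy to check numerically. The log-probabilities below are illustrative assumptions, chosen so that per-token drift cancels at the sequence level; the advantage is $+1$ and the token clip range is 0.2.

```python
import math

# Illustrative token log-probs for one 4-token response (assumed values).
logp_new = [-1.00, -2.00, -0.50, -1.20]
logp_old = [-1.30, -1.70, -0.55, -1.15]
advantage, eps = 1.0, 0.2

log_ratios = [a - b for a, b in zip(logp_new, logp_old)]
token_ratios = [math.exp(lr) for lr in log_ratios]

# GRPO: per-token surrogate without and with the clip.
unclipped_terms = [w * advantage for w in token_ratios]
clipped_terms = [
    min(w * advantage, min(max(w, 1.0 - eps), 1.0 + eps) * advantage)
    for w in token_ratios
]
grpo_unclipped = sum(unclipped_terms) / len(unclipped_terms)
grpo_clipped = sum(clipped_terms) / len(clipped_terms)

# A token has zero gradient when the min picked the (constant) clipped branch.
zero_grad_tokens = [
    t + 1 for t, (u, c) in enumerate(zip(unclipped_terms, clipped_terms)) if c < u
]

# GSPO: one ratio for the whole sequence.
seq_ratio = math.exp(sum(log_ratios) / len(log_ratios))

print("token ratios:", [round(w, 3) for w in token_ratios])
print("GRPO unclipped / clipped:", round(grpo_unclipped, 3), round(grpo_clipped, 3))
print("tokens with zero gradient:", zero_grad_tokens)
print("GSPO sequence ratio:", round(seq_ratio, 6))
```

With a positive advantage only the upward-drifting token loses its gradient; flipping the sign of `advantage` moves the zero-gradient token to the downward-drifting one, and the sequence ratio is unaffected either way.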
Part 7 — Subtleties and gotchas
A few details that bite people implementing these for the first time.
1. Sampling and training must use exact-same logits
PPO/GRPO/GSPO all assume $\pi_{\theta_{\text{old}}}$’s log-probs from rollout time exactly match the log-probs that would be computed by the training-time framework if it ran again. In practice, sampling uses an inference engine (vLLM, SGLang, TRT-LLM) and training uses a training framework (Megatron, DeepSpeed, FSDP). They can disagree in the last bits of fp16/bf16 due to different kernels, different attention implementations, etc. Those last-bit differences appear in the IS ratio as systematic bias and can quietly destabilize training. Common fixes: log-prob a second time inside the training framework rather than trusting the inference engine’s log-probs, or align numerical kernels across the two. Under GSPO this matters less because per-token differences get averaged, but it is still worth being aware of.
2. The KL term’s gradient direction
The k3 estimator estimates $D_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$, with the gradient taken w.r.t. $\theta$. Adding it as a negative term in the maximization objective penalizes drift from $\pi_{\text{ref}}$. Some implementations get the sign wrong (or compute KL in the other direction by mistake) and the policy is then pushed away from the reference, which empirically looks like the model “improving” for a few steps and then collapsing. Read the sign carefully.
3. Per-token advantages need to live in the right place
If you adopt GSPO-token to mix in process supervision, the gradient analysis we did relies on $\hat{A}_{i,t}$ being inside the $\min$/clip with $s_{i,t}(\theta)$. A common bug is to compute the clipped sequence ratio outside and multiply it onto per-token advantages downstream; that recovers GSPO’s gradient, not GSPO-token’s intended per-token-credit gradient. The placement matters.
4. Clip-fraction monitoring
For both algorithms, the fraction of tokens (GRPO) or sequences (GSPO) whose ratio falls outside the clip range is the single most informative training-time diagnostic. If GRPO’s per-token clip fraction keeps climbing over the epochs on a rollout batch, you are likely past the rollout’s useful life: either refresh more aggressively, lower the learning rate, or take fewer epochs per rollout. For GSPO the analogous sequence clip-fraction stays small as long as the per-sequence average log-ratio is well-controlled; runaway clip fraction is a sign that $\varepsilon$ is too tight or the learning rate is too high.
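Both diagnostics can be computed from the per-token log-ratios already needed for the loss. The sketch below uses hypothetical function names and counts only ratios that the clip actually zeroes out, which depends on the sign of the advantage; counting every out-of-range ratio regardless of sign is an equally common convention.

```python
import math

def token_clip_fraction(log_ratios, advantages, eps=0.2):
    """Fraction of tokens whose gradient is removed by the per-token clip.

    log_ratios[i][t] is log(pi_new / pi_old) for token t of response i;
    advantages[i] is the response-level advantage.
    """
    clipped = total = 0
    for lrs, adv in zip(log_ratios, advantages):
        for lr in lrs:
            w = math.exp(lr)
            clipped += (adv > 0 and w > 1 + eps) or (adv < 0 and w < 1 - eps)
            total += 1
    return clipped / total

def sequence_clip_fraction(log_ratios, advantages, eps_low=3e-4, eps_high=4e-4):
    """Fraction of responses removed by the sequence-level clip."""
    clipped = 0
    for lrs, adv in zip(log_ratios, advantages):
        s = math.exp(sum(lrs) / len(lrs))
        clipped += (adv > 0 and s > 1 + eps_high) or (adv < 0 and s < 1 - eps_low)
    return clipped / len(log_ratios)

# Illustrative batch: one response with two large opposite drifts, one unchanged.
batch_log_ratios = [[0.3, -0.3, 0.0, 0.0], [0.0, 0.0]]
batch_advantages = [1.0, -1.0]
print(token_clip_fraction(batch_log_ratios, batch_advantages))
print(sequence_clip_fraction(batch_log_ratios, batch_advantages))
```

Logging both numbers per optimizer step, rather than per rollout, makes it easier to see drift building up across epochs on the same batch.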
5. Group size and reward variance
Group-normalized advantage assumes the group has nonzero reward variance. If all responses to a prompt get the same reward (typically: all wrong or all right), $\text{std} = 0$, the advantage is undefined, and the loss for that prompt is degenerate. Common fixes: drop those prompts from the batch (this is what most implementations do), or use an additive constant in the denominator. Both are fine; what is not fine is to silently produce NaNs.
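Both fixes fit in one small helper. The function name and the `std_floor` parameter are hypothetical, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def safe_group_advantages(rewards, std_floor=None):
    """Group-normalized advantages with explicit zero-variance handling.

    With std_floor=None, a zero-variance group returns None so the caller
    can drop the prompt. Otherwise std_floor is added to the denominator.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if std_floor is None:
        if sigma == 0.0:
            return None  # signal "no learning signal" instead of dividing by zero
        return [(r - mu) / sigma for r in rewards]
    return [(r - mu) / (sigma + std_floor) for r in rewards]

print(safe_group_advantages([1.0, 1.0, 1.0, 1.0]))                  # dropped
print(safe_group_advantages([1.0, 1.0, 1.0, 1.0], std_floor=1e-6))  # all zeros
print(safe_group_advantages([1.0, 0.0]))
```

Returning `None` forces the caller to decide what to do with the prompt, which is harder to get wrong than a division that quietly yields NaN or infinity.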
6. Reference-free GSPO/GRPO
Some recipes (DAPO, for example) run with $\beta = 0$: no reference KL at all. This is viable when (a) the reward is a non-gameable rule-based verifier and (b) the reward signal is rich enough to constrain the policy to coherent outputs. For RM-based rewards (learned preference RMs), you almost certainly want $\beta > 0$ to prevent reward-model exploitation. GSPO inherits the same logic.
Closing
The arc from PPO → GRPO → GSPO is what gradual specialization to LLM RL looks like. PPO came from the deep-RL toolbox with all the per-step machinery intact: value network, per-token IS ratio, per-token clip. GRPO threw out the piece that was clearly broken for sparse-reward long-horizon tasks (the value network) and replaced it with the cheapest thing that worked (a group baseline). GSPO threw out the next piece that was clearly broken once group RL met MoE at scale (the per-token IS ratio) and replaced it with the cheapest thing that worked (a sequence-level IS unit).
Both moves trade off some theoretical generality for a lot of practical stability. Both moves are tied to specific features of LLM RL — sparse end-of-sequence rewards, long responses, large models that are expensive to forward. Neither move is the last one. If you are designing the next post-training algorithm in 2026 the question to ask is: which piece of inherited PPO machinery, now that I look at it with fresh eyes, doesn’t actually fit the regime I’m in?
References
- Shao, Z. et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300, 2024. (GRPO)
- DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
- Zheng, C. et al. Group Sequence Policy Optimization. arXiv:2507.18071, 2025. (GSPO)
- Schulman, J. et al. Proximal Policy Optimization Algorithms. arXiv:1707.06347, 2017.
- Schulman, J. Approximating KL Divergence. http://joschu.net/blog/kl-approx.html.
- Yu, Q. et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476, 2025.