On-policy distillation and reinforcement learning sit much closer together than the names suggest. In fact, the cleanest way to understand on-policy distillation is that it is reinforcement learning with one specific choice of reward. Let me build that up rigorously.
Shared setup
Both operate on a policy $\pi_\theta$: for LLMs, an autoregressive model generating tokens $y_t \sim \pi_\theta(\cdot \mid s_t)$ given a prefix $s_t = (x, y_{1:t-1})$. Both are policy-gradient methods that optimize an expected objective over trajectories the model generates itself (that’s the “on-policy” part).
Reinforcement learning (the general object)
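Objective: maximize expected reward.
$$J_{\mathrm{RL}}(\theta) = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]$$
REINFORCE / policy-gradient estimator, with per-token advantages $A_t$:
$$\nabla_\theta J_{\mathrm{RL}}(\theta) = \mathbb{E}_{y \sim \pi_\theta}\Big[\sum_t A_t \, \nabla_\theta \log \pi_\theta(y_t \mid s_t)\Big]$$
PPO clips the importance ratio $\rho_t = \pi_\theta(y_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(y_t \mid s_t)$; GRPO replaces a learned value baseline with group-normalized advantages $A_i = (R_i - \mathrm{mean}(R)) / \mathrm{std}(R)$.
The defining property: $R$ is a scalar, often delivered once per trajectory (a verifier returns 0/1, a reward model returns one number). The gradient nudges $\log \pi_\theta(y_t \mid s_t)$ only for the single sampled token $y_t$. The signal per token is roughly one scalar smeared across the whole sequence, which is a hard credit-assignment problem.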
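As a small illustration of the group-normalized advantages mentioned above, here is a minimal sketch in plain Python. The function name `group_normalized_advantages`, the `eps` stabilizer, and the choice of the population standard deviation are assumptions made for illustration; real implementations differ on these details.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale the scalar rewards of one group of rollouts.

    Assumption: population std plus a small eps; real implementations vary.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for the same prompt, graded 0/1 by a verifier.
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
```

Every token of rollout $i$ would then receive the same scalar `advantages[i]`, which is the "one scalar smeared across the sequence" situation described above.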
On-policy distillation
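Same on-policy sampling (roll out trajectories from the student), but the per-token reward is the negative reverse-KL to a teacher $\pi_T$:
$$r_t = -D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\big) = -\sum_{a \in V} \pi_\theta(a \mid s_t) \log \frac{\pi_\theta(a \mid s_t)}{\pi_T(a \mid s_t)}$$
So the objective is
$$\min_\theta \; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\Big[\sum_t D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\big)\Big]$$
and in practice the discount factor is set to $\gamma = 0$: because the teacher grades every token immediately, there’s no need to propagate credit across the horizon. Each visited state is optimized independently.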
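To make the per-token reward concrete, the following sketch computes the negative reverse-KL between student and teacher next-token distributions from raw logits. It is a plain-Python illustration under stated assumptions: the helper names `log_softmax`, `reverse_kl`, and `per_token_rewards` are hypothetical, and both models are assumed to share one vocabulary.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the full vocabulary at one position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def per_token_rewards(student_logits_seq, teacher_logits_seq):
    """One dense reward per visited position: the negative reverse-KL."""
    return [-reverse_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)]

# Two positions, vocabulary of size 2: agreement first, disagreement second.
rewards = per_token_rewards([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]])
```

The first reward is zero because the distributions match; the second is negative because they disagree.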
The key derivation that connects them
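The reverse-KL term has a closed-form, differentiable gradient (unlike a black-box reward). At a fixed state $s$, with the teacher held fixed ($\nabla_\theta \pi_T = 0$):
$$\nabla_\theta D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_T(\cdot \mid s)\big) = \sum_{a \in V} \pi_\theta(a \mid s)\,\big[\log \pi_\theta(a \mid s) - \log \pi_T(a \mid s)\big]\,\nabla_\theta \log \pi_\theta(a \mid s)$$
(The entropy-of-self term vanishes because $\sum_a \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$.) Ascending the reward $-D_{\mathrm{KL}}$ therefore weights each $\nabla_\theta \log \pi_\theta(a \mid s)$ by $\log \pi_T(a \mid s) - \log \pi_\theta(a \mid s)$.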
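A quick numerical sanity check of this closed form: for a softmax student parameterized directly by its logits, the sum $\sum_a \pi_\theta(a)\,[\log \pi_\theta(a) - \log \pi_T(a)]\,\nabla \log \pi_\theta(a)$ should match a finite-difference gradient of the reverse KL. The sketch below assumes a three-token vocabulary with made-up logits and teacher probabilities; all names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(logits, teacher_probs):
    p = softmax(logits)
    return sum(pa * (math.log(pa) - math.log(qa)) for pa, qa in zip(p, teacher_probs))

def closed_form_grad(logits, teacher_probs):
    """sum_a p(a) * [log p(a) - log q(a)] * d log p(a) / d logit_j."""
    p = softmax(logits)
    n = len(logits)
    weight = [math.log(p[a]) - math.log(teacher_probs[a]) for a in range(n)]
    grad = []
    for j in range(n):
        # For a softmax, d log p(a) / d logit_j = 1[a == j] - p(j).
        grad.append(sum(p[a] * weight[a] * ((1.0 if a == j else 0.0) - p[j])
                        for a in range(n)))
    return grad

def finite_difference_grad(logits, teacher_probs, h=1e-6):
    """Central differences on the reverse KL, one logit at a time."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        up[j] += h
        down = list(logits)
        down[j] -= h
        grad.append((reverse_kl(up, teacher_probs) - reverse_kl(down, teacher_probs)) / (2 * h))
    return grad

student_logits = [0.2, -0.5, 1.0]   # made-up values
teacher_probs = [0.5, 0.3, 0.2]     # made-up values
analytic = closed_form_grad(student_logits, teacher_probs)
numeric = finite_difference_grad(student_logits, teacher_probs)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

The two gradients agree up to finite-difference error, which is consistent with the extra $\sum_a \pi_\theta \nabla \log \pi_\theta$ term contributing nothing.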
Now compare term-by-term with the RL gradient:
| | per-token “advantage” | summed over | how obtained |
|---|---|---|---|
| RL | scalar $A_t$ | only the sampled token $y_t$ | sampled / estimated |
| On-policy distillation | $\log \pi_T(a \mid s_t) - \log \pi_\theta(a \mid s_t)$ | all $a$ in vocab $V$ | closed-form expectation |
They are the same algebraic object: a per-candidate weight multiplying $\nabla_\theta \log \pi_\theta(a \mid s_t)$. RL gives you one noisy scalar on one token; distillation gives you, in closed form, “how much more the teacher likes token $a$ than you do,” evaluated over the entire vocabulary at every position. Dense vs. sparse, teacher-distribution vs. scalar.
Similarities
- Both are on-policy. Trajectories are sampled from the current $\pi_\theta$, so both train on the model’s own state distribution and both fix exposure bias / compounding error. (Behavior-cloning-style training has error $O(\epsilon H^2)$ in horizon $H$; training on-policy brings it to $O(\epsilon H)$; the classic DAgger argument applies to both.)
- Same optimization machinery. Both are policy-gradient ascent over self-generated rollouts. On-policy distillation drops directly into a PPO/GRPO loop — you just swap the reward function for the per-token negative reverse-KL. Same sampler + trainer infrastructure.
- Both pay the rollout cost. Generation (the expensive part) is required by both.
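- Both support per-token credit assignment / advantages (distillation just usually doesn’t need it, since $\gamma = 0$).
- Formally nested: on-policy distillation ⊂ policy-gradient RL, with $r_t = -D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\big)$ and $\gamma = 0$.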
Differences
- Reward source & density. RL: a scalar from a verifier / reward model / human, typically sparse (one number per trajectory). Distillation: a full teacher distribution over the vocab at every token — orders of magnitude more bits of supervision per rollout.
- Differentiability of reward. RL’s reward is a black box, so you’re stuck with the high-variance score-function estimator. The KL “reward” is computable and differentiable in $\theta$, which gives far lower variance and needs no reward estimation.
- Credit assignment. RL’s central pain: which token in a 10k-token chain earned the final 0/1? Needs many rollouts, baselines, GAE. Distillation grades each token directly → credit assignment essentially solved, hence large sample/compute-efficiency gains (often cited as ~1–2 orders of magnitude in recent write-ups).
- Ceiling / what’s learnable. RL can exceed any existing model: reward is the only ceiling, and exploration can discover genuinely new strategies. On-policy distillation is imitation: its fixed point is $\pi_\theta = \pi_T$ on the visited support, so it’s upper-bounded by the teacher. It explores its own state space but never toward novel capability; the target is always “be the teacher here.”
- Requirements. Distillation needs a competent teacher whose per-token logits are accessible. RL needs only a reward signal (which can be a cheap programmatic verifier, no teacher policy at all).
- Failure modes. RL is prone to reward hacking (exploiting a proxy reward model’s flaws). A full-distribution reverse-KL is much harder to game — but the student faithfully inherits the teacher’s biases and errors.
- KL direction matters. On-policy distillation uses reverse KL — mode-seeking / zero-forcing, which pairs naturally with on-policy sampling and is minimized only at true agreement. Off-policy distillation and ordinary SFT effectively minimize forward KL on teacher data (mass-covering), which is where exposure bias creeps back in.
The unifying 2×2
It’s cleanest to organize the whole post-training family by (where the data comes from) × (how rich the signal is):
| | Dense signal (per-token distribution) | Sparse signal (scalar) |
|---|---|---|
| Off-policy (fixed/teacher data) | Off-policy distillation / SFT on soft labels | Offline RL (sparse) |
| On-policy (student’s own rollouts) | On-policy distillation | RL (RLHF / RLVR) |
On-policy distillation occupies the missing-corner sweet spot: it takes RL’s on-policy sampling (which kills exposure bias) and replaces RL’s sparse, hackable, high-variance scalar reward with distillation’s dense, low-variance teacher target. The cost is that you give up RL’s ability to surpass the teacher and discover new behavior.
A practical reading: use RL when you have a verifiable reward and want to push past existing models; use on-policy distillation when a strong teacher exists and you want its behavior transferred cheaply and robustly into a student — getting most of RL’s distribution-matching benefit at a fraction of the rollouts.
Variance of the two gradient estimators
The cleanest way to state the relationship is exact, not hand-wavy: on-policy distillation’s gradient is the Rao-Blackwellized version of the RL score-function gradient. Everything else follows.
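Fix a state $s$. Both methods are estimating the same per-state quantity
$$g(s) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[w(s, a)\,\nabla_\theta \log \pi_\theta(a \mid s)\big], \qquad w(s, a) = \log \pi_T(a \mid s) - \log \pi_\theta(a \mid s)$$
(recall from before that this expectation is $-\nabla_\theta D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_T)$ at $s$).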
The two estimators differ only in how they handle the action expectation:
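- RL / score-function: draw one $a \sim \pi_\theta(\cdot \mid s)$, return $\hat g_{\mathrm{SF}} = w(s, a)\,\nabla_\theta \log \pi_\theta(a \mid s)$. Unbiased: $\mathbb{E}_a[\hat g_{\mathrm{SF}} \mid s] = g(s)$.
- Distillation / enumerated: compute $\hat g_{\mathrm{enum}} = \sum_{a \in V} \pi_\theta(a \mid s)\, w(s, a)\,\nabla_\theta \log \pi_\theta(a \mid s) = g(s)$ exactly.
Notice $\hat g_{\mathrm{enum}} = \mathbb{E}[\hat g_{\mathrm{SF}} \mid s]$: it is the sampled estimator with the action analytically integrated out, conditioning on the state. Law of total variance gives the whole story:
$$\mathrm{Var}(\hat g_{\mathrm{SF}}) = \underbrace{\mathrm{Var}_s\big(\mathbb{E}[\hat g_{\mathrm{SF}} \mid s]\big)}_{\text{outer: state visitation} \;=\; \mathrm{Var}(\hat g_{\mathrm{enum}})} + \underbrace{\mathbb{E}_s\big[\mathrm{Var}_a(\hat g_{\mathrm{SF}} \mid s)\big]}_{\text{inner: action sampling}}$$
The two share the outer term (both roll out from $\pi_\theta$, so both inherit state-visitation variance). Distillation eliminates the inner term exactly. By Rao-Blackwell this is guaranteed to be a variance reduction, never an increase, with equality only in the degenerate case where the within-state contribution is already constant.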
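The within-state part of this argument can be checked with a small Monte Carlo experiment. The sketch below fixes one state, uses a softmax student over raw logits, and compares the single-sample score-function estimator against the enumerated sum with weight $\log \pi_T(a) - \log \pi_\theta(a)$. The logits, teacher probabilities, sample count, and function names are all made up for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_function_terms(student_logits, teacher_probs):
    """Return p and, for each action a, the vector w(a) * d log p(a) / d logits."""
    p = softmax(student_logits)
    n = len(p)
    terms = []
    for a in range(n):
        w = math.log(teacher_probs[a]) - math.log(p[a])
        terms.append([w * ((1.0 if a == j else 0.0) - p[j]) for j in range(n)])
    return p, terms

def enumerated_gradient(student_logits, teacher_probs):
    """Action expectation computed exactly: no sampling noise at this state."""
    p, terms = score_function_terms(student_logits, teacher_probs)
    n = len(p)
    return [sum(p[a] * terms[a][j] for a in range(n)) for j in range(n)]

def sampled_gradients(student_logits, teacher_probs, num_samples, seed=0):
    """One score-function estimate per sampled action."""
    rng = random.Random(seed)
    p, terms = score_function_terms(student_logits, teacher_probs)
    actions = rng.choices(range(len(p)), weights=p, k=num_samples)
    return [terms[a] for a in actions]

student_logits = [0.0, 1.0, -1.0]   # made-up values
teacher_probs = [0.6, 0.1, 0.3]     # made-up values
exact = enumerated_gradient(student_logits, teacher_probs)
samples = sampled_gradients(student_logits, teacher_probs, num_samples=20000)
mean = [sum(s[j] for s in samples) / len(samples) for j in range(3)]
# Total variance (trace of the covariance) of the single-sample estimator.
variance = sum(sum((s[j] - mean[j]) ** 2 for j in range(3)) for s in samples) / len(samples)
```

The sample mean should land close to `exact`, while `variance` stays clearly positive: that positive number is the inner term the enumerated estimator removes.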
Why the trick is available to distillation but not to general RL. The enumerated sum costs one teacher forward pass (softmax hands you all $|V|$ logits at once) plus the student pass you were already doing. You do not pay $|V|\times$. In a black-box-reward RL problem you cannot Rao-Blackwellize cheaply: evaluating $A(s, a)$ for every $a \in V$ would require $|V|$ reward queries or rollouts per state. The teacher’s differentiable per-token distribution is precisely what makes integrating out the action tractable. That single fact is the engine behind the “1–2 orders of magnitude fewer rollouts” claims.
And real RL is worse than this clean bound suggests, because in practice the advantage $\hat A_t$ is not the nice closed-form weight $w(s, a)$. It’s a separately estimated scalar advantage carrying its own noise $\mathrm{Var}(\hat A_t)$, and it’s usually a single terminal reward broadcast to all tokens. So the real RL estimator has two extra variance sources on top of the inner action-variance term: reward-estimation noise and the credit-assignment ambiguity of attributing one terminal scalar across positions. The terminal-reward broadcast makes the summed-gradient variance grow with horizon. None of these exist in distillation: the teacher logprobs are deterministic given $s$, and every token is graded locally.
If you track signal-to-noise ratio $\mathrm{SNR} = \|\mathbb{E}[\hat g]\|^2 / \mathrm{Var}(\hat g)$, distillation dominates per rollout: same numerator, smaller denominator. That higher SNR is the “variance budget” it later gets to spend on rollout reuse (§3).
Why $\gamma = 0$ is formally justified
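In RL you keep $\gamma \approx 1$ because reward is sparse: with terminal-only reward, $\gamma = 0$ would zero out the learning signal at every non-terminal token and you’d learn nothing. Propagating the terminal reward backward is the problem RL has to solve. So why is the opposite, pure myopia, correct for distillation?
Write the genuine on-policy distillation objective with the student’s own state-visitation distribution $d^{\pi_\theta}$:
$$\mathcal{L}(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}}\big[D_s(\theta)\big] = \sum_s d^{\pi_\theta}(s)\, D_s(\theta), \qquad D_s(\theta) = D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_T(\cdot \mid s)\big)$$
The honest subtlety, the thing $\gamma > 0$ would normally capture, is that $d^{\pi_\theta}$ depends on $\theta$: changing your action at $s_t$ changes which future states you visit, and hence future KL costs. Product rule:
$$\nabla_\theta \mathcal{L}(\theta) = \underbrace{\sum_s d^{\pi_\theta}(s)\,\nabla_\theta D_s(\theta)}_{\text{myopic term } (\gamma = 0)} + \underbrace{\sum_s \nabla_\theta d^{\pi_\theta}(s)\, D_s(\theta)}_{\text{visitation term}}$$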
There are two complementary justifications for dropping the second term.
(a) Fixed-point consistency: the dropped term is benign. The global optimum is $\pi_\theta = \pi_T$ at every reachable state, which simultaneously zeroes every $D_s$. The dropped visitation term is weighted by $D_s$ itself, so as you approach the optimum it vanishes (first-order in the KL residual) while the myopic term remains the dominant driver: near the optimum $D_s$ is quadratic in the policy deviation, whereas $\nabla_\theta D_s$ is only linear in it. The two objectives share the same stationary point, and the myopic gradient is an asymptotically aligned descent direction. Myopic optimization cannot be pulled away from the right answer, because at the answer the term it ignores is exactly zero. Contrast sparse-reward RL, where dropping future credit changes the fixed point entirely (you’d converge to “do nothing”).
(b) Detached sampling makes $\gamma = 0$ exact, not approximate. What you actually optimize on-policy is a surrogate that samples states from a detached prior policy $\pi_{\theta_{\mathrm{old}}}$ (à la PPO / DAgger):
$$\mathcal{L}_{\mathrm{surr}}(\theta) = \mathbb{E}_{s \sim d^{\pi_{\theta_{\mathrm{old}}}}}\Big[D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_T(\cdot \mid s)\big)\Big]$$
With the visitation distribution held fixed per outer iteration, this is genuinely a sum of independent per-state KL minimizations. There is no future to discount, because the future-visitation dependence has been pushed entirely into the outer loop. $\gamma = 0$ is the correct setting for the surrogate that on-policy iteration actually optimizes, and refreshing $\theta_{\mathrm{old}} \leftarrow \theta$ each round drives $\pi_\theta \to \pi_T$ as a fixed point of the iteration. This is the DAgger move exactly: {roll out under current policy; supervise toward the expert at visited states; repeat}. Exposure bias dies not through long-horizon backups but through re-rollout: the state distribution self-corrects across iterations.
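The roll out / supervise / repeat loop can be sketched on a toy problem. Everything below is a hypothetical illustration, not a training recipe: a two-token vocabulary, a horizon of two, a tabular student with one logit vector per prefix, and a made-up `TEACHER` table. As a simplification, each state visited in an outer iteration gets exactly one closed-form reverse-KL gradient step, regardless of how often it was visited.

```python
import math
import random

# Hypothetical toy teacher: next-token probabilities for each prefix state.
TEACHER = {"": [0.7, 0.3], "0": [0.2, 0.8], "1": [0.9, 0.1]}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(p, q):
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q))

def kl_grad_wrt_logits(logits, q):
    """Closed-form gradient of KL(softmax(logits) || q) for one state."""
    p = softmax(logits)
    w = [math.log(pa / qa) for pa, qa in zip(p, q)]
    kl = sum(pa * wa for pa, wa in zip(p, w))
    return [pa * (wa - kl) for pa, wa in zip(p, w)]

def train(outer_iters=200, rollouts=8, lr=1.0, seed=0):
    rng = random.Random(seed)
    logits = {s: [0.0, 0.0] for s in TEACHER}  # tabular student, starts uniform
    for _ in range(outer_iters):
        # Detached sampler: states come from a frozen copy of the student.
        frozen = {s: list(z) for s, z in logits.items()}
        visited = set()
        for _ in range(rollouts):
            prefix = ""
            for _ in range(2):  # visited states are prefixes of length 0 and 1
                visited.add(prefix)
                token = rng.choices([0, 1], weights=softmax(frozen[prefix]))[0]
                prefix += str(token)
        # Independent per-state KL step at each visited state (gamma = 0).
        for s in visited:
            grad = kl_grad_wrt_logits(logits[s], TEACHER[s])
            logits[s] = [z - lr * g for z, g in zip(logits[s], grad)]
    return logits

final_logits = train()
final_kl = {s: reverse_kl(softmax(final_logits[s]), TEACHER[s]) for s in TEACHER}
```

No return or value backup appears anywhere in the loop; the only coupling between time steps is which prefixes the frozen sampler happens to visit, refreshed at every outer iteration.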
So the deep reason: the teacher reward is locally complete. It specifies the target behavior at $s_t$ without reference to the future, so there is nothing for a backup operator to propagate. (One caveat to avoid overclaiming: this is not literally potential-based reward shaping of the task reward. Distillation’s objective “match the teacher” only coincides with the task objective if the teacher is task-optimal. The justification above is the product-rule/detached-surrogate one, which holds regardless.)
Importance sampling for rollout reuse
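Both methods are on-policy, so both pay for fresh rollouts. IS is the shared escape hatch, and the standard identity is
$$\mathbb{E}_{\tau \sim \pi_\theta}\big[f(\tau)\big] = \mathbb{E}_{\tau \sim \mu}\Big[\frac{\pi_\theta(\tau)}{\mu(\tau)}\, f(\tau)\Big]$$
with $\mu$ the stale behavior policy that generated the cached rollouts.
In RL this is literally PPO. The per-token ratio $\rho_t = \pi_\theta(y_t \mid s_t) / \mu(y_t \mid s_t)$ (with $\mu = \pi_{\theta_{\mathrm{old}}}$) is an IS weight, and it’s what licenses taking several gradient steps on one batch of rollouts. Clipping, $\mathrm{clip}(\rho_t, 1 - \epsilon, 1 + \epsilon)$, caps the weight to bound variance/bias as $\pi_\theta$ drifts from $\mu$. Note PPO deliberately uses per-step ratios (with the advantage carrying the future) rather than the full-trajectory product $\prod_t \rho_t$, whose variance grows exponentially in horizon. That is the classic reason naive trajectory IS is unusable for long sequences.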
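A minimal sketch of the two weighting schemes contrasted above: the per-token clipped surrogate and the naive full-trajectory product. The function names and the default `eps=0.2` are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one token (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def trajectory_weight(ratios):
    """Naive full-trajectory importance weight: the product of per-token ratios."""
    product = 1.0
    for r in ratios:
        product *= r
    return product

# A mild 10% per-token drift compounds into a huge weight over 100 tokens.
long_rollout_weight = trajectory_weight([1.1] * 100)
```

With a positive advantage, ratios above `1 + eps` earn no extra objective, so the update has no incentive to push the policy further from the behavior policy on that token.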
In distillation there’s a genuinely interesting asymmetry. Decompose what an IS weight is correcting for. In RL it corrects two things at once: (a) the action you happened to sample at $s_t$, and (b) the trajectory that led you to $s_t$. In enumerated distillation, (a) is gone (you sum over all actions analytically, you never sampled one), so the only mismatch left to correct is the state-visitation drift:
| | what the IS weight corrects | form of the weight |
|---|---|---|
| RL (PPO) | sampled action and visitation | per-token $\rho_t$, advantage carries future |
| On-policy distillation | visitation only | prefix product $\prod_{k=1}^{t-1} \rho_k$; action expectation enumerated |
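For the distillation row, the visitation-only weight at position $t$ is the product of the per-token ratios over the prefix that led there, so the first position always has weight 1. A minimal sketch, assuming per-token log-probabilities of the cached tokens under the current student and under the stale behavior policy are available; the function and argument names are hypothetical.

```python
import math

def prefix_visitation_weights(student_logps, behavior_logps):
    """Weight for position t: product over earlier positions of student / behavior.

    Both arguments are per-token log-probabilities of the cached tokens.
    """
    weights = []
    log_w = 0.0  # empty product for the first position
    for lp_student, lp_behavior in zip(student_logps, behavior_logps):
        weights.append(math.exp(log_w))
        log_w += lp_student - lp_behavior
    return weights

# The student now likes the first cached token twice as much as the sampler did.
weights = prefix_visitation_weights(
    [math.log(0.5), math.log(0.2)],
    [math.log(0.25), math.log(0.4)],
)
```

The ratio at position $t$ itself never enters the weight for position $t$, because the action expectation there is enumerated rather than sampled.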
In practice, if you take only a few off-policy steps and the policy hasn’t drifted, people drop even the visitation weight (set it to 1) and treat slightly-stale rollouts as on-policy. This is valid when the trust region keeps $D_{\mathrm{KL}}(\mu \,\|\, \pi_\theta)$ small over the rollout. And distillation can afford more reuse than RL: its underlying per-step estimator already has far lower variance (§1), so it has variance budget to spend on staleness. Conversely, because each distillation token carries full-distribution signal, a single rollout is so information-rich that the need for aggressive reuse is lower in the first place.
The teacher-as-behavior-policy case ties it back to KL direction. Push reuse to the extreme and sample from $\mu = \pi_T$, the teacher itself. This is “off-policy distillation” viewed as IS-corrected on-policy distillation, with weights $\prod_{k=1}^{t-1} \pi_\theta(y_k \mid s_k) / \pi_T(y_k \mid s_k)$. The correction is unbiased in principle, but its variance explodes exactly on the states where student and teacher disagree, which are precisely the states that matter for learning. That is the formal reason teacher-sampled distillation underperforms: not that it’s wrong, but that the IS correction needed to make it on-policy has catastrophic variance on the disagreement set. And if you omit the correction entirely (that’s just SFT on teacher data), you silently switch to the biased forward-KL / mass-covering objective and exposure bias walks back in (connecting to the KL-direction point from the first message).
Shared variance-control toolkit. Both lean on the same machinery: weight clipping (PPO) or truncation (TIS, V-trace’s truncated $\rho_t$ / $c_t$ weights in IMPALA), effective sample size $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$ to decide when cached rollouts are too stale to trust and must be refreshed, and a trust-region constraint $D_{\mathrm{KL}}(\mu \,\|\, \pi_\theta) \le \delta$ to bound IS variance. The bias-variance dial is identical in both: more gradient steps per batch → cheaper but staler → larger weights → more variance, clipped back at the cost of bias.
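Two of these tools fit in a few lines. The sketch below computes the effective sample size $(\sum_i w_i)^2 / \sum_i w_i^2$ of a batch of importance weights and applies simple truncation; the `needs_refresh` helper and its `min_fraction=0.5` threshold are hypothetical choices for illustration, not a recommended setting.

```python
def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2; equals len(weights) when all weights are equal."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq

def truncate(weights, cap):
    """Truncated importance sampling: cap each weight from above (adds bias)."""
    return [min(w, cap) for w in weights]

def needs_refresh(weights, min_fraction=0.5):
    """Hypothetical staleness rule: refresh rollouts when ESS falls below a fraction."""
    return effective_sample_size(weights) < min_fraction * len(weights)

fresh = [1.0, 1.0, 1.0, 1.0]    # on-policy: every weight is 1
stale = [8.0, 0.1, 0.1, 0.1]    # one rollout dominates the batch
```

Here `needs_refresh(fresh)` is false and `needs_refresh(stale)` is true; truncating `stale` at a cap of 2.0 raises its ESS, at the cost of a biased estimate.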
The through-line across all three: distillation’s dense, differentiable teacher signal lets you analytically integrate out the action (§1 Rao-Blackwell), which makes the per-state problem locally complete so you don’t need horizon credit (§2 $\gamma = 0$), and the resulting low variance is what buys you tolerance for aggressive rollout reuse (§3 IS). RL forfeits all three because its reward is a sparse black-box scalar, and gets, in exchange, the one thing distillation can’t have: the ability to exceed the teacher.