Difference Between On-Policy Distillation and Reinforcement Learning

On-policy distillation and reinforcement learning sit much closer together than their names suggest. In fact, the cleanest way to understand on-policy distillation is as reinforcement learning with one specific choice of reward. Let me build that up rigorously.

Shared setup

Both operate on a policy $\pi_\theta(y\mid x)$: for LLMs, an autoregressive model generating tokens $y_t$ given a prefix $s_t = (x, y_{<t})$. Both are policy-gradient methods that optimize an expected objective over trajectories the model generates itself (that’s the “on-policy” part).

Reinforcement learning (the general object)

Objective: maximize expected reward.

$J(\theta) = \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}\big[R(x,y)\big]$

REINFORCE / policy-gradient estimator, with per-token advantages $A_t$ :

$\nabla_\theta J = \mathbb{E}_{y\sim\pi_\theta}\Big[\sum_t A_t\,\nabla_\theta\log\pi_\theta(y_t\mid s_t)\Big]$

PPO clips the importance ratio $r_t=\tfrac{\pi_\theta}{\pi_{\theta_{old}}}$ ; GRPO replaces a learned value baseline with group-normalized advantages $A_i=\tfrac{R_i-\mathrm{mean}(R)}{\mathrm{std}(R)}$ .
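As a small illustration of the GRPO formula above, here is a minimal sketch of group-normalized advantages for one prompt. The function name, the use of the population standard deviation, and the `eps` guard against a zero-variance group are all assumptions made for this example rather than details taken from any particular implementation.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (R_i - mean(R)) / (std(R) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, graded 0/1 by a hypothetical verifier.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that when every rollout in the group gets the same reward, all advantages are zero and the group contributes no gradient, which is one face of the sparse-signal problem discussed next.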

The defining property: $R$ is a scalar, often delivered once per trajectory (a verifier returns 0/1, a reward model returns one number). The gradient moves $\log\pi_\theta(y_t\mid s_t)$ only for the single sampled token $y_t$. The signal each token receives is therefore roughly one scalar spread across the whole sequence, which is the hard credit-assignment problem.

On-policy distillation

Same on-policy sampling — roll out trajectories from the student $\pi_\theta$ — but the per-token reward is the negative reverse-KL to a teacher $p$ :

$r_t = -\,D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid s_t)\,\|\,p(\cdot\mid s_t)\big) = -\sum_v \pi_\theta(v\mid s_t)\big[\log\pi_\theta(v\mid s_t) - \log p(v\mid s_t)\big]$

So the objective is

$\min_\theta\; \mathbb{E}_{y\sim\pi_\theta}\Big[\sum_t D_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid s_t)\,\|\,p(\cdot\mid s_t)\big)\Big]$

and in practice the discount factor is set to $\gamma=0$ : because the teacher grades every token immediately, there’s no need to propagate credit across the horizon. Each visited state is optimized independently.
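The per-token reward and the summed objective above can be sketched in a few lines. This is a plain-Python illustration under stated assumptions: logits arrive as lists of floats, one list per position, and the function names are made up for the example. A real training loop would use a tensor library and batch this.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def reverse_kl(student_logits, teacher_logits):
    """D_KL(student || teacher) at one position, summed over the full vocabulary."""
    log_q = log_softmax(student_logits)
    log_p = log_softmax(teacher_logits)
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))

def distillation_loss(student_logits_seq, teacher_logits_seq):
    """Sum of per-position reverse KLs along one student rollout (gamma = 0)."""
    return sum(reverse_kl(s, t)
               for s, t in zip(student_logits_seq, teacher_logits_seq))

# Per-token reward r_t is simply the negated per-position KL.
reward_t = -reverse_kl([0.0, 0.0, 0.0], [2.0, 0.0, -1.0])
```

Both sets of logits must be evaluated at the same student-generated prefix; that is the only place where "on-policy" enters this computation.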

The key derivation that connects them

The reverse-KL term has a closed-form, differentiable gradient (unlike a black-box reward). At a fixed state, with $p$ teacher-fixed ( $\nabla_\theta\log p=0$ ):

$-\nabla_\theta D_{\mathrm{KL}}(\pi_\theta\|p) = \mathbb{E}_{v\sim\pi_\theta}\big[(\log p(v) - \log\pi_\theta(v))\,\nabla_\theta\log\pi_\theta(v)\big]$

(The extra term that comes from differentiating $\log\pi_\theta(v)$ inside the bracket vanishes because $\sum_v\pi_\theta(v)\nabla_\theta\log\pi_\theta(v)=\nabla_\theta\sum_v\pi_\theta(v)=0$.)
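The identity can be checked numerically. The sketch below assumes the simplest possible parameterization, where $\theta$ is the student's logit vector at one state, so that $\nabla_z\log\pi(v)=\mathrm{onehot}(v)-\pi$. It compares the enumerated score-function expression against a central finite difference of $-D_{\mathrm{KL}}$. All numbers and function names are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def neg_reverse_kl(z, p):
    q = softmax(z)
    return -sum(qv * (math.log(qv) - math.log(pv)) for qv, pv in zip(q, p))

def score_function_gradient(z, p):
    """E_{v~q}[(log p(v) - log q(v)) * grad_z log q(v)], enumerated over v."""
    q = softmax(z)
    w = [math.log(pv) - math.log(qv) for qv, pv in zip(q, p)]
    mean_w = sum(qv * wv for qv, wv in zip(q, w))
    # With grad_z log q(v) = onehot(v) - q, coordinate k of the expectation
    # collapses to q_k * (w_k - E_q[w]).
    return [qv * (wv - mean_w) for qv, wv in zip(q, w)]

def finite_difference_gradient(z, p, h=1e-6):
    grad = []
    for k in range(len(z)):
        up = z[:k] + [z[k] + h] + z[k + 1:]
        down = z[:k] + [z[k] - h] + z[k + 1:]
        grad.append((neg_reverse_kl(up, p) - neg_reverse_kl(down, p)) / (2 * h))
    return grad

student_logits = [0.2, -1.0, 0.7]   # illustrative values
teacher_probs = [0.6, 0.1, 0.3]     # illustrative values
analytic = score_function_gradient(student_logits, teacher_probs)
numeric = finite_difference_gradient(student_logits, teacher_probs)
```

The two vectors should agree up to finite-difference error, and the analytic one sums to zero because shifting all logits by a constant leaves the softmax unchanged.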

Now compare term-by-term with the RL gradient:

	per-token “advantage”	summed over	how obtained
RL	scalar $A_t$	only sampled token $y_t$	sampled / estimated
On-policy distillation	$\log p(v)-\log\pi_\theta(v)$	all $v$ in vocab	closed-form expectation

They are the same algebraic object — a per-candidate weight multiplying $\nabla_\theta\log\pi_\theta$ . RL gives you one noisy scalar on one token; distillation gives you, in closed form, “how much more the teacher likes token $v$ than you do,” evaluated over the entire vocabulary at every position. Dense vs. sparse, teacher-distribution vs. scalar.

Similarities

Both are on-policy. Trajectories are sampled from the current $\pi_\theta$ , so both train on the model’s own state distribution and both fix exposure bias / compounding error. (Behavior-cloning-style training has error $O(\epsilon H^2)$ in horizon $H$ ; training on-policy brings it to $O(\epsilon H)$ — the classic DAgger argument applies to both.)
Same optimization machinery. Both are policy-gradient ascent over self-generated rollouts. On-policy distillation drops directly into a PPO/GRPO loop — you just swap the reward function for the per-token negative reverse-KL. Same sampler + trainer infrastructure.
Both pay the rollout cost. Generation (the expensive part) is required by both.
Both support per-token credit assignment / advantages (distillation just usually doesn’t need it, $\gamma=0$ ).
Formally nested: on-policy distillation ⊂ policy-gradient RL, with per-token reward $r_t = -D_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid s_t)\,\|\,p(\cdot\mid s_t)\big)$.

Differences

Reward source & density. RL: a scalar from a verifier / reward model / human, typically sparse (one number per trajectory). Distillation: a full teacher distribution over the vocab at every token — orders of magnitude more bits of supervision per rollout.
Differentiability of reward. RL’s reward is a black box → you’re stuck with the high-variance score-function estimator. The KL “reward” is computable and differentiable in $\theta$ → far lower variance, no reward estimation.
Credit assignment. RL’s central pain: which token in a 10k-token chain earned the final 0/1? Needs many rollouts, baselines, GAE. Distillation grades each token directly → credit assignment essentially solved, hence large sample/compute-efficiency gains (often cited as ~1–2 orders of magnitude in recent write-ups).
Ceiling / what’s learnable. RL can exceed any existing model — reward is the only ceiling, and exploration can discover genuinely new strategies. On-policy distillation is imitation: its fixed point is $\pi_\theta = p$ on the visited support, so it’s upper-bounded by the teacher. It explores its own state space but never toward novel capability — the target is always “be the teacher here.”
Requirements. Distillation needs a competent teacher whose per-token logits are accessible. RL needs only a reward signal (which can be a cheap programmatic verifier, no teacher policy at all).
Failure modes. RL is prone to reward hacking (exploiting a proxy reward model’s flaws). A full-distribution reverse-KL is much harder to game — but the student faithfully inherits the teacher’s biases and errors.
KL direction matters. On-policy distillation uses reverse KL $D_{\mathrm{KL}}(\pi_\theta\|p)$ — mode-seeking / zero-forcing, which pairs naturally with on-policy sampling and is minimized only at true agreement. Off-policy distillation and ordinary SFT effectively minimize forward KL on teacher data (mass-covering), which is where exposure bias creeps back in.
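A toy calculation makes the mode-seeking versus mass-covering contrast concrete. The three distributions below are invented for illustration: a teacher with two equal modes and one nearly forbidden token, and two candidate students that are deliberately restricted so neither can match the teacher exactly. With an unrestricted student both divergences are minimized at $\pi_\theta=p$; the example only shows which imperfect candidate each direction prefers.

```python
import math

def kl(a, b):
    """D_KL(a || b) for two discrete distributions given as lists."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

teacher = [0.499, 0.499, 0.002]      # two modes, third token almost never emitted
candidates = {
    "mode_seeker": [0.98, 0.01, 0.01],   # commits to one teacher mode
    "mass_coverer": [0.34, 0.33, 0.33],  # spreads mass, including on the bad token
}

reverse = {name: kl(q, teacher) for name, q in candidates.items()}  # KL(student || teacher)
forward = {name: kl(teacher, q) for name, q in candidates.items()}  # KL(teacher || student)
```

Reverse KL scores the mode-seeking candidate better because it heavily penalizes mass placed where the teacher has almost none, while forward KL scores the mass-covering candidate better because it penalizes missing any teacher mode.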

The unifying 2×2

It’s cleanest to organize the whole post-training family by (where the data comes from) × (how rich the signal is):

	Dense signal (per-token distribution)	Sparse signal (scalar)
Off-policy (fixed/teacher data)	Off-policy distillation / SFT on soft labels	Offline RL (sparse)
On-policy (student’s own rollouts)	On-policy distillation	RL (RLHF / RLVR)

On-policy distillation occupies the missing-corner sweet spot: it takes RL’s on-policy sampling (which kills exposure bias) and replaces RL’s sparse, hackable, high-variance scalar reward with distillation’s dense, low-variance teacher target. The cost is that you give up RL’s ability to surpass the teacher and discover new behavior.

A practical reading: use RL when you have a verifiable reward and want to push past existing models; use on-policy distillation when a strong teacher exists and you want its behavior transferred cheaply and robustly into a student — getting most of RL’s distribution-matching benefit at a fraction of the rollouts.

Variance of the two gradient estimators

The cleanest way to state the relationship is exact, not hand-wavy: on-policy distillation’s gradient is the Rao-Blackwellized version of the RL score-function gradient. Everything else follows.

Fix a state $s$ . Both methods are estimating the same per-state quantity

$g(s) = \mathbb{E}_{v\sim\pi_\theta(\cdot\mid s)}\big[w_v\,\nabla_\theta\log\pi_\theta(v\mid s)\big],\qquad w_v = \log p(v\mid s)-\log\pi_\theta(v\mid s)$

(recall from before that this expectation is $-\nabla_\theta D_{\mathrm{KL}}(\pi_\theta\|p)$ at $s$ ).

The two estimators differ only in how they handle the action expectation:

RL / score-function: draw one $v\sim\pi_\theta$ , return $\hat g_{\text{samp}} = w_v\nabla_\theta\log\pi_\theta(v)$ . Unbiased: $\mathbb{E}[\hat g_{\text{samp}}\mid s]=g(s)$ .
Distillation / enumerated: compute $\hat g_{\text{enum}} = \sum_v \pi_\theta(v)\,w_v\,\nabla_\theta\log\pi_\theta(v) = g(s)$ exactly.

Notice $\hat g_{\text{enum}} = \mathbb{E}[\hat g_{\text{samp}}\mid s]$ — it is the sampled estimator with the action analytically integrated out, conditioning on the state. Law of total variance gives the whole story:

$\underbrace{\mathrm{Var}(\hat g_{\text{samp}})}_{\text{RL}} = \underbrace{\mathrm{Var}_s\big(g(s)\big)}_{\text{across visited states}} + \underbrace{\mathbb{E}_s\big[\mathrm{Var}_{v\mid s}(\hat g_{\text{samp}})\big]}_{\text{within-state action noise}},\qquad \underbrace{\mathrm{Var}(\hat g_{\text{enum}})}_{\text{distillation}} = \mathrm{Var}_s\big(g(s)\big)$

The two share the outer term (both roll out from $\pi_\theta$, so both inherit state-visitation variance). Distillation eliminates the inner term exactly. By Rao-Blackwell this is guaranteed to be a variance reduction, never an increase, with equality only in the degenerate case where $\hat g_{\text{samp}}$ does not depend on which action was sampled.
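The within-state term can be computed exactly for a tiny vocabulary, with no sampling at all. The sketch below again assumes $\theta$ is the logit vector at a single state, and the helper names are hypothetical. It returns the enumerated gradient together with the trace of the covariance that the one-sample estimator would have at that state, which is precisely the term enumeration removes.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def per_action_gradients(z, p):
    """g_v = w_v * grad_z log q(v) for every action v, with softmax logits z."""
    q = softmax(z)
    grads = []
    for v in range(len(z)):
        w = math.log(p[v]) - math.log(q[v])
        grads.append([w * ((1.0 if k == v else 0.0) - q[k]) for k in range(len(z))])
    return q, grads

def enumerated_gradient_and_action_variance(z, p):
    q, grads = per_action_gradients(z, p)
    dim = len(z)
    # Enumerated estimator: the exact expectation over actions.
    g_enum = [sum(q[v] * grads[v][k] for v in range(dim)) for k in range(dim)]
    # Trace of Cov(g_samp | s) = E||g_v||^2 - ||E g_v||^2.
    second_moment = sum(q[v] * sum(x * x for x in grads[v]) for v in range(dim))
    action_variance = second_moment - sum(x * x for x in g_enum)
    return g_enum, action_variance

g_enum, action_variance = enumerated_gradient_and_action_variance(
    [0.2, -1.0, 0.7], [0.6, 0.1, 0.3])  # illustrative student logits, teacher probs
```

When the student already matches the teacher every $w_v$ is zero, so the within-state variance vanishes along with the gradient.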

Why the trick is available to distillation but not to general RL. The enumerated sum costs one teacher forward pass (the softmax hands you log-probabilities for all $K$ vocabulary tokens at once) plus the student pass you were already doing. You do not pay $K\times$. In a black-box-reward RL problem you cannot Rao-Blackwellize cheaply: evaluating $w_v$ for every $v$ would require $K$ reward queries or rollouts per state. The teacher’s differentiable per-token distribution is precisely what makes integrating out the action tractable. That single fact is the engine behind the “1–2 orders of magnitude fewer rollouts” claims.

And real RL is worse than this clean bound suggests, because in practice $w_v$ is not the nice $\log p - \log\pi$ . It’s a separately estimated scalar advantage carrying its own noise $\mathrm{Var}(A\mid v)$ , and it’s usually a single terminal reward broadcast to all $H$ tokens. So the real RL estimator has two extra variance sources on top of the inner action-variance term: reward-estimation noise and the credit-assignment ambiguity of attributing one terminal scalar across $H$ positions. The terminal-reward broadcast makes the summed-gradient variance grow with horizon. None of these exist in distillation: the teacher logprobs are deterministic given $s$ , and every token is graded locally.

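If you track signal-to-noise ratio $\mathrm{SNR}=\|\mu\|^2/\mathrm{tr}(\mathrm{Cov})$, with $\mu$ the mean gradient and $\mathrm{Cov}$ the estimator’s covariance, distillation dominates per rollout: same numerator, and a denominator that is never larger and is strictly smaller outside the degenerate case. That higher SNR is the “variance budget” it later gets to spend on rollout reuse (see the importance-sampling section below).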

Why $\gamma=0$ is formally justified

In RL you keep $\gamma\approx1$ because reward is sparse: with terminal-only reward, $\gamma=0$ would zero out the learning signal at every non-terminal token and you’d learn nothing. Propagating the terminal reward backward is the problem. So why is the opposite — pure myopia — correct for distillation?

Write the genuine on-policy distillation objective with the student’s own state-visitation distribution $d^{\pi_\theta}$ :

$V(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta}}\big[\mathrm{KL}_s(\theta)\big],\qquad \mathrm{KL}_s(\theta)=D_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid s)\,\|\,p(\cdot\mid s)\big)$

The honest subtlety — the thing $\gamma>0$ would normally capture — is that $d^{\pi_\theta}$ depends on $\theta$ : changing your action at $s_t$ changes which future states you visit, and hence future KL costs. Product rule:

$\nabla_\theta V = \underbrace{\mathbb{E}_{s\sim d^{\pi_\theta}}\big[\nabla_\theta \mathrm{KL}_s(\theta)\big]}_{\text{myopic term} = \text{what }\gamma=0\text{ uses}} \;+\; \underbrace{\sum_s \big(\nabla_\theta d^{\pi_\theta}(s)\big)\,\mathrm{KL}_s(\theta)}_{\text{visitation term, dropped}}$

There are two complementary justifications for dropping the second term.

(a) Fixed-point consistency: the dropped term is benign. The global optimum is $\pi_\theta=p$ at every reachable state, which simultaneously zeroes every $\mathrm{KL}_s$. The dropped visitation term is weighted by $\mathrm{KL}_s(\theta)$ itself, and near the optimum $\mathrm{KL}_s$ is quadratic in the student-teacher gap while the myopic gradient $\nabla_\theta\mathrm{KL}_s$ is linear in it. The dropped term therefore vanishes one order faster, and the myopic term remains the dominant driver. The two objectives share the stationary point $\pi_\theta=p$, and the myopic gradient is an asymptotically aligned descent direction. Myopic optimization cannot be pulled away from the right answer, because at the answer the term it ignores is exactly zero. Contrast sparse-reward RL, where dropping future credit changes the fixed point entirely (you’d converge to “do nothing”).

(b) Detached sampling makes $\gamma=0$ exact, not approximate. What you actually optimize on-policy is a surrogate that samples states from a detached prior policy (à la PPO / DAgger):

$\mathcal{L}(\theta) = \mathbb{E}_{s\sim d^{\pi_{\theta_{\text{old}}}}}\big[D_{\mathrm{KL}}(\pi_\theta(\cdot\mid s)\,\|\,p)\big],\quad \text{stop-gradient through the sampler.}$

With the visitation distribution held fixed per outer iteration, this is genuinely a sum of independent per-state KL minimizations — there is no future to discount, because the future-visitation dependence has been pushed entirely into the outer loop. $\gamma=0$ is the correct setting for the surrogate that on-policy iteration actually optimizes, and refreshing $\theta_{\text{old}}$ each round drives $\pi_\theta\to p$ as a fixed point of the iteration. This is the DAgger move exactly: {roll out under current policy; supervise toward the expert at visited states; repeat}. Exposure bias dies not through long-horizon backups but through re-rollout — the state distribution self-corrects across iterations.
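The roll-out, supervise, repeat loop can be sketched on a toy problem. Everything here is an illustrative assumption: a two-token vocabulary, a horizon of three, a tabular student with one logit pair per prefix, and a made-up teacher. States are sampled first and then held fixed while the per-state reverse KL is reduced, which plays the role of the stop-gradient through the sampler.

```python
import math
import random

VOCAB = (0, 1)
HORIZON = 3

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def teacher_probs(state):
    """Hypothetical fixed teacher: a distribution over VOCAB for each prefix."""
    if not state:
        return [0.9, 0.1]
    return [0.8, 0.2] if state[-1] == 0 else [0.3, 0.7]

def reverse_kl(q, p):
    return sum(qv * math.log(qv / pv) for qv, pv in zip(q, p))

def sample_states(logits, n_rollouts, rng):
    """Roll out the current student and return the visited prefixes."""
    visited = []
    for _ in range(n_rollouts):
        state = ()
        for _step in range(HORIZON):
            visited.append(state)
            q = softmax(logits.setdefault(state, [0.0, 0.0]))
            token = 0 if rng.random() < q[0] else 1
            state += (token,)
    return visited

def myopic_step(logits, state, lr):
    """One gradient step on the per-state reverse KL, ignoring future states."""
    q = softmax(logits[state])
    p = teacher_probs(state)
    w = [math.log(pv) - math.log(qv) for qv, pv in zip(q, p)]
    mean_w = sum(qv * wv for qv, wv in zip(q, w))
    for k in VOCAB:
        logits[state][k] += lr * q[k] * (w[k] - mean_w)  # ascent on -KL

def train(rounds=20, n_rollouts=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = {}
    for _ in range(rounds):
        visited = sample_states(logits, n_rollouts, rng)  # states fixed this round
        for state in visited:
            myopic_step(logits, state, lr)
    return logits

student = train()
root_kl = reverse_kl(softmax(student[()]), teacher_probs(()))
```

No return or value estimate appears anywhere in the loop. The only coupling between time steps is that the next round's rollouts come from the updated student.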

So the deep reason: the teacher reward is locally complete. It specifies the target behavior at $s_t$ without reference to the future, so there is nothing for a backup operator to propagate. (One caveat to avoid overclaiming: this is not literally potential-based reward shaping of the task reward — distillation’s objective “match the teacher” only coincides with the task objective if the teacher is task-optimal. The justification above is the product-rule/detached-surrogate one, which holds regardless.)

Importance sampling for rollout reuse

Both methods are on-policy, so both pay for fresh rollouts. Importance sampling (IS) is the shared escape hatch, and the standard identity is

$\mathbb{E}_{y\sim\pi_\theta}[f(y)] = \mathbb{E}_{y\sim\mu}\Big[\tfrac{\pi_\theta(y)}{\mu(y)}\,f(y)\Big],\qquad \rho(y)=\tfrac{\pi_\theta(y)}{\mu(y)},$

with $\mu$ the stale behavior policy that generated the cached rollouts.

In RL this is literally PPO. The per-token ratio $r_t=\pi_\theta(y_t\mid s_t)/\pi_{\theta_{\text{old}}}(y_t\mid s_t)$ is an IS weight, and it’s what licenses taking several gradient steps on one batch of rollouts. Clipping, $\min\!\big(r_tA_t,\ \mathrm{clip}(r_t,1\pm\epsilon)A_t\big)$ , caps the weight to bound variance/bias as $\pi_\theta$ drifts from $\pi_{\text{old}}$ . Note PPO deliberately uses per-step ratios (with the advantage carrying the future) rather than the full-trajectory product $\prod_t r_t$ , whose variance grows exponentially in horizon $H$ — the classic reason naive trajectory IS is unusable for long sequences.
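For reference, the clipped per-token term described above looks like this in plain Python. The function names and the default $\epsilon=0.2$ are choices made for this sketch; the ratio is computed from log-probabilities, which is how it is usually stored.

```python
import math

def importance_ratio(logp_new, logp_old):
    """r_t = pi_theta(y_t | s_t) / pi_old(y_t | s_t), from log-probabilities."""
    return math.exp(logp_new - logp_old)

def ppo_clipped_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A), to be maximized."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# A token whose probability rose by 50% under the new policy, with positive advantage:
term = ppo_clipped_term(importance_ratio(math.log(0.3), math.log(0.2)), advantage=1.0)
```

The `min` makes the clip one-sided: it removes the incentive to push the ratio further in the direction the advantage favors, but leaves the penalty intact when the ratio has moved the wrong way.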

In distillation there’s a genuinely interesting asymmetry. Decompose what an IS weight is correcting for. In RL it corrects two things at once: (a) the action you happened to sample at $t$ , and (b) the trajectory that led you to $s_t$ . In enumerated distillation, (a) is gone — you sum over all actions analytically, you never sampled one — so the only mismatch left to correct is the state-visitation drift:

	what the IS weight corrects	form of the weight
RL (PPO)	sampled action and visitation	per-token $r_t$ , advantage carries future
On-policy distillation	visitation only	prefix product $\prod_{t'<t}\rho_{t'}$ ; action expectation enumerated

In practice, if you take only a few off-policy steps and the policy hasn’t drifted, people drop even the visitation weight (set it to 1) and treat slightly stale rollouts as on-policy. This is valid when the trust region keeps $D_{\mathrm{KL}}(\pi_{\text{old}}\|\pi_\theta)$ small over the rollout. Distillation can also afford more reuse than RL: its underlying per-step estimator already has far lower variance (see the variance section above), so it has variance budget to spend on staleness. Conversely, because each distillation token carries full-distribution signal, a single rollout is so information-rich that the need for aggressive reuse is lower in the first place.
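The visitation-only weight for enumerated distillation can be sketched as a running prefix product. The function name and the optional `cap` (a simple truncation) are assumptions of this example. The weight at position $t$ uses only tokens before $t$, since the action at $t$ itself is enumerated rather than sampled.

```python
import math

def prefix_visitation_weights(logp_new, logp_old, cap=None):
    """w_t = prod_{t' < t} pi_theta(y_t') / mu(y_t'), with w_0 = 1.

    logp_new / logp_old hold per-token log-probabilities of the cached
    rollout under the current policy and the stale behavior policy.
    """
    weights, log_w = [], 0.0
    for lp_new, lp_old in zip(logp_new, logp_old):
        w = math.exp(log_w)
        weights.append(min(w, cap) if cap is not None else w)
        log_w += lp_new - lp_old  # token t only affects positions after t
    return weights

new = [math.log(0.5), math.log(0.25), math.log(0.1)]   # illustrative values
old = [math.log(0.25), math.log(0.5), math.log(0.1)]   # illustrative values
weights = prefix_visitation_weights(new, old)
truncated = prefix_visitation_weights(new, old, cap=1.5)
```

A reweighted loss would then be $\sum_t w_t\,\mathrm{KL}_{s_t}$, and setting every weight to 1 recovers the "treat stale rollouts as on-policy" shortcut.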

The teacher-as-behavior-policy case ties it back to KL direction. Push reuse to the extreme and sample from $\mu=p$, the teacher itself. This is “off-policy distillation” viewed as IS-corrected on-policy distillation, with $\rho_t=\pi_\theta(y_t\mid s_t)/p(y_t\mid s_t)$. The correction is unbiased in principle, but $\mathrm{Var}(\rho)$ explodes exactly on the states where student and teacher disagree, which are precisely the states that matter for learning. That is the formal reason teacher-sampled distillation underperforms: not that it’s wrong, but that the IS correction needed to make it on-policy has catastrophic variance on the disagreement set. And if you omit the correction entirely (that’s just SFT on teacher data), you silently switch to the biased forward-KL / mass-covering objective and exposure bias walks back in, which is the “KL direction matters” point from the Differences list.
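To make the variance claim concrete at a single state, here is a short worked calculation. It is standard importance-sampling algebra rather than anything specific to this setup. With actions drawn from the teacher, the weight $\rho(v)=\pi_\theta(v\mid s)/p(v\mid s)$ has mean 1, so

$\mathrm{Var}_{v\sim p}(\rho) = \sum_v p(v\mid s)\,\frac{\pi_\theta(v\mid s)^2}{p(v\mid s)^2} - 1 = \sum_v \frac{\pi_\theta(v\mid s)^2}{p(v\mid s)} - 1 = \chi^2\big(\pi_\theta(\cdot\mid s)\,\|\,p(\cdot\mid s)\big)$

This is zero only when student and teacher agree at $s$, and it is dominated by tokens the student favors but the teacher rarely emits. Over a prefix the per-token weights multiply, so these per-state second moments compound with length.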

Shared variance-control toolkit. Both lean on the same machinery: weight clipping (PPO) or truncation (truncated importance sampling, and V-trace’s clipped $\rho$ / $c$ weights in IMPALA); effective sample size $\mathrm{ESS}=\big(\sum_i\rho_i\big)^2/\sum_i\rho_i^2$ to decide when cached rollouts are too stale to trust and must be refreshed; and a trust-region constraint $D_{\mathrm{KL}}(\pi_{\text{old}}\|\pi_\theta)\le\delta$ to bound IS variance. The bias-variance dial is identical in both: more gradient steps per batch → cheaper but staler → larger weights → more variance, clipped back at the cost of bias.
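The ESS diagnostic is a one-liner. The refresh rule below, including the 0.5 threshold, is a hypothetical policy chosen for illustration, not a recommended setting.

```python
def effective_sample_size(weights):
    """ESS = (sum rho_i)^2 / sum rho_i^2; equals n when all weights are equal."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def needs_refresh(weights, min_fraction=0.5):
    """Hypothetical rule: resample once ESS falls below a fraction of the batch."""
    return effective_sample_size(weights) < min_fraction * len(weights)

fresh = effective_sample_size([1.0, 1.0, 1.0, 1.0])   # equal weights
stale = effective_sample_size([8.0, 0.1, 0.1, 0.1])   # one rollout dominates
```

Equal weights give an ESS equal to the batch size, while a single dominant weight drives it toward 1, meaning the batch is effectively one sample.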

The through-line across all three: distillation’s dense, differentiable teacher signal lets you analytically integrate out the action (the Rao-Blackwell argument), which makes the per-state problem locally complete so you don’t need horizon credit (the $\gamma=0$ argument), and the resulting low variance is what buys you tolerance for aggressive rollout reuse (the importance-sampling argument). RL forfeits all three because its reward is a sparse black-box scalar, and gets in exchange the one thing distillation can’t have: the ability to exceed the teacher.