Inside GLM-5.2: IndexShare, KVShare, and the End-to-End TV Loss

On 2026-06-19 Z.ai (Zhipu) shipped GLM-5.2, a 753B-parameter open-weight Mixture-of-Experts model released under the MIT license. It tops the independent Artificial Analysis Intelligence Index and — more interestingly for this post — it does something most frontier models still treat as aspirational: it serves a stable 1M-token context (up from 200K in GLM-5.1) and stays fast and cheap while doing it.

The headline benchmarks are a coding-and-agents story (it beats GPT-5.5 on several long-horizon coding benchmarks at roughly 1/6 the price). But the engineering that makes those numbers affordable is what’s worth studying. GLM-5.2 leans on three tightly coupled ideas, two of which are written up in standalone arXiv papers:

IndexShare — cross-layer reuse of the sparse-attention indexer, so 1M-token attention stops re-deciding which tokens to attend to in every layer. (IndexCache, arXiv:2603.12201)
KVShare + rejection sampling + an end-to-end TV loss — a rebuilt Multi-Token-Prediction (MTP) stack for speculative decoding that stays fast even under the high-entropy regime of RL. (Bebop, arXiv:2606.12370)
The slime agentic-RL stack — parallel on-policy distillation to merge a dozen expert models, critic-based PPO for long-horizon tasks, and an online anti-hacking guard.

This post walks through all three with the math, the figures, and the code.

Table of contents

The base: MLA + DeepSeek Sparse Attention
IndexShare: reuse the index across layers
- Picking which layers keep an indexer
- What it buys
KVShare + rejection sampling + the end-to-end TV loss
Training: slime, OPD merging, and long-horizon RL
Benchmarks
Takeaways

The base: MLA + DeepSeek Sparse Attention

GLM-5.2 inherits the now-standard frontier recipe: a deep MoE transformer with Multi-head Latent Attention (MLA) for a compressed KV cache, plus DeepSeek Sparse Attention (DSA) for sub-quadratic long-context attention. The blog and model card disclose the total size (753B, BF16 safetensors, glm_moe_dsa architecture) and the 1M context, but not the full per-layer config; the academic companion paper reports the immediately-prior GLM-5 at 744B, so 5.2 is a same-class MoE.

DSA is the relevant background, so a quick recap. Standard softmax attention costs $O(L^2)$. DSA inserts a cheap lightning indexer before each attention op: for query position $t$ at layer $\ell$ it produces a score vector $\mathbf{I}_t^{(\ell)} \in \mathbb{R}^{L}$ over all preceding tokens, keeps only the top-$k$ (GLM/DSA use $k = 2048$),

\mathcal{T}_t^{(\ell)} = \text{Top-}k\!\left(\mathbf{I}_t^{(\ell)}\right), \qquad \mathbf{q}_t^{(\ell)} = \mathrm{softmax}\!\left(\mathbf{I}_t^{(\ell)}\right),

and runs full-resolution attention only over those $k$ tokens. Core attention drops from $O(L^2)$ to $O(Lk)$. (For the full NSA→DSA story see the earlier post on DeepSeek’s sparse attention.)
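A minimal sketch of that selection step for a single query position, in plain Python. The function names and the unscaled dot-product attention are illustrative assumptions rather than DSA's actual kernels, and the indexer scores are simply passed in.

```python
import math

def top_k_indices(scores, k):
    """Indices of the k largest scores (ties go to the lower index), sorted ascending."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])

def sparse_attention_row(query, keys, values, index_scores, k):
    """Softmax attention for one query, restricted to the top-k indexed tokens."""
    selected = top_k_indices(index_scores, k)
    logits = [sum(a * b for a, b in zip(query, keys[i])) for i in selected]
    peak = max(logits)                                  # subtract max for stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
            for d in range(dim)]
```

Only `len(selected)` keys are touched per query, which is where the $O(Lk)$ core-attention cost comes from; producing `index_scores` still touches every preceding token.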

The catch, and the opening GLM-5.2 exploits, is that the indexer itself is still $O(L^2)$ and runs independently in every layer. At $L = 1\text{M}$, recomputing a million-wide relevance score in all $N$ layers becomes the dominant cost, even though the core attention it gates is now cheap.

IndexShare: reuse the index across layers

The key empirical observation (from the IndexCache paper, same Zhipu/Tsinghua group) is that the top-$k$ selections are highly similar across consecutive layers. Adjacent layers keep re-deciding to attend to nearly the same earlier tokens. So why pay for the indexer $N$ times?

IndexShare partitions the $N$ layers into two roles, encoded as a binary pattern $\mathbf{c} = c_1 c_2 \cdots c_N$ with $c_\ell \in \{\texttt{F}, \texttt{S}\}$:

Full (F) layers run their own indexer and compute a fresh $\mathcal{T}_t^{(\ell)}$.
Shared (S) layers have no indexer; they inherit the index set from the nearest preceding Full layer:

\mathcal{T}_t^{(\ell)} \leftarrow \mathcal{T}_t^{(f(\ell))}, \qquad f(\ell) = \max\{\, j < \ell : c_j = \texttt{F} \,\}.
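The mapping $f(\ell)$ can be precomputed once from the pattern string. The sketch below is an illustration, not the paper's code: the name `serving_full_layer` is made up, layers are 1-indexed as in the text, and Full layers are mapped to themselves for convenience (the formula above only defines $f$ for Shared layers).

```python
def serving_full_layer(pattern):
    """For each 1-indexed layer, return the layer whose index set it uses.

    pattern is a string over {'F', 'S'}; layer 1 must be 'F'.
    """
    if not pattern or pattern[0] != "F":
        raise ValueError("layer 1 must be Full")
    source, last_full = [], 0
    for layer, role in enumerate(pattern, start=1):
        if role == "F":
            last_full = layer            # fresh index set computed here
        elif role != "S":
            raise ValueError(f"unknown role {role!r}")
        source.append(last_full)         # Shared layers point at the last Full layer
    return source
```

For a 1/4 retention pattern such as `"FSSSFSSS"`, layers 2 to 4 read from layer 1 and layers 6 to 8 read from layer 5.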

In GLM-5.2 the productized setting is one indexer shared across every 4 layers (a 1/4 retention pattern), which the blog reports cuts per-token FLOPs by 2.9× at 1M context. The attention pattern stays adaptive — Full layers still choose freely — the model just stops repeatedly re-deciding what to attend to. Side-by-side, the inference loops differ by a tiny branch:

(a) Standard DSA
for ℓ = 1..N:
    I ← Indexerℓ(X)             # O(L²)
    T ← Top-k(I)
    X ← SparseAttnℓ(X, T)       # O(Lk)
    X ← FFNℓ(X)

(b) IndexCache / IndexShare
for ℓ = 1..N:
    if cℓ == F:                 # Full layer
        I ← Indexerℓ(X)         # O(L²)
        T ← Top-k(I)
        T_cache ← T
    else:                       # Shared layer
        T ← T_cache             # reuse, O(1)
    X ← SparseAttnℓ(X, T)       # O(Lk)
    X ← FFNℓ(X)

The total indexer cost goes from $O(NL^2)$ to $O(N_{\texttt{F}} L^2)$ while the $O(NLk)$ core attention is untouched. At 1/4 retention you delete 75% of the indexer compute.
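As a rough illustration of the branch and of the indexer-call accounting, here is a toy Python version of loop (b). It is a sketch under assumptions: attention and FFN are omitted, each indexer is a zero-argument callable returning a score list, and `run_index_share` is a made-up name.

```python
def run_index_share(pattern, indexers, k):
    """Walk the layers, calling an indexer only on Full layers.

    pattern:  string over {'F', 'S'}, one character per layer
    indexers: one callable per layer returning a list of scores
              (entries for Shared layers are never called and may be None)
    Returns (index set used by each layer, number of indexer calls).
    """
    used, calls, cache = [], 0, None
    for role, indexer in zip(pattern, indexers):
        if role == "F":
            scores = indexer()                       # the O(L^2) step
            calls += 1
            order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
            cache = sorted(order[:k])                # top-k, refreshed
        elif cache is None:
            raise ValueError("a Shared layer needs a preceding Full layer")
        used.append(cache)                           # Shared layers reuse the cache
    return used, calls
```

With the pattern `"FSSS"` the indexer runs once for four layers, so `calls / len(pattern)` is 0.25, matching the 75% reduction quoted above.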

Picking which layers keep an indexer

There are two ways to choose the pattern $\mathbf{c}$.

Training-free (greedy search). Given an already-trained DSA model, greedily flip Full→Shared, each time choosing the flip that least increases language-modeling loss on a small calibration set. Layer 1 is always kept Full.

Algorithm 1 — Greedy IndexCache pattern search
Input : DSA model M (N layers), calibration batches 𝒟, target #Shared layers K
Output: pattern c
1.  c ← Fᴺ                                  # start all-Full
2.  ℛ ← {2, 3, …, N}                        # layer 1 stays Full
3.  for step = 1 … K:
4.      ℓ* ← argmin_{ℓ∈ℛ} EvalLoss(M, 𝒟, c with cℓ→S)
5.      c_{ℓ*} ← S ;  ℛ ← ℛ \ {ℓ*}
6.  return c

This requires no weight updates and recovers almost all quality: on the 30B DSA model the greedy 1/4 pattern restores the long-context average from 43.0 (naïve uniform interleaving) back to 49.9, against a full-indexer baseline of 50.2.
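Algorithm 1 translates almost line for line into Python. In this sketch `eval_loss` is a hypothetical caller-supplied function that scores a pattern string on the calibration set; everything about how it runs the model is left out.

```python
def greedy_pattern_search(num_layers, num_shared, eval_loss):
    """Greedily flip Full -> Shared, keeping layer 1 Full.

    eval_loss(pattern) is a caller-supplied (hypothetical) function that
    returns calibration loss for a pattern string such as 'FSFS'.
    """
    if not 0 <= num_shared <= num_layers - 1:
        raise ValueError("num_shared must be between 0 and num_layers - 1")
    pattern = ["F"] * num_layers
    candidates = set(range(1, num_layers))      # 0-based; index 0 is layer 1
    for _ in range(num_shared):
        def loss_if_flipped(idx):
            trial = pattern.copy()
            trial[idx] = "S"
            return eval_loss("".join(trial))
        # ties are broken toward the earliest layer
        best = min(sorted(candidates), key=loss_if_flipped)
        pattern[best] = "S"
        candidates.remove(best)
    return "".join(pattern)
```

Each step costs one loss evaluation per remaining candidate, so the search needs on the order of $K \cdot N$ calibration passes and no gradient updates.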

Training-aware (multi-layer distillation). When you can train (from scratch or via continued pre-training, as GLM-5.2 does), you can do better than hoping the indices transfer. In standard DSA each indexer is distilled against its own layer’s aggregated attention distribution $\mathbf{p}_t^{(\ell)}$. IndexShare instead trains each retained indexer to serve all the layers that will reuse it. If Full layer $\ell$ serves Shared layers $\ell{+}1,\dots,\ell{+}m$:

\mathcal{L}^{\mathrm{I}}_{\mathrm{multi}} = \sum_{j=0}^{m} \frac{1}{m+1} \sum_{t} D_{\mathrm{KL}}\!\left( \mathbf{p}_t^{(\ell+j)} \,\big\|\, \mathbf{q}_t^{(\ell)} \right).

A clean result (Proposition 1 in the paper) is that this is gradient-equivalent to distilling against the averaged attention target $\bar{\mathbf{p}}_t = \frac{1}{m+1}\sum_{j=0}^{m}\mathbf{p}_t^{(\ell+j)}$:

\nabla_\theta \,\mathcal{L}^{\mathrm{I}}_{\mathrm{multi}} = \nabla_\theta \sum_t D_{\mathrm{KL}}\!\left( \bar{\mathbf{p}}_t \,\big\|\, \mathbf{q}_t^{(\ell)} \right).

(Proof is one line: $\mathbf{p}$ is detached, so $\nabla_\theta D_{\mathrm{KL}}(\mathbf{p}\|\mathbf{q}) = -\nabla_\theta\sum_s \mathbf{p}(s)\log\mathbf{q}(s)$ is linear in $\mathbf{p}$, and a weighted average of linear terms equals the term evaluated at the averaged target.) The retained indexer is thus trained to predict a consensus top-$k$ that is jointly useful for every layer it serves. That is why training-aware IndexShare lets even a simple uniform interleave match the full-indexer baseline, removing the pattern sensitivity that the training-free route had to search around.
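The gradient equivalence can be checked numerically on a toy example. The sketch below uses made-up three-token targets and treats the indexer's logits as the parameters; it compares finite-difference gradients of the multi-layer loss and of the averaged-target loss.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def numeric_grad(loss, logits, eps=1e-6):
    """Central finite-difference gradient of loss(logits)."""
    grad = []
    for i in range(len(logits)):
        up = logits[:i] + [logits[i] + eps] + logits[i + 1:]
        down = logits[:i] + [logits[i] - eps] + logits[i + 1:]
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad

# Toy attention targets for layers l, l+1, l+2 (m = 2); values are illustrative.
targets = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
mean_target = [sum(col) / len(targets) for col in zip(*targets)]
logits = [0.3, -0.2, 0.5]

def multi_layer_loss(z):
    q = softmax(z)
    return sum(kl(p, q) for p in targets) / len(targets)

def averaged_target_loss(z):
    return kl(mean_target, softmax(z))

grad_multi = numeric_grad(multi_layer_loss, logits)
grad_avg = numeric_grad(averaged_target_loss, logits)
max_gap = max(abs(a - b) for a, b in zip(grad_multi, grad_avg))
```

The two loss values differ by a constant (the entropy terms of the targets), but `max_gap` sits at numerical noise, and both gradients equal $\mathbf{q} - \bar{\mathbf{p}}$ with respect to the logits.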

What it buys

Figure from Bai et al. (2026), arXiv:2603.12201: bar chart of the relative speedup of IndexCache over the DSA baseline (=100%) on the 30B model across prefill latency, per-request decode, and full decode throughput, at 1/2 and 1/4 retention. Gains grow with context length, up to 1.82× prefill and 1.48× decode at 200K.

On a 30B DSA model served in SGLang on an H100 node, deleting 75% of the indexers (1/4 retention) yields, at 200K tokens:

Context	Config	Prefill (s) ↓	Decode/req (tok/s) ↑	Decode full (tok/s) ↑
200K	DSA baseline	19.5	58.0	197
200K	IndexCache `1/2`	13.7	73.0	253
200K	IndexCache `1/4`	10.7	86.0	297
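The speedup factors follow directly from the table rows. A small sketch that recomputes them (the dictionary simply restates the 200K numbers above; `speedups` is an illustrative helper name):

```python
# 200K-token rows from the table: (prefill seconds, decode tok/s per request, full decode tok/s)
measurements = {
    "DSA baseline": (19.5, 58.0, 197.0),
    "IndexCache 1/2": (13.7, 73.0, 253.0),
    "IndexCache 1/4": (10.7, 86.0, 297.0),
}

def speedups(config, baseline="DSA baseline"):
    """Speedup factors relative to the baseline (above 1.0 means faster)."""
    base_prefill, base_req, base_full = measurements[baseline]
    prefill, req, full = measurements[config]
    return {
        "prefill": base_prefill / prefill,        # latency: lower is better, so invert
        "decode_per_request": req / base_req,     # throughput: higher is better
        "decode_full": full / base_full,
    }

for name in ("IndexCache 1/2", "IndexCache 1/4"):
    print(name, {key: round(val, 2) for key, val in speedups(name).items()})
```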

That’s 1.82× prefill and 1.48× decode at 200K — and the advantage only widens as context grows toward 1M, which is exactly the regime GLM-5.2 targets. The paper’s preliminary runs on the production 744B GLM-5 confirm it scales:

Figure 1 from Bai et al. (2026), arXiv:2603.12201: grouped bars comparing GLM-5 vs GLM-5 + IndexCache across long-context and reasoning benchmarks. Removing 50% of indexer computations holds long-context and reasoning quality while delivering ~1.2–1.3× end-to-end speedup at production scale.

KVShare + rejection sampling + the end-to-end TV loss

The second pillar is GLM-5.2’s rebuilt Multi-Token Prediction (MTP) stack for speculative decoding. This is where IndexShare, KVShare, rejection sampling, and the end-to-end TV loss combine — and the cleanest way to motivate them is to first see why naïve MTP gets slow exactly when you need it most: during RL.

MTP and the entropy bound

In MTP speculative decoding, $\gamma$ lightweight draft heads propose candidate tokens $\hat y_{t+1},\dots,\hat y_{t+\gamma}$ with distribution $q(\cdot)$, and the target model verifies them in a single forward pass with distribution $p(\cdot)$. The expected number of tokens accepted per step, the quantity you want to maximize, is

\mathbb{E}[L] = \sum_{j=1}^{\gamma} \prod_{i=1}^{j} \alpha_i, \qquad \alpha_i = \text{per-step acceptance rate}.
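A short sketch of this formula in Python, useful for seeing how much the early steps matter. It counts accepted draft tokens only, exactly as the sum above does; the acceptance rates in the usage note are made-up numbers.

```python
def expected_accept_length(alphas):
    """Sum over j of the product of the first j per-step acceptance rates."""
    total, running = 0.0, 1.0
    for alpha in alphas:
        running *= alpha      # probability that steps 1..j are all accepted
        total += running
    return total
```

With the same three rates in a different order, `expected_accept_length([0.5, 0.9, 0.9])` is smaller than `expected_accept_length([0.9, 0.9, 0.5])`, because a weak first step scales down every later term.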

The Bebop paper (Qwen team) asks whether MTP can accelerate RL training — where rollouts dominate end-to-end time — and finds a nasty obstacle: the acceptance rate is fundamentally bounded by the target model’s entropy, and RL deliberately raises entropy to encourage exploration. The two factors usually blamed (draft/target distribution mismatch from weight updates) turn out to be secondary; the dominant driver is the entropy fluctuation $\mathcal{H}(p) = -\sum_v p(v)\log p(v)$ .

Figure 1(a) from Li et al. (2026), arXiv:2606.12370: scatter of mean policy entropy (x) vs. MTP accept length (y). Each point is the mean entropy and accept length at one RL step across Qwen3.5/3.6/3.7 runs. Target-only acceptance degrades linearly with policy entropy; rejection sampling + the e2e TV loss is much flatter and higher, largely removing the entropy dependence.

Why does entropy bite? It depends on how you verify.

Target-only (greedy) sampling. Pick the draft’s $\arg\max$ and accept it with the target’s probability there. For a well-trained draft this gives

\alpha^{\mathrm{TO}} = \max_y p(y),

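which tends to shrink as $\mathcal{H}(p)$ grows. It is not a strict function of entropy, but Jensen’s inequality gives the floor $\max_y p(y) \ge \exp(-\mathcal{H}(p))$, so a low-entropy target guarantees high acceptance, and empirically the relationship is near-linear,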

\alpha^{\mathrm{TO}} \approx a^{\mathrm{TO}} - b^{\mathrm{TO}}\cdot \mathcal{H}(p), \qquad b^{\mathrm{TO}} > 0.

So as RL pushes entropy up, greedy acceptance drops in step with it.
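To make the entropy link concrete, the sketch below tabulates entropy, $\max_y p(y)$, and the Jensen floor $\exp(-\mathcal{H}(p))$ for a family of toy distributions. The `peaked` family (one dominant token, the rest uniform over a 50-token vocabulary) is an assumption chosen for illustration only.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def peaked(top, vocab):
    """Toy distribution: mass `top` on one token, the rest spread evenly."""
    rest = (1.0 - top) / (vocab - 1)
    return [top] + [rest] * (vocab - 1)

rows = []
for top in (0.95, 0.7, 0.4, 0.1):
    p = peaked(top, vocab=50)
    h = entropy(p)
    rows.append((h, max(p), math.exp(-h)))   # (entropy, alpha_TO, Jensen floor)

for h, alpha, floor in rows:
    print(f"H={h:.3f}  alpha_TO={alpha:.3f}  exp(-H)={floor:.3f}")
```

In this family the target-only acceptance falls as entropy rises and never drops below the floor.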

Probabilistic rejection sampling. Draw $\hat y \sim q$ and accept with probability $\min(1, p(\hat y)/q(\hat y))$. The expected acceptance rate is the full distributional overlap:

\alpha^{\mathrm{RS}} = \mathbb{E}_{\hat y\sim q}\!\left[\min\!\left(1,\tfrac{p(\hat y)}{q(\hat y)}\right)\right] = \sum_y \min\big(p(y), q(y)\big) = 1 - d_{\mathrm{TV}}(p,q),

where $d_{\mathrm{TV}}(p,q) = \tfrac12\sum_y|p(y)-q(y)|$ is the Total Variation distance. This is unbiased (the output distribution is exactly $p$, regardless of draft quality, provided a rejected draft is replaced by a sample from the normalized residual $\max(p - q, 0)$) and, crucially, it is governed by how well $q$ overlaps $p$, not by how peaked $p$ is. That decouples it from entropy. So swapping greedy verification for rejection sampling is step one, and it’s already a large win in the high-entropy RL regime.
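A single-token sketch of this verification rule in plain Python. It follows the standard speculative-sampling recipe (accept with probability $\min(1, p/q)$, otherwise resample from the normalized residual $\max(p - q, 0)$); the function names are illustrative, and nothing here is taken from Bebop's or GLM-5.2's code.

```python
import random

def acceptance_rate(p, q):
    """Expected acceptance: sum_y min(p(y), q(y)) = 1 - d_TV(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def residual(p, q):
    """Normalized max(p - q, 0): the distribution to resample from after a rejection."""
    gaps = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(gaps)
    return [g / total for g in gaps] if total > 0 else list(p)

def speculative_sample(p, q, rng):
    """Draw one token distributed as p using a draft from q. Returns (token, accepted)."""
    tokens = range(len(p))
    draft = rng.choices(tokens, weights=q)[0]
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    return rng.choices(tokens, weights=residual(p, q))[0], False

def output_distribution(p, q):
    """Exact law of the returned token: accepted mass plus resampled mass."""
    reject_mass = 1.0 - acceptance_rate(p, q)
    res = residual(p, q)
    return [min(pi, qi) + reject_mass * ri for pi, qi, ri in zip(p, q, res)]
```

`output_distribution(p, q)` reproduces `p` for any draft `q`, which is the unbiasedness property; only the acceptance rate depends on how close the draft is.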

Why CE/KL is the wrong training loss here

If acceptance under rejection sampling is $1 - d_{\mathrm{TV}}(p,q)$, then you should train the draft to minimize TV distance. But conventional MTP heads are trained with cross-entropy / forward-KL, $D_{\mathrm{KL}}(p\|q)$. By Pinsker’s inequality,

d_{\mathrm{TV}}(p,q) \le \sqrt{\tfrac12 D_{\mathrm{KL}}(p\|q)},

so KL is only a loose upper bound on TV: minimizing it doesn’t efficiently minimize the quantity that actually sets your acceptance rate. The gradient structure makes the difference concrete. With respect to the draft logits $z_j$, CE/KL has gradient $\partial D_{\mathrm{KL}}/\partial z_j = q_j - p_j$: a uniform per-token mismatch that spends optimization budget on the entire vocabulary, including the irrelevant long tail. That uniform mismatch accumulates over an effective support of size $\approx \exp(\mathcal{H}(p))$, which is precisely why CE-trained acceptance also ends up entropy-dependent.
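How loose the Pinsker bound can be is easy to see numerically. In the sketch below the two draft distributions are invented for illustration: one is uniformly close to the target, the other nearly drops a token the target gives 10% mass.

```python
import math

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pinsker_bound(p, q):
    return math.sqrt(0.5 * kl(p, q))

p = [0.6, 0.3, 0.1]
drafts = {
    "close": [0.55, 0.33, 0.12],
    "tail-starved": [0.65, 0.3499, 0.0001],
}
report = {name: (tv(p, q), pinsker_bound(p, q)) for name, q in drafts.items()}
for name, (distance, bound) in report.items():
    print(f"{name}: d_TV={distance:.4f}  sqrt(KL/2)={bound:.4f}")
```

For the close draft the bound is nearly tight. For the tail-starved draft the KL term is dominated by one low-probability token, so the bound is several times larger than the TV distance that actually sets acceptance.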

The TV loss and its end-to-end form

So train against TV directly. The single-step TV loss is

\mathcal{L}_{\mathrm{TV}} = d_{\mathrm{TV}}(p,q) = 1 - \sum_{v\in\mathcal{V}} \min\big(p(v), q(v)\big),

with $p$ detached. Its gradient is

\frac{\partial \mathcal{L}_{\mathrm{TV}}}{\partial z_j} = -\,q_j\big[\mathbf{1}[q_j \le p_j] - S\big], \qquad S = \sum_v \mathbf{1}[q_v \le p_v]\,q_v,

which is proportional to $q_j$: it concentrates updates on tokens the draft already puts mass on and ignores the tail. This produces a probability-proportional mismatch $|q^*(v) - p(v)| \lesssim \delta\cdot p(v)$ instead of a uniform one, and that’s what decouples acceptance from entropy. The objective is also bounded: $\mathcal{L}_{\mathrm{TV}} \in [0, 1]$ and $|\partial\mathcal{L}_{\mathrm{TV}}/\partial z_j| \le q_j \le 1$. KL’s logit gradient $q_j - p_j$ is bounded too, but the KL loss itself grows without limit when $q$ puts vanishing mass on a token $p$ favors, which is the usual argument that TV trains more stably.
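The closed-form gradient can be sanity-checked against finite differences. This is a sketch on a made-up four-token example, chosen so that no $q_v$ equals $p_v$ (the loss is not differentiable at ties).

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tv_loss(p, logits):
    q = softmax(logits)
    return 1.0 - sum(min(pi, qi) for pi, qi in zip(p, q))

def tv_grad(p, logits):
    """Closed form: dL/dz_j = -q_j * (1[q_j <= p_j] - S)."""
    q = softmax(logits)
    under = [1.0 if qi <= pi else 0.0 for pi, qi in zip(p, q)]
    s = sum(u * qi for u, qi in zip(under, q))       # S = sum_v 1[q_v <= p_v] q_v
    return [-qi * (u - s) for qi, u in zip(q, under)]

def numeric_grad(p, logits, eps=1e-6):
    grad = []
    for i in range(len(logits)):
        up = logits[:i] + [logits[i] + eps] + logits[i + 1:]
        down = logits[:i] + [logits[i] - eps] + logits[i + 1:]
        grad.append((tv_loss(p, up) - tv_loss(p, down)) / (2 * eps))
    return grad

p = [0.5, 0.3, 0.15, 0.05]
logits = [0.0, 1.0, -1.0, -3.0]
analytic = tv_grad(p, logits)
numeric = numeric_grad(p, logits)
```

The two gradients agree to numerical precision, and each component is bounded in magnitude by the corresponding $q_j$.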

Now the “end-to-end” part. Expected acceptance length is a sum of cumulative products $\prod_{i \le j} \alpha_i$ of per-step rates, so optimizing the average single-step TV distance ignores the multiplicative structure (early-step errors kill every downstream term). Bebop’s end-to-end (e2e) TV loss optimizes the normalized expected acceptance length directly:

\mathcal{L}_{\mathrm{e2e}} = 1 - \frac{1}{\gamma}\sum_{j=1}^{\gamma}\prod_{i=1}^{j}\alpha_i = 1 - \frac{1}{\gamma}\sum_{j=1}^{\gamma}\prod_{i=1}^{j}\big(1 - d_{\mathrm{TV}}(p_i, q_i)\big).

Because each $\alpha_i$ appears in every product term $j \ge i$, earlier steps are weighted more heavily. And since the $\alpha_i$ depend on current draft quality, the loss acts as a dynamic step-weighting that automatically shifts emphasis to whichever step is currently bottlenecking acceptance. No hand-tuned per-head loss weights.
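The implicit step weights are just the partial derivatives $\partial\mathcal{L}_{\mathrm{e2e}}/\partial\alpha_i$. The sketch below computes them directly from the formula; the helper names and the example rates are illustrative.

```python
def e2e_loss(alphas):
    """1 - (1/gamma) * sum_j prod_{i<=j} alpha_i."""
    total, running = 0.0, 1.0
    for alpha in alphas:
        running *= alpha
        total += running
    return 1.0 - total / len(alphas)

def step_sensitivities(alphas):
    """dL_e2e / d alpha_i for each step i (all negative: raising alpha lowers the loss)."""
    gamma = len(alphas)
    grads = []
    for i in range(gamma):
        total, prefix = 0.0, 1.0
        for j in range(gamma):
            if j != i:
                prefix *= alphas[j]      # product over k <= j with k != i
            if j >= i:
                total += prefix          # only terms j >= i contain alpha_i
        grads.append(-total / gamma)
    return grads
```

For `alphas = [0.9, 0.8, 0.7]` the magnitudes come out as 2.36/3, 1.53/3, and 0.72/3, so the first step carries roughly three times the weight of the last without any manual per-head coefficients.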

Here is the whole thing in PyTorch — it’s strikingly small:

import torch
import torch.nn.functional as F

def tv_distance(p, q):
    # p, q: [..., V] probability distributions; p is detached (target)
    # d_TV = 1 - sum_v min(p, q)
    return 1.0 - torch.minimum(p, q).sum(dim=-1)

def e2e_tv_loss(target_logits_per_step, draft_logits_per_step):
    """
    target_logits_per_step: list of gamma tensors -- one [B, T, V] per step, holding
        the target distribution p_i that MTP head i must match (the caller does the
        position shift, so entry i is already aligned with draft entry i)
    draft_logits_per_step: list of gamma tensors  -- one [B, T, V] per MTP head
    Returns the scalar end-to-end (normalized expected accept-length) TV loss.
    """
    gamma = len(draft_logits_per_step)
    assert gamma > 0 and len(target_logits_per_step) == gamma

    alphas = []                                          # per-step accept rate alpha_i, [B, T]
    for p_logits, q_logits in zip(target_logits_per_step, draft_logits_per_step):
        p = F.softmax(p_logits, dim=-1).detach()         # stop-grad through target
        q = F.softmax(q_logits, dim=-1)
        alphas.append(1.0 - tv_distance(p, q))           # alpha_i = 1 - d_TV(p_i, q_i)

    # expected normalized accept length = (1/gamma) * sum_j prod_{i<=j} alpha_i
    cum, acc = 1.0, 0.0
    for alpha in alphas:
        cum = cum * alpha                                # prod_{i=1..j} alpha_i
        acc = acc + cum
    return (1.0 - acc / gamma).mean()                    # scalar; minimize -> maximize E[L]

One practical wrinkle Bebop flags: the TV min(p, q) is a full-vocabulary operation, and truncating it to a top-$K$ to save memory backfires: small $K$ causes loss spikes and instability, and even $K=20{,}000$ converges slower than the full-vocab loss. Pay for the full softmax here.

Figure from Li et al. (2026), arXiv:2606.12370: per-MTP-step acceptance for CE loss (solid) vs. e2e TV loss (dashed) over SFT training, across Math, Code and MT-Bench. TV achieves higher acceptance at every MTP step, with the largest gains on later steps and agentic tasks.

On Qwen3.5-35A3B with $\gamma=3$, switching CE→e2e TV lifts rejection-sampling acceptance across the board:

MTP loss	Math	Code	SWE	Agent	MT-Bench
CE (baseline, acceptance %)	75.0	71.3	75.1	90.3	65.3
e2e TV (ours, change vs. CE in points)	+3.0	+3.3	+8.0	+6.7	+2.3

and the absolute numbers reach roughly 95% and above at scale (e.g. Qwen3.6-Plus hits 99.1 on Agent). End to end, Bebop reports up to 25% extra inference throughput and 1.5–1.8× faster RL training (up to 2.4× on agentic RL), purely from a lightweight pre-RL MTP training phase, with no MTP co-training during RL needed.

KVShare: closing the train/inference gap in GLM-5.2’s MTP

GLM-5.2 productizes all of this and adds KVShare, plus applies IndexShare to the MTP module itself. Two design choices matter:

KVShare. In GLM-5.1 the MTP module’s KV cache was populated from mixed sources; in GLM-5.2 the MTP KV cache holds only the hidden states of the target model. The draft head therefore conditions on exactly the representation the target produces, eliminating a train/inference distribution gap that was quietly suppressing acceptance.
IndexShare on MTP. The indexer is placed on the first MTP step, and its top-$k$ indices are reused for all subsequent draft steps. This is the same cross-layer trick, now applied across draft steps, so the draft heads stay cheap.

The ablation (acceptance length, higher is better) shows each piece compounding:

Configuration	Acceptance length
Baseline	4.56
+ IndexShare + KVShare	5.10
+ Rejection Sampling	5.29
+ End-to-end TV loss	5.47 (+20%)

That +20% acceptance length is the headline MTP number in the GLM-5.2 blog, and you can read off exactly where it comes from: KVShare/IndexShare make the draft consistent and cheap, rejection sampling makes verification entropy-robust, and the e2e TV loss trains the draft to maximize the quantity rejection sampling actually rewards.
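The per-stage contributions can be read off the ablation table with a few lines of arithmetic. The list below restates the table; `stage_gains` is an illustrative helper.

```python
ablation = [
    ("Baseline", 4.56),
    ("+ IndexShare + KVShare", 5.10),
    ("+ Rejection Sampling", 5.29),
    ("+ End-to-end TV loss", 5.47),
]

def stage_gains(rows):
    """For each added stage: (name, absolute gain over previous row, cumulative % over baseline)."""
    base = rows[0][1]
    previous = base
    gains = []
    for name, value in rows[1:]:
        gains.append((name, value - previous, 100.0 * (value / base - 1.0)))
        previous = value
    return gains

for name, step, cumulative in stage_gains(ablation):
    print(f"{name}: +{step:.2f} tokens, {cumulative:.1f}% over baseline")
```

The first stage contributes the largest absolute step (0.54 tokens), with rejection sampling and the e2e TV loss adding 0.19 and 0.18 on top.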

Training: slime, OPD merging, and long-horizon RL

GLM-5.2 is built for long-horizon agentic work, and the post-training stack is organized around that goal, on Zhipu’s open-source slime RL framework.

Expert merging via parallel OPD. Rather than training one monolith, the team trained more than ten expert models (each strong in a domain) and merged them into the final model with parallel On-Policy Distillation (OPD) — the whole merge took ~2 days. OPD is the dense, on-policy distillation objective (roll out from the student, supervise toward an expert teacher with a per-token reverse-KL); if you want the theory of why it’s so sample-efficient versus RL, see the OPD-vs-RL post. slime exposes the rollout interface needed for this: white-box and black-box rollout, compact trajectory (trajectory compaction for long episodes), and sub-agent workflow modes.

Critic-based PPO for long horizons. Group-relative methods (GRPO and friends) assume comparable, similar-length rollouts in a group — which breaks down once trajectory compaction produces variable-length fragments. GLM-5.2 shifts to a critic-based PPO formulation: learn from individual rollouts and use a learned critic for token-level advantage estimation, which handles ragged, compacted long-horizon traces that group-wise baselines can’t.
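The release does not spell out the exact advantage estimator, so the following is an assumption: Generalized Advantage Estimation (GAE), the usual choice for critic-based PPO, sketched for a single rollout of arbitrary length. The point is that it needs only that rollout's rewards and the critic's value predictions, with no group of comparable trajectories.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Token-level GAE for one rollout.

    rewards: per-token rewards, length T
    values:  critic predictions, length T + 1 (last entry is the bootstrap
             value after the final token; use 0.0 for a finished episode)
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `gamma = lam = 1` this reduces to "return minus critic value" at every token, and with `lam = 0` to the one-step TD residual; rollouts of different lengths are handled independently.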

Online anti-hacking. Long-horizon agentic RL is fertile ground for reward hacking (e.g. an agent that games the verifier rather than solving the task). GLM-5.2 runs a two-stage guard: a fast rule-based filter flags candidate hacks, then an LLM judge checks intent. Detected actions are blocked online while the rollout continues — so a single hacked action doesn’t poison the trajectory or destabilize training, and you don’t have to throw the whole rollout away.
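The control flow of such a two-stage guard might look like the sketch below. Everything in it is hypothetical: the rule functions, the `judge` callable standing in for an LLM judge, and the string-valued actions are placeholders, not slime's interface.

```python
def should_block(action, rules, judge):
    """Stage 1: cheap rules flag candidates. Stage 2: the judge decides on flagged ones."""
    if not any(rule(action) for rule in rules):
        return False                 # nothing suspicious, skip the expensive judge
    return bool(judge(action))

def filter_rollout(actions, rules, judge):
    """Block hacked actions one by one while the rest of the rollout continues."""
    kept, blocked = [], []
    for action in actions:
        if should_block(action, rules, judge):
            blocked.append(action)
        else:
            kept.append(action)
    return kept, blocked
```

The judge only runs on actions the rules flag, which keeps the guard cheap enough to sit inside the rollout loop, and a blocked action is dropped without discarding the trajectory around it.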

Serving the 1M context. On the inference side, beyond IndexShare the engine adds finer-grained memory management and parallelism built on LayerSplit, kernel optimizations for context-dependent operations, and CPU-side cache management — and the throughput advantage grows as context length increases, which is the right shape for a 1M-token model.

Benchmarks

The full comparison from the GLM-5.2 release (* denotes figures reported with tools/caveats in the original table):

Benchmark	GLM-5.2	GLM-5.1	Qwen3.7-Max	MiniMax M3	DeepSeek-V4-Pro	Claude Opus 4.8	GPT-5.5	Gemini 3.1 Pro
HLE	40.5	31	41.4	37	37.7	49.8*	41.4*	45
HLE (w/ Tools)	54.7	52.3	53.5	—	48.2	57.9*	52.2*	51.4*
CritPt	16.7	4.6	13.4	3.7	12.9	20.9	27.1	17.7
AIME 2026	99.2	95.3	97	—	94.6	95.7	98.3	98.2
HMMT Nov. 2025	94.4	94	95	84.4	94.4	96.5	96.5	94.8
HMMT Feb. 2026	92.5	82.6	97.1	84.4	95.2	96.7	96.7	87.3
IMOAnswerBench	91.0	83.8	90	—	89.8	83.5	—	81
GPQA-Diamond	91.2	86.2	90	93	90.1	93.6	93.6	94.3
SWE-bench Pro	62.1	58.4	60.6	59	55.4	69.2	58.6	54.2
NL2Repo	48.9	42.7	47.2	42.1	35.5	69.7	50.7	33.4
DeepSWE	46.2	18	18	20	8	58	70	10
ProgramBench	63.7	50.9	—	—	47.8	71.9	70.8	39.5
Terminal Bench 2.1 (Terminus-2)	81.0	63.5	75	65	64	85	84	74
Terminal Bench 2.1 (Best Harness)	82.7	69	—	—	—	78.9	83.4	70.7
FrontierSWE (Dominance)	74.4	30.5	—	—	29.0	75.1	72.6	39.6
PostTrainBench	34.3	20.1	—	—	—	37.2	28.4	21.6
SWE-Marathon	13.0	1.0	—	—	—	26.0	12.0	4.0
MCP-Atlas (Public)	76.8	71.8	76.4	74.2	73.6	77.8	75.3	69.2
Tool-Decathlon	48.2	40.7	—	—	52.8	59.9	55.6	48.8

The story the table tells: GLM-5.2 is at or near the closed-source frontier on math (AIME 2026 99.2) and competitive on agentic coding (Terminal Bench 2.1 81.0, FrontierSWE 74.4 — a 2.4× jump over GLM-5.1’s 30.5), trailing Claude Opus 4.8 on the hardest SWE marathons but doing so as an open-weight MIT model at a fraction of the serving cost. The generation-over-generation deltas (e.g. DeepSWE 18→46.2, SWE-Marathon 1.0→13.0) are where the long-horizon RL stack shows up.

Takeaways

GLM-5.2’s three innovations rhyme: each one identifies a quantity that was being recomputed or mis-optimized, and fixes it at the source.

IndexShare notices the sparse-attention indexer was redundantly re-deciding the same top-$k$ in every layer, and shares it across layers: $2.9\times$ fewer per-token FLOPs at 1M context, validated up to 744B.
Rejection sampling + the e2e TV loss notice that MTP acceptance was being capped by entropy and trained against the wrong divergence (KL instead of TV) — fixing both makes speculative decoding survive the high-entropy RL regime, and is what lets MTP accelerate RL training rather than just inference.
KVShare notices the MTP draft was conditioning on a slightly-wrong representation, and feeds it the target’s own hidden states.

Put together, and wrapped in slime’s OPD merging, critic-based PPO, and online anti-hacking, they’re why a 753B open-weight model can serve a stable, cheap 1M-token context and compete at the top of the agentic-coding leaderboards. The weights and the two companion papers are all public, which is the best part: you can read exactly how it was done.

Sources. GLM-5.2 technical blog (Z.ai) · IndexCache, arXiv:2603.12201 · Bebop / e2e TV loss, arXiv:2606.12370 · GLM-5.2 weights on Hugging Face