Tag: speculative-decoding
All the articles with the tag "speculative-decoding".
- Inside GLM-5.2: IndexShare, KVShare, and the End-to-End TV Loss
A deep dive into GLM-5.2, a 753B open-weight MoE that serves a 1M-token context. We walk through the three innovations that make it cheap to run: IndexShare (cross-layer sparse-attention index reuse), KVShare + rejection sampling for speculative decoding, and a novel end-to-end TV loss that breaks the entropy bound on MTP acceptance. We also cover the slime RL stack behind its long-horizon agentic skills.