Tag: attention
All the articles with the tag "attention".
-
Inside GLM-5.2: IndexShare, KVShare, and the End-to-End TV Loss
A deep dive into GLM-5.2 — a 753B open-weight MoE that serves a 1M-token context. We walk the three innovations that make it cheap to run: IndexShare (cross-layer sparse-attention index reuse), KVShare + rejection sampling for speculative decoding, and a novel end-to-end TV loss that breaks the entropy bound on MTP acceptance. Plus the slime RL stack behind its long-horizon agentic skills.
-
Attention Residuals: Softmax Attention Over Depth
A deep dive into the Kimi team's Attention Residuals (AttnRes) — replacing the fixed-weight residual connection with learned softmax attention over depth. Covers the time–depth duality, Full vs Block AttnRes, the structured-matrix view that unifies prior residual variants, the pipeline-parallel infra that makes it practical, and the scaling-law and 48B-MoE results.
-
Hybrid Attention and MLA: The Tradeoff
A side-by-side dive into Xiaomi MiMo's hybrid sliding-window/global attention and DeepSeek's Multi-head Latent Attention. The two answer the same question — how to make attention affordable at long context — with very different bets, and those bets shape everything from training infra to KV cache size.
-
Inside DeepSeek's Sparse Attention: From NSA to DSA
A deep dive into DeepSeek's two sparse attention designs — Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA) — covering the math, the hardware story, and why DSA in V3.2 looks so different from NSA.