Tag: attention
All the articles with the tag "attention".
-
Attention Residuals: Softmax Attention Over Depth
A deep dive into the Kimi team's Attention Residuals (AttnRes) — replacing the fixed-weight residual connection with learned softmax attention over depth. Covers the time–depth duality, Full vs Block AttnRes, the structured-matrix view that unifies prior residual variants, the pipeline-parallel infra that makes it practical, and the scaling-law and 48B-MoE results.
-
Hybrid Attention and MLA: The Tradeoff
A side-by-side dive into Xiaomi MiMo's hybrid sliding-window/global attention and DeepSeek's Multi-head Latent Attention. The two answer the same question — how to make attention affordable at long context — with very different bets, and those bets shape everything from training infra to KV cache size.
-
Inside DeepSeek's Sparse Attention: From NSA to DSA
A deep dive into DeepSeek's two sparse attention designs — Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA) — covering the math, the hardware story, and why DSA in V3.2 looks so different from NSA.