Almost every component of a modern transformer has, over the past few years, traded fixed behavior for learned, input-dependent behavior. Sequence mixing went from fixed convolutions to attention. Feed-forward capacity went from one dense MLP to input-routed experts. Even normalization has acquired learned scales and placements. One component sat untouched through all of it: the residual connection. The update

$$\mathbf{h}_l = \mathbf{h}_{l-1} + f_{l-1}(\mathbf{h}_{l-1})$$

has been copied verbatim from ResNet into essentially every LLM, and it still aggregates layer outputs with fixed unit weights.

The Kimi team’s Attention Residuals (AttnRes, arXiv:2603.15031) is the argument that this last fixed component should be made learned and content-dependent too — and that the right way to do it is softmax attention over the depth axis. The framing is a duality: residual connections are to depth what RNNs were to sequence, and AttnRes is the same RNN→Transformer move applied to depth instead of time. This post walks the math, the structured-matrix theory that unifies it with prior residual variants, the infrastructure that makes it survive pipeline parallelism, and the experiments — including a 48B-parameter MoE pretrained on 1.4T tokens.

Why touch the residual connection at all?

Two roles of the residual connection are worth separating.

The famous one is the gradient highway. Backpropagating through the residual recurrence gives

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \cdot \prod_{j=l}^{L-1}\left(\mathbf{I} + \frac{\partial f_j}{\partial \mathbf{h}_j}\right),$$

and expanding the product always leaves a bare $\mathbf{I}$ term: a direct path from the loss to any layer, regardless of depth. This is what lets us train deep networks at all, and AttnRes does not want to lose it.

The less-discussed role is as a depth-wise aggregation rule. Unroll the recurrence and the hidden state entering layer $l$ is

$$\mathbf{h}_l = \mathbf{h}_1 + \sum_{i=1}^{l-1} f_i(\mathbf{h}_i),$$

i.e. the token embedding plus a uniformly weighted sum of every preceding layer output. Every layer receives the same flat average of everything below it. There is no mechanism to emphasize one earlier layer over another, and no way for an attention layer and an MLP layer to ask for different mixtures of the past.

Under PreNorm, the dominant placement in modern LLMs, this flat accumulation has a concrete pathology. Because each layer’s output is added (not normalized) into the stream, $\lVert \mathbf{h}_l \rVert$ grows as $O(L)$ with depth. The normalization sits before each sublayer, so each $f_l$ sees a unit-scale input but writes into an ever-larger residual. The relative contribution of any single layer therefore shrinks with depth, and later layers must learn ever-larger outputs just to stay audible. This is PreNorm dilution, and it shows up empirically as the well-known result that you can prune a large fraction of an LLM’s middle layers with little loss: their contributions were being drowned out anyway.

The paper names three concrete limitations of single-state residuals (fixed or gated):

  1. No selective access. Different layer types (attention vs. MLP) get the same aggregated state, even when they would benefit from different weightings.
  2. Irreversible loss. Information blurred together by summation cannot be selectively recovered later.
  3. Output growth. Later layers inflate their outputs to gain influence over the accumulated residual, which destabilizes training.

These are exactly the symptoms RNNs had on the sequence axis before attention. Which is the whole idea.

The duality of time and depth

Lay the two recurrences side by side:

| | Sequence axis (RNN) | Depth axis (Residual) |
|---|---|---|
| State | $\mathbf{s}_t$ | $\mathbf{h}_l$ |
| Update | $\mathbf{s}_t = g(\mathbf{s}_{t-1}, \mathbf{x}_t)$ | $\mathbf{h}_l = \mathbf{h}_{l-1} + f_{l-1}(\mathbf{h}_{l-1})$ |
| Bottleneck | one state compresses all past tokens | one state compresses all past layers |
| Fix | attention: selectively read all positions | AttnRes: selectively read all layers |

On the sequence side, the Transformer’s answer to RNN compression was to let each position attend over all previous positions with data-dependent weights. AttnRes proposes the identical move over depth:

$$\mathbf{h}_l = \alpha_{0 \to l}\,\mathbf{h}_1 + \sum_{i=1}^{l-1} \alpha_{i \to l}\, f_i(\mathbf{h}_i), \qquad \sum_{i=0}^{l-1}\alpha_{i\to l} = 1,$$

where $\alpha_{i\to l}$ are learned, input-dependent attention weights. The crucial observation that makes this affordable: while sequence length reaches millions of tokens, network depth is modest ($L < 1000$ in any real model). So $O(L^2)$ attention over depth, which would be unthinkable over a long sequence, is essentially free.

Full Attention Residuals

Write the weights as a normalized kernel $\alpha_{i\to l} = \phi(\mathbf{q}_l, \mathbf{k}_i) / \sum_j \phi(\mathbf{q}_l, \mathbf{k}_j)$. The choice of $\phi$ determines which residual variant you get (more on this in the structured-matrix section); AttnRes picks the exponential kernel with a normalized key, giving plain softmax attention over depth:

$$\phi(\mathbf{q}, \mathbf{k}) = \exp\!\left(\mathbf{q}^\top \mathrm{RMSNorm}(\mathbf{k})\right), \qquad \alpha_{i \to l} = \frac{\phi(\mathbf{q}_l, \mathbf{k}_i)}{\sum_{j=0}^{l-1}\phi(\mathbf{q}_l, \mathbf{k}_j)}.$$

The queries, keys, and values are deliberately minimal:

$$\mathbf{q}_l = \mathbf{w}_l, \qquad \mathbf{k}_i = \mathbf{v}_i = \begin{cases} \mathbf{h}_1 & i = 0 \\ f_i(\mathbf{h}_i) & 1 \le i \le l-1, \end{cases}$$

so the input to layer $l$ is just $\mathbf{h}_l = \sum_{i=0}^{l-1} \alpha_{i\to l}\,\mathbf{v}_i$.

Three design decisions deserve attention:

  • The query is a bare learned vector $\mathbf{w}_l \in \mathbb{R}^d$, one per layer: a “pseudo-query.” There is no query projection of the current hidden state. This is the entire parameter cost of AttnRes: one $d$-vector per layer plus one RMSNorm. It looks like a limitation but is the key to the infrastructure story below: because $\mathbf{w}_l$ doesn’t depend on layer $l$’s forward computation, the attention weights for a whole group of layers can be computed in parallel, without waiting for each layer’s output in sequence.
  • Keys and values are the same thing — the raw layer outputs. No separate key/value projections.
  • RMSNorm sits inside the kernel, on the keys. Without it, a layer that naturally produces large-magnitude outputs would dominate the softmax purely on scale. Normalizing the keys forces the selection to be about direction/content, not magnitude. The ablation confirms this matters: removing it costs ~0.006 loss in Full AttnRes (1.737 → 1.743) and ~0.004 in Block (1.746 → 1.750).

One more practical detail that the paper flags as essential: all pseudo-queries $\mathbf{w}_l$ must be initialized to zero. Zero queries make every $\phi(\mathbf{0}, \mathbf{k}_i) = 1$, so the initial attention is uniform. AttnRes starts life as exactly the equal-weight average, i.e. as a standard residual, and learns to deviate. Without this, early training is volatile.
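To make the read concrete, here is a minimal pure-Python sketch of the Full AttnRes mixing step for a single token. It is an illustration under stated assumptions, not the paper's code: `rms_norm` has no learned gain, the vectors are tiny hand-picked examples, and the names `rms_norm` and `full_attn_res` are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm with the gain fixed to 1 (an assumption for this sketch)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def full_attn_res(query, sources):
    """Mix earlier-layer outputs with softmax attention over depth.

    query:   pseudo-query w_l, a list of d floats
    sources: [h_1, f_1(h_1), ..., f_{l-1}(h_{l-1})], each a list of d floats
    Returns (mixed hidden state, attention weights alpha).
    """
    # Keys are the RMS-normalized sources; values are the raw sources.
    logits = [sum(q * k for q, k in zip(query, rms_norm(s))) for s in sources]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    alpha = [e / total for e in exps]
    d = len(query)
    mixed = [sum(a * s[j] for a, s in zip(alpha, sources)) for j in range(d)]
    return mixed, alpha


# Three sources with very different magnitudes.
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 10.0, 0.0, 0.0], [0.0, 0.0, 0.1, 0.0]]

# Zero-initialized pseudo-query: every logit is 0, so the weights are uniform.
_, alpha_init = full_attn_res([0.0] * 4, sources)

# A query aligned with the third source's direction selects it,
# even though that source has the smallest norm.
_, alpha_dir = full_attn_res([0.0, 0.0, 3.0, 0.0], sources)
```

The second call shows what the key normalization is for: selection follows direction, so the large-norm second source cannot win on scale alone.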

What it costs

Per token, Full AttnRes needs $O(L^2 d)$ arithmetic and $O(Ld)$ memory to hold the layer outputs. The arithmetic is genuinely negligible because $L \ll T$, where $T$ is the sequence length. The memory is the catch, but in vanilla training it’s free, because those layer outputs are already retained for backprop. So on a single device, Full AttnRes is nearly a free lunch.

The problem is scale. At scale you use activation recomputation (which would otherwise free and recompute those outputs) and pipeline parallelism (which splits layers across devices). Now every layer output must be (a) kept alive instead of recomputed, and (b) transmitted across stage boundaries. Both blow up to $O(Ld)$, and the cross-stage communication, not the arithmetic, is the real wall. This is what Block AttnRes is for.

Block Attention Residuals

The idea: don’t attend over all $L$ individual layer outputs. Partition the $L$ layers into $N$ blocks of $S = L/N$ layers, sum within a block to get one representation per block, and attend over only the $N$ block summaries (plus the embedding). Memory and communication drop from $O(Ld)$ to $O(Nd)$.

Intra-block accumulation. Within block $n$, layer outputs accumulate by ordinary summation:

$$\mathbf{b}_n = \sum_{j \in \mathcal{B}_n} f_j(\mathbf{h}_j), \qquad \mathbf{b}_n^{i} = \text{partial sum over the first } i \text{ layers of } \mathcal{B}_n.$$

So inside a block, you’re back to a standard residual; the learned attention only kicks in across block boundaries.

Inter-block attention. Set $\mathbf{b}_0 = \mathbf{h}_1$ (the embedding is always an available source). For the $i$-th layer in block $n$, the value set is:

$$\mathbf{V} = \begin{cases} [\mathbf{b}_0, \mathbf{b}_1, \ldots, \mathbf{b}_{n-1}] & i = 1 \text{ (first layer of the block)} \\ [\mathbf{b}_0, \mathbf{b}_1, \ldots, \mathbf{b}_{n-1}, \mathbf{b}_n^{i-1}] & i \ge 2 \text{ (later layers also see the evolving partial sum)} \end{cases}$$

with the same kernel and normalization as Full AttnRes. So the first layer of a block sees the completed earlier blocks; subsequent layers additionally attend to the partial sum building up within the current block. The final output layer aggregates all $N$ block representations.

The block count $N$ is a clean interpolation knob:

  • $N = L$ (one layer per block) recovers Full AttnRes exactly.
  • $N = 1$ reduces to a standard residual, with the embedding isolated as $\mathbf{b}_0$.

Empirically $N \approx 8$ recovers most of the Full AttnRes benefit while storing only eight hidden states per token.

Here is the paper’s PyTorch-style sketch. Note how block_attn_res is just a softmax over stacked block reps using the learned projection weight as the query, and how forward threads partial_block ($\mathbf{b}_n^i$) and blocks ($[\mathbf{b}_0,\dots,\mathbf{b}_{n-1}]$) through one transformer layer:

```python
import torch


def block_attn_res(blocks, partial_block, proj, norm):
    # blocks: N tensors [B, T, D] (completed block reps, incl. token embedding)
    # partial_block: [B, T, D]  (intra-block partial sum b_n^i)
    V = torch.stack(blocks + [partial_block])   # values: [N+1, B, T, D]
    K = norm(V)                                 # RMSNorm on keys only
    # proj.weight, squeezed to [D], plays the role of the pseudo-query w_l
    logits = torch.einsum('d,nbtd->nbt', proj.weight.squeeze(), K)
    # softmax over the source axis (dim 0), then mix the un-normalized values
    h = torch.einsum('nbt,nbtd->btd', logits.softmax(0), V)
    return h


def forward(self, blocks, hidden_states):
    # Method of a transformer layer module; hidden_states is the running partial sum.
    partial_block = hidden_states
    # --- attention sublayer ---
    h = block_attn_res(blocks, partial_block, self.attn_res_proj, self.attn_res_norm)
    # block_size counts sublayers (attention + MLP), so one layer spans two of them
    if self.layer_number % (self.block_size // 2) == 0:   # block boundary
        blocks.append(partial_block)
        partial_block = None
    attn_out = self.attn(self.attn_norm(h))
    partial_block = attn_out if partial_block is None else partial_block + attn_out
    # --- MLP sublayer ---
    h = block_attn_res(blocks, partial_block, self.mlp_res_proj, self.mlp_res_norm)
    mlp_out = self.mlp(self.mlp_norm(h))
    partial_block = partial_block + mlp_out
    return blocks, partial_block
```

The structured-matrix view: linear vs. softmax attention over depth

This is the theoretical heart of the paper, and the part that elevates AttnRes from “a trick” to “a unifying lens.” Define a depth mixing matrix $\mathbf{M} \in \mathbb{R}^{L\times L}$ where $\mathbf{M}_{i\to l}$ is the weight layer $l$ assigns to source $i$, so that $\mathbf{h}_l = \sum_{i=0}^{l-1}\mathbf{M}_{i\to l}\,\mathbf{v}_i$. Each residual variant covered here is some choice of $\mathbf{M}$, and they differ in how the entries arise (fixed / learned-static / input-dependent) and in the semiseparable rank of $\mathbf{M}$.

Standard residual. Unrolling gives $\mathbf{M}_{i\to l}=1$ for all $i<l$: the all-ones lower-triangular matrix.

$$\begin{bmatrix}\mathbf{h}_1\\\mathbf{h}_2\\\vdots\\\mathbf{h}_L\end{bmatrix} = \begin{bmatrix} 1 & & & \\ 1 & 1 & & \\ \vdots & \vdots & \ddots & \\ 1 & 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix}\mathbf{v}_0\\\mathbf{v}_1\\\vdots\\\mathbf{v}_{L-1}\end{bmatrix}$$

This is 1-semiseparable — the lowest-rank structure possible.

Highway. With (scalar) carry gates, defining the carry product $\gamma_{i\to l}^{\times} = \prod_{j=i+1}^{l}(1-g_j)$, the weights are $\mathbf{M}_{i\to l} = g_{i+1}\,\gamma_{i+1\to l}^{\times}$ for $i \ge 1$, with the embedding ($i = 0$) carried through every gate: $\mathbf{M}_{0\to l} = \gamma_{1\to l}^{\times}$. Because everything factors through scalar gates, $\mathbf{M}$ is still 1-semiseparable: same rank as the plain residual, just input-dependent. (The weights also sum to one, making Highway a softmax-free, depth-wise instance of stick-breaking attention.)
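The sum-to-one claim can be checked numerically. The sketch below assumes the usual scalar-gated Highway recurrence $\mathbf{h}_j = (1-g_j)\,\mathbf{h}_{j-1} + g_j\, f_{j-1}(\mathbf{h}_{j-1})$ and gives the embedding the weight $\gamma_{1\to l}^{\times}$; both are assumptions chosen to match the indexing of the closed form above, and the function names and gate values are hypothetical.

```python
def highway_weights(gates, l):
    """Closed-form weights M_{i->l} for sources i = 0..l-1.

    gates: dict mapping layer index j (2..l) to a scalar gate g_j in [0, 1]
    """
    def carry(i):
        # gamma_{i->l} = prod_{j=i+1}^{l} (1 - g_j); the empty product is 1
        out = 1.0
        for j in range(i + 1, l + 1):
            out *= 1.0 - gates[j]
        return out

    weights = [carry(1)]  # embedding (i = 0) passes through every gate
    weights += [gates[i + 1] * carry(i + 1) for i in range(1, l)]
    return weights


def unroll(gates, l):
    """Run the assumed recurrence, tracking each source's coefficient."""
    coeffs = [1.0] + [0.0] * (l - 1)  # h_1 is exactly source 0
    for j in range(2, l + 1):
        coeffs = [(1.0 - gates[j]) * c for c in coeffs]  # carry the old state
        coeffs[j - 1] += gates[j]                        # admit source j-1
    return coeffs


gates = {2: 0.3, 3: 0.5, 4: 0.9, 5: 0.2}
closed_form = highway_weights(gates, 5)
simulated = unroll(gates, 5)
```

With these gates the closed form and the unrolled recurrence agree entry by entry, and the five weights add up to 1, which is the stick-breaking property the text mentions.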

(m)HC / Hyper-Connections. Maintain $m$ parallel streams $\mathbf{H}_l \in \mathbb{R}^{d\times m}$ updated by a learned transition matrix $\mathbf{A}_l \in \mathbb{R}^{m\times m}$. Unrolling yields

$$\mathbf{M}_{i\to l} = \boldsymbol{\beta}_i^\top\, \mathbf{A}_{i+1\to l}^{\times}\, \boldsymbol{\alpha}_l, \qquad \mathbf{A}_{i\to j}^{\times} = \prod_{k=i+1}^{j}\mathbf{A}_k,$$

which is $m$-semiseparable. The $m$ streams are exactly state expansion along the depth axis, the same trick that takes a recurrent state from $d$ to $d\times m$ on the sequence side.

Full AttnRes. $\mathbf{M}_{i\to l} = \alpha_{i\to l}$ from a softmax over input-dependent keys, giving a dense, rank-$L$ $\mathbf{M}$.

Block AttnRes. Sources in a completed block share that block’s key/value, so they share a weight; the current block contributes one extra distinct source via its partial sum. The effective rank lands between $N$ and $N+S$, interpolating between standard residual ($N{=}1$) and Full AttnRes ($N{=}L$).

| Method | Mixing matrix $\mathbf{M}$ | Weights | Rank |
|---|---|---|---|
| Standard residual | all-ones lower-triangular | fixed | 1-semiseparable |
| Highway | cumulative gate products | input-dependent | 1-semiseparable |
| (m)HC | $m$ streams, transition matrices | input-dependent | $m$-semiseparable |
| Full AttnRes | dense softmax weights | input-dependent | rank-$L$ (full) |
| Block AttnRes | block-structured | input-dependent | between $N$ and $N+S$ |

The punchline is a clean reframing. Look at the (m)HC weight $\mathbf{M}_{i\to l} = \boldsymbol{\beta}_i^\top \mathbf{A}_{i+1\to l}^{\times}\boldsymbol{\alpha}_l$: $\boldsymbol{\alpha}_l$ is a query, $\boldsymbol{\beta}_i$ is a key, and the cumulative transition $\mathbf{A}^{\times}$ is a depth-relative positional operator. That is precisely the linear attention form (a separable query–key product, accumulated by a recurrence). So:

Prior residual generalizations — Highway, Hyper-Connections, (m)HC — are all depth-wise linear attention. AttnRes is depth-wise softmax attention. The progression over depth (residual → gated recurrence → softmax attention) is the same progression that played out over sequence (RNN → linear attention → Transformer).

The structured-matrix view also pays off diagnostically: the input-dependent $\mathbf{M}$ exposes depth-wise attention sinks, specific layers that consistently attract weight regardless of input, mirroring the token-level attention-sink phenomenon. And whenever the kernel factorizes as $\phi(\mathbf{q},\mathbf{k}) = \varphi(\mathbf{q})^\top\varphi(\mathbf{k})$, depth-wise attention collapses back into a recurrence, which is exactly why the linear-attention residual variants exist as special cases.
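The collapse into a recurrence is easy to see on a toy example. The sketch below assumes the feature vectors $\varphi(\cdot)$ are already computed and strictly positive (so the normalizer is well defined); the function names and numbers are hypothetical. It compares the explicit normalized-kernel read with one that only keeps a running state.

```python
def explicit_read(phi_q, phi_keys, values):
    """alpha_i = (phi_q . phi_k_i) / sum_j (phi_q . phi_k_j), then mix values."""
    scores = [sum(a * b for a, b in zip(phi_q, k)) for k in phi_keys]
    total = sum(scores)
    d = len(values[0])
    return [sum(s * v[j] for s, v in zip(scores, values)) / total
            for j in range(d)]


def recurrent_read(phi_q, phi_keys, values):
    """Same read from a running state updated one source (layer) at a time."""
    r, d = len(phi_q), len(values[0])
    state = [[0.0] * d for _ in range(r)]  # accumulates phi(k_i) v_i^T
    norm = [0.0] * r                       # accumulates phi(k_i)
    for k, v in zip(phi_keys, values):
        for a in range(r):
            norm[a] += k[a]
            for j in range(d):
                state[a][j] += k[a] * v[j]
    denom = sum(q * n for q, n in zip(phi_q, norm))
    return [sum(phi_q[a] * state[a][j] for a in range(r)) / denom
            for j in range(d)]


phi_q = [1.0, 2.0]
phi_keys = [[0.5, 1.0], [2.0, 0.1], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out_explicit = explicit_read(phi_q, phi_keys, values)
out_recurrent = recurrent_read(phi_q, phi_keys, values)
```

The recurrent version never stores individual sources, only a fixed-size state. The softmax kernel does not factorize this way, which is why Full AttnRes has to keep the layer outputs around.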

Infrastructure: making it survive pipeline parallelism

A clever residual that can’t be trained at scale is a footnote. The infra section is where Block AttnRes earns “drop-in replacement.”

Training: cross-stage caching

Under an interleaved pipeline schedule with $P$ physical stages and $V$ virtual stages, Block AttnRes needs all accumulated block reps at each stage to do inter-block attention. The naive approach re-sends the entire block history at every stage transition. With $C = PV$ chunks, where $N_p$ is the number of blocks each chunk adds, chunk $j$ carries $jN_p$ blocks, so the naive per-token communication is

$$\mathrm{Comm}_{\text{naive}} = \sum_{j=1}^{C-1} jN_p d = \frac{C(C-1)}{2}N_p d.$$

The fix is cross-stage caching: each physical stage processes multiple virtual stages in succession, so blocks it received earlier are still in local memory and need not be re-transmitted. Only the incremental blocks since the receiver’s previous chunk cross the wire:

$$\mathrm{Comm}_{\text{cached}} = \underbrace{\tfrac{P(P-1)}{2}N_p d}_{\text{first virtual stage}} + \underbrace{(V-1)P^2 N_p d}_{\text{later virtual stages}}.$$

This cuts the peak per-transition cost from $O(C)$ to $O(P)$, a $V\times$ improvement, which is enough to fully overlap with compute in steady-state 1F1B. The backward pass benefits identically. Because activation checkpointing erases all the inter-block attention intermediates and the checkpointed input matches the size of the hidden state it replaces, per-layer activation memory is unchanged from a standard architecture. Measured end-to-end training overhead under pipeline parallelism: < 4% (and negligible without pipeline parallelism).
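As a quick sanity check on the two communication formulas, here is a small sketch that evaluates them for one configuration. The values of `P`, `V`, `N_p`, and `d` are hypothetical, picked only to keep the arithmetic readable; the function names are illustrative.

```python
def comm_naive(P, V, N_p, d):
    """Per-token volume when every transition re-sends the full block history."""
    C = P * V
    return sum(j * N_p * d for j in range(1, C))


def comm_cached(P, V, N_p, d):
    """Per-token volume with cross-stage caching, per the formula in the text."""
    first_virtual_stage = P * (P - 1) // 2 * N_p * d
    later_virtual_stages = (V - 1) * P * P * N_p * d
    return first_virtual_stage + later_virtual_stages


# Hypothetical schedule: 4 physical stages, 4 virtual stages, unit-size blocks.
P, V, N_p, d = 4, 4, 1, 1
naive = comm_naive(P, V, N_p, d)
cached = comm_cached(P, V, N_p, d)
```

For this configuration the naive total is 120 units and the cached total is 54, and the gap widens as $V$ grows because the naive sum is quadratic in $C = PV$.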

Inference: the two-phase strategy

Layer-wise AttnRes at decode time looks just like autoregressive decoding, with block reps playing the role of a KV cache reused across layers. A naive implementation re-reads all preceding blocks at every layer: $O(L\cdot N)$ memory accesses. The two-phase strategy exploits the fact that the pseudo-queries are decoupled from the forward pass, so all $S$ queries in a block can be batched:

  • Phase 1: parallel inter-block attention. Batch all $S$ block-layer queries into a single matmul against the cached block reps, returning outputs and softmax statistics (max, log-sum-exp). This amortizes the read cost from $S$ passes to one.
  • Phase 2: sequential intra-block attention. For each layer, attend over the evolving partial sum, then merge with the Phase-1 result via online softmax. The merge is elementwise, so it fuses cleanly with surrounding ops (e.g. RMSNorm).

The merge is the standard streaming-softmax combination, exact (not an approximation):

$$\mathbf{h}_l = \frac{e^{m^{(1)}-m}\mathbf{o}^{(1)} + e^{m^{(2)}-m}\mathbf{o}^{(2)}}{e^{m^{(1)}-m}\ell^{(1)} + e^{m^{(2)}-m}\ell^{(2)}}, \qquad m = \max(m^{(1)}, m^{(2)}).$$
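The exactness of the merge can be verified directly. This is a minimal sketch that assumes each phase returns its local max $m^{(k)}$, the sum of shifted exponentials $\ell^{(k)}$, and an unnormalized output $\mathbf{o}^{(k)}$; that reading of the symbols, the function names, and the toy scores are assumptions, not the paper's kernel.

```python
import math


def partial_softmax(scores, values):
    """Return (m, l, o): the max score, sum_i exp(s_i - m), and the
    unnormalized output sum_i exp(s_i - m) * v_i."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    d = len(values[0])
    o = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)]
    return m, sum(w), o


def merge(part1, part2):
    """Combine two partial softmax results into the exact joint output."""
    (m1, l1, o1), (m2, l2, o2) = part1, part2
    m = max(m1, m2)
    a1, a2 = math.exp(m1 - m), math.exp(m2 - m)  # rescale to the shared max
    denom = a1 * l1 + a2 * l2
    return [(a1 * x + a2 * y) / denom for x, y in zip(o1, o2)]


# Phase 1 covers the completed blocks; Phase 2 covers the current partial sum.
block_scores, block_values = [0.2, -1.0, 3.0], [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
partial_score, partial_value = [1.5], [[-1.0, 4.0]]

merged = merge(partial_softmax(block_scores, block_values),
               partial_softmax(partial_score, partial_value))

# Reference: a single softmax over all four sources at once.
_, l_all, o_all = partial_softmax(block_scores + partial_score,
                                  block_values + partial_value)
reference = [x / l_all for x in o_all]
```

Because the Phase-1 triple depends only on the cached block reps and the pseudo-query, it can be computed once per block and reused by every layer's Phase-2 merge.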

The per-layer memory-access accounting (the metric that actually governs decode latency, which is bandwidth-bound) is where the design pays off:

| Mechanism | Per-layer total I/O (symbolic) | Typical ($L{=}128, N{=}8, S{=}16, m{=}4$) |
|---|---|---|
| Standard residual | $3d$ | $3d$ |
| mHC ($m$ streams) | $(8m+2)d + 2m^2 + 4m$ | $34d$ |
| Full AttnRes (two-phase) | $(S+N)d$ | $24d$ |
| Block AttnRes (two-phase) | $\left(\tfrac{N}{S}+5\right)d$ | $5.5d$ |

Block AttnRes lands at $5.5d$, within a small constant of the standard residual’s $3d$, and roughly 6× cheaper than mHC’s $34d$ despite delivering comparable quality. End-to-end inference latency overhead: < 2%.
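The "typical" column follows from plugging the stated settings into the symbolic formulas. A short sketch, where the hidden size `d = 4096` is a hypothetical value used only so the lower-order mHC terms can be evaluated:

```python
def io_standard(d):
    return 3 * d


def io_mhc(d, m):
    return (8 * m + 2) * d + 2 * m * m + 4 * m


def io_full_attnres(d, S, N):
    return (S + N) * d


def io_block_attnres(d, S, N):
    return (N / S + 5) * d


d = 4096            # hypothetical hidden size, not taken from the paper
N, S, m = 8, 16, 4  # settings from the table header

# Express each cost in multiples of d, as the table does.
in_units_of_d = {
    "standard": io_standard(d) / d,
    "mHC": io_mhc(d, m) / d,
    "full": io_full_attnres(d, S, N) / d,
    "block": io_block_attnres(d, S, N) / d,
}
mhc_over_block = in_units_of_d["mHC"] / in_units_of_d["block"]
```

The mHC entry comes out slightly above 34 because of the $2m^2 + 4m$ term, which is negligible next to $d$, and the mHC-to-Block ratio is a little over 6.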

Memory-efficient prefilling

Storing block reps during prefill costs $N\cdot T\cdot d$, about 15 GB for a 128K-token sequence with 8 blocks. Sharding the reps along the sequence dimension across $P$ tensor-parallel devices lets Phase 1 run on local shards, with the Phase-2 online-softmax merge folded into the existing TP all-reduce path (reduce-scatter → local merge → all-gather). That drops per-device footprint to $N\cdot(T/P)\cdot d$, about 1.9 GB per device, and < 0.3 GB with 16K chunked prefill.
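The memory figures can be reproduced with back-of-the-envelope arithmetic. The post does not state the hidden size, element width, or device count, so the sketch below uses `d_model = 7168`, 2-byte elements, and 8 shards purely as assumptions that happen to land on the quoted numbers; treat them as illustrative, not as the paper's configuration.

```python
def block_rep_bytes(n_blocks, tokens, d_model, bytes_per_elem, shards=1):
    """Bytes needed to hold n_blocks hidden states for each token on one device."""
    return n_blocks * (tokens // shards) * d_model * bytes_per_elem


GB = 1e9
D_MODEL = 7168   # assumed hidden size
ELEM = 2         # assumed 2-byte (bf16/fp16) storage
SHARDS = 8       # assumed number of tensor-parallel devices

full_gb = block_rep_bytes(8, 128 * 1024, D_MODEL, ELEM) / GB
sharded_gb = block_rep_bytes(8, 128 * 1024, D_MODEL, ELEM, shards=SHARDS) / GB
chunked_gb = block_rep_bytes(8, 16 * 1024, D_MODEL, ELEM, shards=SHARDS) / GB
```

Under these assumptions the unsharded cost is about 15 GB, the sharded cost is just under 1.9 GB per device, and a 16K chunk needs roughly 0.23 GB per device.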

Experiments

The architecture is Kimi Linear unchanged — a MoE transformer interleaving Kimi Delta Attention (KDA) and MLA layers 3:1, each followed by an MoE FFN — with AttnRes added to the residual connections and nothing else touched.

Scaling laws

Five model sizes (194M–528M activated params), three variants each (PreNorm baseline, Full AttnRes, Block AttnRes with $N\approx 8$), identical hyperparameters chosen under the baseline (a deliberately conservative setup that favors the baseline).

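| Activated params | Baseline | Block AttnRes ($N{=}8$) | Full AttnRes | mHC(-lite) |
|---|---|---|---|---|
| 194M | 1.931 | 1.909 | 1.899 | 1.906 |
| 241M | 1.895 | 1.875 | 1.874 | 1.869 |
| 296M | 1.829 | 1.809 | 1.804 | 1.807 |
| 436M | 1.766 | 1.746 | 1.737 | 1.747 |
| 528M | 1.719 | 1.693 | 1.692 | 1.694 |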

Fitted power laws $\mathcal{L} = A\,C^{-\alpha}$ ($C$ in PFLOP/s-days):

  • Baseline: $\mathcal{L} = 1.891\,C^{-0.057}$
  • Block AttnRes: $\mathcal{L} = 1.870\,C^{-0.058}$
  • Full AttnRes: $\mathcal{L} = 1.865\,C^{-0.057}$

The slopes are nearly identical; AttnRes just sits on a lower intercept across the whole compute range. At 5.6 PFLOP/s-days, Block AttnRes reaches 1.692 vs. the baseline’s 1.714, equivalent to a $1.25\times$ compute advantage. The Full-vs-Block gap narrows with scale, down to 0.001 at 528M. And Block AttnRes matches mHC’s quality at a fraction of the per-layer I/O ($5.5d$ vs. $34d$).
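The $1.25\times$ figure can be recovered from the fitted curves alone: evaluate both fits at 5.6 PFLOP/s-days, then invert the baseline fit to find how much compute it would need to match Block AttnRes. A small sketch (the function names are illustrative; the coefficients are the ones listed above):

```python
def fitted_loss(A, alpha, C):
    """Power-law fit L = A * C^(-alpha)."""
    return A * C ** (-alpha)


def compute_for_loss(A, alpha, target):
    """Invert the fit: C = (A / L)^(1 / alpha)."""
    return (A / target) ** (1.0 / alpha)


C = 5.6  # PFLOP/s-days
block_loss = fitted_loss(1.870, 0.058, C)
baseline_loss = fitted_loss(1.891, 0.057, C)

# Compute the baseline would need to reach Block AttnRes's loss at C.
baseline_compute_needed = compute_for_loss(1.891, 0.057, block_loss)
compute_advantage = baseline_compute_needed / C
```

The two evaluated losses round to 1.692 and 1.714, and the ratio lands at about 1.25, matching the numbers quoted in the text.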

Downstream: a 48B MoE on 1.4T tokens

The headline model is the full Kimi Linear 48B config — 27 transformer blocks (54 layers), 8-of-256 routed experts plus 1 shared, 48B total / 3B activated — with Block AttnRes at 6 layers/block (9 blocks + embedding = 10 depth sources). Trained on 1.4T tokens (1T WSD pretraining + ~400B mid-training anneal), then extended to 32K context. Because MLA layers here run NoPE, context extension needed no YaRN or temperature rescaling.

Block AttnRes matches or beats the baseline on all 15 benchmarks:

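| Benchmark | Baseline | AttnRes | Δ |
|---|---|---|---|
| MMLU | 73.5 | 74.6 | +1.1 |
| MMLU-Pro | 52.2 | 52.2 | 0.0 |
| GPQA-Diamond | 36.9 | 44.4 | +7.5 |
| BBH | 76.3 | 78.0 | +1.7 |
| ARC-Challenge | 64.6 | 65.7 | +1.1 |
| HellaSwag | 83.2 | 83.4 | +0.2 |
| TriviaQA | 69.9 | 71.8 | +1.9 |
| GSM8K | 81.7 | 82.4 | +0.7 |
| MGSM | 64.9 | 66.1 | +1.2 |
| Math | 53.5 | 57.1 | +3.6 |
| CMath | 84.7 | 85.1 | +0.4 |
| HumanEval | 59.1 | 62.2 | +3.1 |
| MBPP | 72.0 | 73.9 | +1.9 |
| CMMLU | 82.0 | 82.9 | +0.9 |
| C-Eval | 79.6 | 82.5 | +2.9 |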

The gains concentrate on multi-step reasoning and code (GPQA +7.5, Math +3.6, HumanEval +3.1) — consistent with the hypothesis that better depth-wise information flow most helps compositional tasks, where deep layers want to selectively retrieve and build on specific earlier representations rather than a blurred average.

Ablations (436M model)

| Variant | Val loss |
|---|---|
| Baseline (PreNorm) | 1.766 |
| DenseFormer (fixed, input-independent scalars) | 1.767 |
| Sliding-window over depth ($W{=}8$ recent layers + embedding) | 1.764 |
| mHC ($m$ streams, learned mixing) | 1.747 |
| Block AttnRes ($S{=}4$) | 1.746 |
| Full AttnRes | 1.737 |
| Full AttnRes + input-dependent query | 1.731 (adds a $d\times d$ projection) |

Several of these are pointed:

  • DenseFormer ≈ baseline (1.767). Giving every layer access to all earlier outputs but with fixed, input-independent weights buys nothing. The input-dependence is the active ingredient, not the cross-layer access alone.
  • Sliding-window over depth (1.764) is far worse than Full AttnRes (1.737). Restricting attention to the 8 nearest layers beats the baseline only slightly. So selectively reaching distant layers matters far more than attending to many nearby ones: the opposite of a locality prior.
  • Block size sweep. $S=2,4,8$ all cluster near 1.746; $S=16,32$ degrade toward baseline. Hence the $N\approx 8$ default. As hardware loosens memory limits, finer blocks (toward Full AttnRes) are the natural upgrade path.
  • Softmax beats sigmoid (sigmoid: 1.741). The competitive normalization of softmax forces sharper selection.
  • Depth-wise multihead hurts (1.752 vs 1.746). The optimal mixture is largely uniform across channels: when a layer’s output is relevant, it’s relevant as a whole.
  • Input-independent mixing hurts (1.749), and removing RMSNorm hurts (1.743 for Full, 1.750 for Block).

Where AttnRes wants to spend capacity

A 25-config sweep at fixed compute and active params (varying $d_\text{model}/L_b$ and $H/L_b$) finds AttnRes beats the baseline in all 25 cells (by 0.019–0.063), and, more interesting, shifts the optimum toward deeper, narrower models: the baseline’s best is at $d_\text{model}/L_b \approx 60$, AttnRes’s at $\approx 45$. AttnRes can exploit additional depth more effectively, which is exactly what you’d predict from a mechanism that improves depth-wise information flow. (The paper is careful: this is a diagnostic, not a deployment recommendation, since deeper models cost more sequential latency at inference.)

Training dynamics: what actually changes

Comparing baseline vs. Block AttnRes on the 48B run over 1T tokens makes the mechanism legible:

  • Validation loss is consistently lower, and the gap widens during the LR decay phase — AttnRes converts the annealing into more gain.
  • Output magnitude. The baseline shows textbook PreNorm dilution: hidden-state magnitude grows monotonically with depth, forcing deep layers to emit ever-larger outputs. Block AttnRes confines that growth within a block — selective re-aggregation at each block boundary resets the accumulation, producing a bounded, periodic magnitude profile instead of a monotone ramp.
  • Gradient magnitude. With all residual weights pinned to 1, the baseline has no way to regulate gradient flow and dumps disproportionately large gradients into the earliest layers. AttnRes’s learned softmax weights make sources compete for probability mass, which spreads gradients far more uniformly across depth.

And the learned weight patterns $\alpha_{i\to l}$ are interpretable: strong diagonal (locality is preserved, with each layer still leaning on its immediate predecessor), but with clear off-diagonal concentrations (learned skip connections to specific distant layers), persistent non-trivial weight on the embedding throughout, and a layer-type split: pre-attention layers keep broad receptive fields while pre-MLP layers rely more sharply on recent representations. All of this structure transfers intact from Full to Block AttnRes, which is why block compression behaves like a regularizer rather than a lobotomy.

When to use which

| | Full AttnRes | Block AttnRes |
|---|---|---|
| Best for | maximum quality; smaller models; future fast interconnects | practical large-scale training and serving today |
| Quality | slightly better (gap → 0.001 at 528M) | recovers most of Full’s gains at $N\approx 8$ |
| Training overhead | $O(Ld)$ cross-stage comm, impractical at scale now | < 4% under pipeline parallelism |
| Inference I/O | $\approx 24d$/layer | $\approx 5.5d$/layer (vs. $3d$ standard, $34d$ for mHC) |
| Inference latency | practical but higher | < 2% |

The honest summary: Block AttnRes with ~8 blocks is the recommended production variant, and Full AttnRes is the theoretical ceiling that better interconnect hardware will eventually make practical. The two are the same mechanism at different points on the $N$ interpolation.

Takeaways

  1. The residual connection was the last fixed-weight aggregator in the transformer, and it didn’t have to be. AttnRes makes depth-wise mixing learned and input-dependent with astonishingly little machinery: one $d$-vector and one RMSNorm per layer.
  2. The time–depth duality is more than an analogy; it’s a generative principle. It predicts the whole family: prior residual variants are depth-wise linear attention, AttnRes is depth-wise softmax attention, and the structured-matrix rank ($1 \to m \to L$) tracks the move exactly. It also predicts new designs: linear-complexity depth attention is the obvious next step the paper flags.
  3. The contribution is as much systems as it is modeling. Cross-stage caching, the two-phase online-softmax inference schedule, and sequence-sharded prefilling are what turn an $O(Ld)$ idea into a $<4\%$/$<2\%$ drop-in. Without them, Full AttnRes is a single-GPU curiosity.
  4. The wins are exactly where you’d hope. Multi-step reasoning and code — the compositional tasks where a deep layer genuinely wants to retrieve a specific earlier representation rather than a flat average — improve most. PreNorm dilution is mitigated visibly in the magnitude and gradient profiles, not just in the loss number.

If you’ve been reading the recent attention literature, the satisfying thing about this paper is that it closes a symmetry. We spent a decade making the sequence axis learned and content-aware while leaving the depth axis on the fixed recurrence it inherited from 2015. AttnRes is the argument that depth deserves the same treatment — and that, because depth is small, it’s nearly free to give.

References

  • Kimi Team (Moonshot AI). Attention Residuals. arXiv:2603.15031, 2026. (code)
  • Zhang, Y. et al. Kimi Linear: An Expressive, Efficient Attention Architecture. arXiv:2510.xxxxx, 2025.
  • He, K. et al. Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015.
  • Xiong, R. et al. On Layer Normalization in the Transformer Architecture. arXiv:2002.04745, 2020.
  • Srivastava, R., Greff, K., Schmidhuber, J. Highway Networks. arXiv:1505.00387, 2015.
  • Zhu, D. et al. Hyper-Connections. arXiv:2409.19606, 2024.
  • Pagliardini, M. et al. DenseFormer: Enhancing Information Flow in Transformers via Depth-Weighted Averaging. arXiv:2402.02622, 2024.
  • Sun, Y. et al. Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv:2407.04620, 2024.
  • Dao, T., Gu, A. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2). arXiv:2405.21060, 2024.
  • Milakov, M., Gimelshein, N. Online normalizer calculation for softmax. arXiv:1805.02867, 2018.