Attention Residuals: Softmax Attention Over Depth

Almost every component of a modern transformer has, over the past few years, traded fixed behavior for learned, input-dependent behavior. Sequence mixing went from fixed convolutions to attention. Feed-forward capacity went from one dense MLP to input-routed experts. Even normalization has acquired learned scales and placements. One component sat untouched through all of it: the residual connection. The update

\mathbf{h}_l = \mathbf{h}_{l-1} + f_{l-1}(\mathbf{h}_{l-1})

has been copied verbatim from ResNet into essentially every LLM, and it still aggregates layer outputs with fixed unit weights.

The Kimi team’s Attention Residuals (AttnRes, arXiv:2603.15031) is the argument that this last fixed component should be made learned and content-dependent too — and that the right way to do it is softmax attention over the depth axis. The framing is a duality: residual connections are to depth what RNNs were to sequence, and AttnRes is the same RNN→Transformer move applied to depth instead of time. This post walks the math, the structured-matrix theory that unifies it with prior residual variants, the infrastructure that makes it survive pipeline parallelism, and the experiments — including a 48B-parameter MoE pretrained on 1.4T tokens.

Table of contents

Why touch the residual connection at all?
The duality of time and depth
Full Attention Residuals
- What it costs
Block Attention Residuals
The structured-matrix view: linear vs. softmax attention over depth
Infrastructure: making it survive pipeline parallelism
Experiments
Training dynamics: what actually changes
When to use which
Takeaways
- References

Why touch the residual connection at all?

Two roles of the residual connection are worth separating.

The famous one is the gradient highway. Backpropagating through the residual recurrence gives

\frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \cdot \prod_{j=l}^{L-1}\left(\mathbf{I} + \frac{\partial f_j}{\partial \mathbf{h}_j}\right),

and expanding the product always leaves a bare $\mathbf{I}$ term — a direct path from the loss to any layer, regardless of depth. This is what lets us train deep networks at all, and AttnRes does not want to lose it.

The less-discussed role is as a depth-wise aggregation rule. Unroll the recurrence and the hidden state entering layer $l$ is

\mathbf{h}_l = \mathbf{h}_1 + \sum_{i=1}^{l-1} f_i(\mathbf{h}_i),

i.e. the token embedding plus a uniformly weighted sum of every preceding layer output. Every layer receives the same unit-weight sum of everything below it. There is no mechanism to emphasize one earlier layer over another, and no way for an attention layer and an MLP layer to ask for different mixtures of the past.

Under PreNorm — the dominant placement in modern LLMs — this flat accumulation has a concrete pathology. Because each layer’s output is added (not normalized) into the stream, $\lVert \mathbf{h}_l \rVert$ grows as $O(L)$ with depth. The normalization sits before each sublayer, so each $f_l$ sees a unit-scale input but writes into an ever-larger residual. The relative contribution of any single layer therefore shrinks with depth — later layers must learn ever-larger outputs just to stay audible. This is PreNorm dilution, and it shows up empirically as the well-known result that you can prune a large fraction of an LLM’s middle layers with little loss: their contributions were being drowned out anyway.

The paper names three concrete limitations of single-state residuals (fixed or gated):

No selective access. Different layer types (attention vs. MLP) get the same aggregated state, even when they would benefit from different weightings.
Irreversible loss. Information blurred together by summation cannot be selectively recovered later.
Output growth. Later layers inflate their outputs to gain influence over the accumulated residual, which destabilizes training.

These are exactly the symptoms RNNs had on the sequence axis before attention. Which is the whole idea.

The duality of time and depth

Lay the two recurrences side by side:

	Sequence axis (RNN)	Depth axis (Residual)
State	$\mathbf{s}_t$	$\mathbf{h}_l$
Update	$\mathbf{s}_t = g(\mathbf{s}_{t-1}, \mathbf{x}_t)$	$\mathbf{h}_l = \mathbf{h}_{l-1} + f_{l-1}(\mathbf{h}_{l-1})$
Bottleneck	one state compresses all past tokens	one state compresses all past layers
Fix	attention: selectively read all positions	AttnRes: selectively read all layers

On the sequence side, the Transformer’s answer to RNN compression was to let each position attend over all previous positions with data-dependent weights. AttnRes proposes the identical move over depth:

\mathbf{h}_l = \alpha_{0 \to l}\,\mathbf{h}_1 + \sum_{i=1}^{l-1} \alpha_{i \to l}\, f_i(\mathbf{h}_i), \qquad \sum_{i=0}^{l-1}\alpha_{i\to l} = 1,

where $\alpha_{i\to l}$ are learned, input-dependent attention weights. The crucial observation that makes this affordable: while sequence length reaches millions of tokens, network depth is modest — $L < 1000$ in any real model. So $O(L^2)$ attention over depth, which would be unthinkable over a long sequence, is essentially free.

Full Attention Residuals

Write the weights as a normalized kernel $\alpha_{i\to l} = \phi(\mathbf{q}_l, \mathbf{k}_i) / \sum_j \phi(\mathbf{q}_l, \mathbf{k}_j)$. The choice of $\phi$ determines which residual variant you get (more on this in the structured-matrix section); AttnRes picks the exponential kernel with a normalized key, giving plain softmax attention over depth:

\phi(\mathbf{q}, \mathbf{k}) = \exp\!\left(\mathbf{q}^\top \mathrm{RMSNorm}(\mathbf{k})\right), \qquad \alpha_{i \to l} = \frac{\phi(\mathbf{q}_l, \mathbf{k}_i)}{\sum_{j=0}^{l-1}\phi(\mathbf{q}_l, \mathbf{k}_j)}.

The queries, keys, and values are deliberately minimal:

\mathbf{q}_l = \mathbf{w}_l, \qquad \mathbf{k}_i = \mathbf{v}_i = \begin{cases} \mathbf{h}_1 & i = 0 \\ f_i(\mathbf{h}_i) & 1 \le i \le l-1, \end{cases}

so the input to layer $l$ is just $\mathbf{h}_l = \sum_{i=0}^{l-1} \alpha_{i\to l}\,\mathbf{v}_i$.

Three design decisions deserve attention:

The query is a bare learned vector $\mathbf{w}_l \in \mathbb{R}^d$, one per layer: a “pseudo-query.” There is no query projection of the current hidden state. This is the entire parameter cost of AttnRes: one $d$-vector per layer plus one RMSNorm. It looks like a limitation but is the key to the infrastructure story below: because $\mathbf{w}_l$ doesn’t depend on layer $l$’s forward computation, the attention weights for a whole group of layers can be computed in parallel, without waiting for each layer’s output in sequence.
Keys and values are the same thing — the raw layer outputs. No separate key/value projections.
RMSNorm sits inside the kernel, on the keys. Without it, a layer that naturally produces large-magnitude outputs would dominate the softmax purely on scale. Normalizing the keys forces the selection to be about direction/content, not magnitude. The ablation confirms this matters: removing it costs about 0.006 loss in Full AttnRes (1.737 → 1.743) and about 0.004 in Block AttnRes (1.746 → 1.750).

One more practical detail that the paper flags as essential: all pseudo-queries $\mathbf{w}_l$ must be initialized to zero. Zero queries make every $\phi(\mathbf{0}, \mathbf{k}_i) = 1$, so the initial attention is uniform: AttnRes starts life as an equal-weight average of all sources, which is the standard residual sum up to a $1/l$ scale that the following PreNorm removes, and learns to deviate from there. Without this, early training is volatile.
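The definitions above can be traced for a single token in a few lines. This is a minimal pure-Python sketch, not the paper's implementation: the helper names `rms_norm` and `depth_attention` are illustrative, the RMSNorm here has no learned gain, and the `eps` value is an assumption.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def depth_attention(query, sources):
    """Softmax attention over depth for one token.

    query   : pseudo-query w_l (list of d floats)
    sources : [v_0, ..., v_{l-1}], keys and values are the same vectors
    returns : (weights alpha_{i->l}, mixed input h_l)
    """
    # Keys are normalized, values are not.
    logits = [sum(q * k for q, k in zip(query, rms_norm(v))) for v in sources]
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    d = len(query)
    mixed = [sum(w * v[j] for w, v in zip(weights, sources)) for j in range(d)]
    return weights, mixed

# Three depth sources with very different magnitudes.
sources = [[1.0, 0.0, 2.0], [100.0, -50.0, 25.0], [0.5, 0.5, 0.5]]

# A zero pseudo-query gives uniform weights regardless of the sources.
uniform, h = depth_attention([0.0, 0.0, 0.0], sources)

# Rescaling one source barely moves the weights, because keys are normalized.
query = [0.3, -0.2, 0.1]
w_before, _ = depth_attention(query, sources)
rescaled = [sources[0], [10.0 * x for x in sources[1]], sources[2]]
w_after, _ = depth_attention(query, rescaled)
```

With the zero query every source gets weight 1/3, and `w_before` and `w_after` agree up to the small `eps` term, which is the scale-insensitivity the key normalization is meant to provide.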

What it costs

Per token, Full AttnRes needs $O(L^2 d)$ arithmetic and $O(Ld)$ memory to hold the layer outputs. The arithmetic is genuinely negligible because $L \ll T$. The memory is the catch, but in vanilla training it’s free, because those layer outputs are already retained for backprop. So on a single device, Full AttnRes is nearly a free lunch.

The problem is scale. At scale you use activation recomputation (which would otherwise free and recompute those outputs) and pipeline parallelism (which splits layers across devices). Now every layer output must be (a) kept alive instead of recomputed, and (b) transmitted across stage boundaries. Both blow up to $O(Ld)$ — and the cross-stage communication, not the arithmetic, is the real wall. This is what Block AttnRes is for.

Block Attention Residuals

The idea: don’t attend over all $L$ individual layer outputs. Partition the $L$ layers into $N$ blocks of $S = L/N$ layers, sum within a block to get one representation per block, and attend over only the $N$ block summaries (plus the embedding). Memory and communication drop from $O(Ld)$ to $O(Nd)$.

Intra-block accumulation. Within block $n$, layer outputs accumulate by ordinary summation:

\mathbf{b}_n = \sum_{j \in \mathcal{B}_n} f_j(\mathbf{h}_j), \qquad \mathbf{b}_n^{i} = \text{partial sum over the first } i \text{ layers of } \mathcal{B}_n.

So inside a block, you’re back to a standard residual; the learned attention only kicks in across block boundaries.

Inter-block attention. Set $\mathbf{b}_0 = \mathbf{h}_1$ (the embedding is always an available source). For the $i$-th layer in block $n$, the value set is:

\mathbf{V} = \begin{cases} [\mathbf{b}_0, \mathbf{b}_1, \ldots, \mathbf{b}_{n-1}] & i = 1 \text{ (first layer of the block)} \\ [\mathbf{b}_0, \mathbf{b}_1, \ldots, \mathbf{b}_{n-1}, \mathbf{b}_n^{i-1}] & i \ge 2 \text{ (later layers also see the evolving partial sum)} \end{cases}

with the same kernel and normalization as Full AttnRes. So the first layer of a block sees the completed earlier blocks; subsequent layers additionally attend to the partial sum building up within the current block. The final output layer aggregates all $N$ block representations.

The block count $N$ is a clean interpolation knob:

$N = L$ (one layer per block) recovers Full AttnRes exactly.
$N = 1$ reduces to a standard residual, with the embedding isolated as $\mathbf{b}_0$ .

Empirically $N \approx 8$ recovers most of the Full AttnRes benefit while storing only eight hidden states per token.
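To make the interpolation concrete, the sketch below enumerates which depth sources each layer attends over for a given $L$ and $N$. It is an illustration of the value-set rule above, not code from the paper; the function name `visible_sources` and the string labels are made up here, and it assumes $L$ is divisible by $N$.

```python
def visible_sources(num_layers, num_blocks):
    """Map each layer (1-indexed) to the depth sources it attends over.

    'b0' is the token embedding, 'b<k>' a completed block summary, and
    'partial' the running intra-block sum b_n^{i-1}.
    Assumes num_layers is divisible by num_blocks.
    """
    block_size = num_layers // num_blocks
    table = {}
    for layer in range(1, num_layers + 1):
        n = (layer - 1) // block_size + 1    # 1-based block index
        i = (layer - 1) % block_size + 1     # 1-based position inside the block
        sources = ["b0"] + [f"b{k}" for k in range(1, n)]
        if i >= 2:                           # later layers also see the partial sum
            sources.append("partial")
        table[layer] = sources
    return table

two_blocks = visible_sources(8, 2)   # S = 4
full = visible_sources(8, 8)         # S = 1, one layer per block
single = visible_sources(8, 1)       # S = 8, one block
```

With `N = L` every layer is the first of its own block, so layer $l$ sees $l$ distinct sources, matching Full AttnRes. With `N = 1` every layer after the first sees only the embedding and the running sum.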

Here is the paper’s PyTorch-style sketch. Note how block_attn_res is just a softmax over stacked block reps using the learned projection weight as the query, and how forward threads partial_block ($\mathbf{b}_n^i$) and blocks ($[\mathbf{b}_0,\dots,\mathbf{b}_{n-1}]$) through one transformer layer:

import torch

def block_attn_res(blocks, partial_block, proj, norm):
    """Softmax attention over depth sources for one sublayer input."""
    # blocks: N tensors [B, T, D] (completed block reps, incl. token embedding)
    # partial_block: [B, T, D]  (intra-block partial sum b_n^i)
    # proj.weight holds the pseudo-query; squeeze() assumes it has a single row
    V = torch.stack(blocks + [partial_block])          # [N+1, B, T, D]
    K = norm(V)                                         # RMSNorm on keys only
    logits = torch.einsum('d,nbtd->nbt', proj.weight.squeeze(), K)
    # softmax(0) normalizes over the depth sources, not over tokens
    h = torch.einsum('nbt,nbtd->btd', logits.softmax(0), V)
    return h

def forward(self, blocks, hidden_states):
    partial_block = hidden_states
    # --- attention sublayer ---
    h = block_attn_res(blocks, partial_block, self.attn_res_proj, self.attn_res_norm)
    # block_size is assumed to count sublayers (attention + MLP), so a boundary
    # falls every block_size // 2 transformer layers
    if self.layer_number % (self.block_size // 2) == 0:   # block boundary
        blocks.append(partial_block)                      # freeze the finished block
        partial_block = None                              # start a fresh partial sum
    attn_out = self.attn(self.attn_norm(h))
    partial_block = attn_out if partial_block is None else partial_block + attn_out
    # --- MLP sublayer ---
    h = block_attn_res(blocks, partial_block, self.mlp_res_proj, self.mlp_res_norm)
    mlp_out = self.mlp(self.mlp_norm(h))
    partial_block = partial_block + mlp_out
    return blocks, partial_block

The structured-matrix view: linear vs. softmax attention over depth

This is the theoretical heart of the paper, and the part that elevates AttnRes from “a trick” to “a unifying lens.” Define a depth mixing matrix $\mathbf{M} \in \mathbb{R}^{L\times L}$ where $\mathbf{M}_{i\to l}$ is the weight layer $l$ assigns to source $i$, so that $\mathbf{h}_l = \sum_{i=0}^{l-1}\mathbf{M}_{i\to l}\,\mathbf{v}_i$. Each residual variant discussed here is some choice of $\mathbf{M}$, and they differ in how the entries arise (fixed / learned-static / input-dependent) and in the semiseparable rank of $\mathbf{M}$.

Standard residual. Unrolling gives $\mathbf{M}_{i\to l}=1$ for all $i<l$ : the all-ones lower-triangular matrix.

\begin{bmatrix}\mathbf{h}_1\\\mathbf{h}_2\\\vdots\\\mathbf{h}_L\end{bmatrix} = \begin{bmatrix} 1 & & & \\ 1 & 1 & & \\ \vdots & \vdots & \ddots & \\ 1 & 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix}\mathbf{v}_0\\\mathbf{v}_1\\\vdots\\\mathbf{v}_{L-1}\end{bmatrix}

This is 1-semiseparable — the lowest-rank structure possible.
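As a quick sanity check of the matrix form, the sketch below builds the all-ones lower-triangular $\mathbf{M}$ and confirms that applying it to the sources gives the same states as running the residual recurrence. The function names are illustrative, and the layer outputs are treated as fixed given vectors rather than computed by real sublayers.

```python
def standard_residual_matrix(L):
    """All-ones lower-triangular M: row l-1 holds the weights of h_l on v_0..v_{L-1}."""
    return [[1.0 if i <= row else 0.0 for i in range(L)] for row in range(L)]

def apply_mixing(M, values):
    """h_l = sum_i M[l][i] * v_i for vector-valued sources."""
    d = len(values[0])
    return [
        [sum(M[r][i] * values[i][j] for i in range(len(values))) for j in range(d)]
        for r in range(len(M))
    ]

def residual_recurrence(values):
    """h_1 = v_0, then h_{l+1} = h_l + v_l, with v_l standing in for f_l(h_l)."""
    states, h = [], [0.0] * len(values[0])
    for v in values:
        h = [a + b for a, b in zip(h, v)]
        states.append(h)
    return states

# v_0 is the embedding; v_1..v_3 stand in for layer outputs.
values = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 4.0]]
M = standard_residual_matrix(len(values))
from_matrix = apply_mixing(M, values)
from_recurrence = residual_recurrence(values)
```

The other variants in this section change only how the entries of `M` are filled in; `apply_mixing` stays the same.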

Highway. With (scalar) carry gates, defining the carry product $\gamma_{i\to l}^{\times} = \prod_{j=i+1}^{l}(1-g_j)$, the weights are $\mathbf{M}_{i\to l} = g_{i+1}\,\gamma_{i+1\to l}^{\times}$. Because everything factors through scalar gates, $\mathbf{M}$ is still 1-semiseparable: same rank as the plain residual, just input-dependent. (The weights also sum to one, making Highway a softmax-free, depth-wise instance of stick-breaking attention.)
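The sum-to-one claim can be checked by unrolling a gated recurrence. The sketch below assumes the update $\mathbf{h}_j = (1-g_j)\,\mathbf{h}_{j-1} + g_j\, f_{j-1}(\mathbf{h}_{j-1})$ with $\mathbf{h}_1 = \mathbf{v}_0$, which is one common way to write Highway and is not spelled out in the text; under that assumption the embedding keeps weight $\gamma_{1\to l}^{\times}$. The function names are illustrative.

```python
def highway_weights(gates):
    """Unrolled Highway mixing weights for the input to layer l = len(gates) + 1.

    gates = [g_2, ..., g_l]; gate g_{i+1} admits source v_i = f_i(h_i).
    Assumed recurrence: h_j = (1 - g_j) * h_{j-1} + g_j * f_{j-1}(h_{j-1}), h_1 = v_0.
    Returns [M_{0->l}, M_{1->l}, ..., M_{l-1->l}].
    """
    weights = [1.0]                                   # h_1 = v_0
    for g in gates:
        weights = [w * (1.0 - g) for w in weights]    # carry everything so far
        weights.append(g)                             # admit the newest source
    return weights

def closed_form_weight(gates, i):
    """M_{i->l} = g_{i+1} * prod_{j=i+2}^{l} (1 - g_j) for i >= 1."""
    l = len(gates) + 1
    g = {j: gates[j - 2] for j in range(2, l + 1)}    # index gates by layer number
    carry = 1.0
    for j in range(i + 2, l + 1):
        carry *= 1.0 - g[j]
    return g[i + 1] * carry

gates = [0.2, 0.5, 0.9]            # g_2, g_3, g_4
weights = highway_weights(gates)   # weights on v_0..v_3 for h_4
```

Each step rescales the existing weights by $(1-g)$ and hands the freed mass $g$ to the newest source, which is the stick-breaking picture mentioned above.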

(m)HC / Hyper-Connections. Maintain $m$ parallel streams $\mathbf{H}_l \in \mathbb{R}^{d\times m}$ updated by a learned transition matrix $\mathbf{A}_l \in \mathbb{R}^{m\times m}$ . Unrolling yields

\mathbf{M}_{i\to l} = \boldsymbol{\beta}_i^\top\, \mathbf{A}_{i+1\to l}^{\times}\, \boldsymbol{\alpha}_l, \qquad \mathbf{A}_{i\to j}^{\times} = \prod_{k=i+1}^{j}\mathbf{A}_k,

which is $m$-semiseparable. The $m$ streams are exactly state expansion along the depth axis, the same trick that takes a recurrent state from $d$ to $d\times m$ on the sequence side.

Full AttnRes. $\mathbf{M}_{i\to l} = \alpha_{i\to l}$ from a softmax over input-dependent keys, giving a dense, rank-$L$ $\mathbf{M}$.

Block AttnRes. Sources in a completed block share that block’s key/value, so they share a weight; the current block contributes one extra distinct source via its partial sum. The effective rank lands between $N$ and $N+S$, interpolating between standard residual ($N{=}1$) and Full AttnRes ($N{=}L$).

Method	Mixing matrix $\mathbf{M}$	Weights	Rank
Standard residual	all-ones lower-triangular	fixed	1-semiseparable
Highway	cumulative gate products	input-dependent	1-semiseparable
(m)HC	$m$ streams, transition matrices	input-dependent	$m$ -semiseparable
Full AttnRes	dense softmax weights	input-dependent	rank- $L$ (full)
Block AttnRes	block-structured	input-dependent	between $N$ and $N+S$

The punchline is a clean reframing. Look at the (m)HC weight $\mathbf{M}_{i\to l} = \boldsymbol{\beta}_i^\top \mathbf{A}_{i+1\to l}^{\times}\boldsymbol{\alpha}_l$ : $\boldsymbol{\alpha}_l$ is a query, $\boldsymbol{\beta}_i$ is a key, and the cumulative transition $\mathbf{A}^{\times}$ is a depth-relative positional operator. That is precisely the linear attention form (a separable query–key product, accumulated by a recurrence). So:

Prior residual generalizations — Highway, Hyper-Connections, (m)HC — are all depth-wise linear attention. AttnRes is depth-wise softmax attention. The progression over depth (residual → gated recurrence → softmax attention) is the same progression that played out over sequence (RNN → linear attention → Transformer).

The structured-matrix view also pays off diagnostically: the input-dependent $\mathbf{M}$ exposes depth-wise attention sinks — specific layers that consistently attract weight regardless of input, mirroring the token-level attention-sink phenomenon. And whenever the kernel factorizes as $\phi(\mathbf{q},\mathbf{k}) = \varphi(\mathbf{q})^\top\varphi(\mathbf{k})$ , depth-wise attention collapses back into a recurrence — which is exactly why the linear-attention residual variants exist as special cases.
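The collapse into a recurrence is easy to see numerically. In the sketch below the kernel is taken to be a plain dot product of given feature vectors, i.e. $\varphi$ is the identity on positive features; that choice, and the function names, are assumptions made for illustration. The explicit version reads every source, while the recurrent version keeps only a fixed-size state.

```python
def attend_explicit(q_feat, k_feats, values):
    """Normalized kernel attention with phi(q, k) = q_feat . k_feat over all sources."""
    scores = [sum(a * b for a, b in zip(q_feat, k)) for k in k_feats]
    total = sum(scores)
    d = len(values[0])
    return [sum(s * v[j] for s, v in zip(scores, values)) / total for j in range(d)]

def attend_recurrent(q_feat, k_feats, values):
    """Same output from a running state (S, z) updated once per source."""
    r, d = len(q_feat), len(values[0])
    S = [[0.0] * d for _ in range(r)]    # S = sum_i k_feat_i v_i^T  (r x d)
    z = [0.0] * r                        # z = sum_i k_feat_i
    for k, v in zip(k_feats, values):
        for a in range(r):
            z[a] += k[a]
            for j in range(d):
                S[a][j] += k[a] * v[j]
    denom = sum(q_feat[a] * z[a] for a in range(r))
    return [sum(q_feat[a] * S[a][j] for a in range(r)) / denom for j in range(d)]

# Positive features keep the normalizer positive.
q_feat = [0.5, 1.0]
k_feats = [[1.0, 2.0], [0.3, 0.1], [2.0, 0.5]]
values = [[1.0, 0.0, -1.0], [2.0, 2.0, 2.0], [0.0, 4.0, 1.0]]

explicit = attend_explicit(q_feat, k_feats, values)
recurrent = attend_recurrent(q_feat, k_feats, values)
```

The state `S` has size $r \times d$ no matter how many sources have been absorbed, which is why a factorized kernel amounts to a recurrent, state-expanded residual. The exponential kernel used by AttnRes has no such finite factorization, so the sources have to be kept.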

Infrastructure: making it survive pipeline parallelism

A clever residual that can’t be trained at scale is a footnote. The infra section is where Block AttnRes earns “drop-in replacement.”

Training: cross-stage caching

Under an interleaved pipeline schedule with $P$ physical stages and $V$ virtual stages per physical stage, Block AttnRes needs all accumulated block reps at each stage to do inter-block attention. The naive approach re-sends the entire block history at every stage transition. With $C = PV$ chunks, each contributing $N_p$ blocks, chunk $j$ carries $jN_p$ blocks, so the naive per-token communication is

\mathrm{Comm}_{\text{naive}} = \sum_{j=1}^{C-1} jN_p d = \frac{C(C-1)}{2}N_p d.

The fix is cross-stage caching: each physical stage processes multiple virtual stages in succession, so blocks it received earlier are still in local memory and need not be re-transmitted. Only the incremental blocks since the receiver’s previous chunk cross the wire:

\mathrm{Comm}_{\text{cached}} = \underbrace{\tfrac{P(P-1)}{2}N_p d}_{\text{first virtual stage}} + \underbrace{(V-1)P^2 N_p d}_{\text{later virtual stages}}.

This cuts the peak per-transition cost from $O(C)$ to $O(P)$ — a $V\times$ improvement — which is enough to fully overlap with compute in steady-state 1F1B. The backward pass benefits identically. Because activation checkpointing erases all the inter-block attention intermediates and the checkpointed input matches the size of the hidden state it replaces, per-layer activation memory is unchanged from a standard architecture. Measured end-to-end training overhead under pipeline parallelism: < 4% (and negligible without pipeline parallelism).
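The two communication formulas can be cross-checked by summing per-transition volumes directly. The sketch below follows one reading of the closed forms: in the first virtual stage, transition $j$ ships $jN_p$ blocks, and each of the $(V-1)P$ later transitions ships the $PN_p$ blocks produced since the receiver's previous chunk. The function names are illustrative and volumes are in elements per token.

```python
def naive_comm(P, V, n_p, d):
    """Every transition re-sends the whole history: transition j ships j * n_p blocks."""
    C = P * V
    return sum(j * n_p * d for j in range(1, C))

def cached_comm(P, V, n_p, d):
    """Only blocks the receiver has not yet seen cross the wire."""
    first_pass = sum(j * n_p * d for j in range(1, P))    # first virtual stage
    later_passes = (V - 1) * P * (P * n_p * d)             # P * n_p new blocks each
    return first_pass + later_passes

def naive_closed_form(P, V, n_p, d):
    C = P * V
    return C * (C - 1) // 2 * n_p * d

def cached_closed_form(P, V, n_p, d):
    return P * (P - 1) // 2 * n_p * d + (V - 1) * P * P * n_p * d

# Small example in units of n_p * d.
naive_total = naive_comm(4, 2, 1, 1)      # 28
cached_total = cached_comm(4, 2, 1, 1)    # 22
naive_peak = (4 * 2 - 1) * 1 * 1          # largest single naive transition
cached_peak = 4 * 1 * 1                   # largest single cached transition
```

In this small case the total drops only from 28 to 22, but the largest single transfer drops from 7 to 4 units. That per-transition peak is the quantity that has to hide behind compute.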

Inference: the two-phase strategy

Layer-wise AttnRes at decode time looks just like autoregressive decoding, with block reps playing the role of a KV cache reused across layers. A naive implementation re-reads all preceding blocks at every layer: $O(L\cdot N)$ memory accesses. The two-phase strategy exploits the fact that the pseudo-queries are decoupled from the forward pass, so all $S$ queries in a block can be batched:

Phase 1 — parallel inter-block attention. Batch all $S$ block-layer queries into a single matmul against the cached block reps, returning outputs and softmax statistics (max, log-sum-exp). This amortizes the read cost from $S$ passes to one.
Phase 2 — sequential intra-block attention. For each layer, attend over the evolving partial sum, then merge with the Phase-1 result via online softmax. The merge is elementwise, so it fuses cleanly with surrounding ops (e.g. RMSNorm).

The merge is the standard streaming-softmax combination, exact (not an approximation):

\mathbf{h}_l = \frac{e^{m^{(1)}-m}\mathbf{o}^{(1)} + e^{m^{(2)}-m}\mathbf{o}^{(2)}}{e^{m^{(1)}-m}\ell^{(1)} + e^{m^{(2)}-m}\ell^{(2)}}, \qquad m = \max(m^{(1)}, m^{(2)}).
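A short check that the merge is exact. The sketch assumes the usual convention that each phase returns its running max $m^{(k)}$, the sum of shifted exponentials $\ell^{(k)}$, and the unnormalized output $\mathbf{o}^{(k)}$; the function names are illustrative. Phase 1 stands in for the cached block reps and phase 2 for the current partial sum.

```python
import math

def partial_attention(logits, values):
    """Return (m, l, o): max logit, sum of exp(logit - m), unnormalized output."""
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    d = len(values[0])
    o = [sum(e * v[j] for e, v in zip(exps, values)) for j in range(d)]
    return m, sum(exps), o

def merge(part1, part2):
    """Combine two partial softmax results into the exact joint softmax output."""
    (m1, l1, o1), (m2, l2, o2) = part1, part2
    m = max(m1, m2)
    a1, a2 = math.exp(m1 - m), math.exp(m2 - m)    # rescale both to the shared max
    denom = a1 * l1 + a2 * l2
    return [(a1 * x + a2 * y) / denom for x, y in zip(o1, o2)]

def full_softmax_attention(logits, values):
    """Reference: one softmax over all sources at once."""
    m, l, o = partial_attention(logits, values)
    return [x / l for x in o]

# Phase 1: three cached block reps. Phase 2: the current partial sum.
block_logits = [0.2, -1.5, 3.0]
block_values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
partial_logits = [4.1]
partial_values = [[-1.0, 0.5]]

merged = merge(
    partial_attention(block_logits, block_values),
    partial_attention(partial_logits, partial_values),
)
reference = full_softmax_attention(
    block_logits + partial_logits, block_values + partial_values
)
```

Because the merge needs only two scalars and one vector per phase, the Phase 1 results for all layers of a block can be computed in one batched pass and combined later, layer by layer.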

The per-layer memory-access accounting (the metric that actually governs decode latency, which is bandwidth-bound) is where the design pays off:

Mechanism	Per-layer total I/O (symbolic)	Typical ( $L{=}128, N{=}8, S{=}16, m{=}4$ )
Standard residual	$3d$	$3d$
mHC ( $m$ streams)	$(8m+2)d + 2m^2 + 4m$	$34d$
Full AttnRes (two-phase)	$(S+N)d$	$24d$
Block AttnRes (two-phase)	$\left(\tfrac{N}{S}+5\right)d$	$5.5d$

Block AttnRes lands at $5.5d$ — within a small constant of the standard residual’s $3d$ , and 6× cheaper than mHC’s $34d$ despite delivering comparable quality. End-to-end inference latency overhead: < 2%.
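The "typical" column follows directly from the symbolic formulas. The sketch below evaluates the coefficient of $d$ for each mechanism; for mHC it keeps only the $(8m+2)d$ term, since $2m^2 + 4m$ does not scale with $d$. The function name is illustrative.

```python
def io_coefficients(N, S, m):
    """Per-layer I/O in units of d, using the symbolic formulas from the table."""
    return {
        "standard": 3.0,
        "mhc": 8 * m + 2,             # leading term only; 2m^2 + 4m is d-independent
        "full_attnres": S + N,
        "block_attnres": N / S + 5,
    }

coeffs = io_coefficients(N=8, S=16, m=4)
mhc_vs_block = coeffs["mhc"] / coeffs["block_attnres"]    # roughly 6.2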

Memory-efficient prefilling

Storing block reps during prefill costs $N\cdot T\cdot d$ — about 15 GB for a 128K-token sequence with 8 blocks. Sharding the reps along the sequence dimension across $P$ tensor-parallel devices lets Phase 1 run on local shards, with the Phase-2 online-softmax merge folded into the existing TP all-reduce path (reduce-scatter → local merge → all-gather). That drops per-device footprint to $N\cdot(T/P)\cdot d$ — ~1.9 GB per device, and < 0.3 GB with 16K chunked prefill.
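The quoted footprints can be reproduced with simple arithmetic. The hidden size, element width, and device count are not stated above, so the values below ($d = 7168$, 2 bytes per element, 8-way sharding) are assumptions chosen because they match the quoted figures; the function name is illustrative.

```python
def block_rep_bytes(num_blocks, tokens, d_model, bytes_per_elem=2, shards=1):
    """Bytes needed to hold block reps for `tokens` positions on one device."""
    return num_blocks * (tokens // shards) * d_model * bytes_per_elem

GB = 1e9
D_MODEL = 7168    # assumed hidden size, not given in the text
SHARDS = 8        # assumed number of devices sharing the sequence

full = block_rep_bytes(8, 128 * 1024, D_MODEL) / GB                     # about 15 GB
per_device = block_rep_bytes(8, 128 * 1024, D_MODEL, shards=SHARDS) / GB
chunked = block_rep_bytes(8, 16 * 1024, D_MODEL, shards=SHARDS) / GB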

Experiments

The architecture is Kimi Linear unchanged — a MoE transformer interleaving Kimi Delta Attention (KDA) and MLA layers 3:1, each followed by an MoE FFN — with AttnRes added to the residual connections and nothing else touched.

Scaling laws

Five model sizes (194M–528M activated params), three variants each (PreNorm baseline, Full AttnRes, Block AttnRes with $N\approx 8$ ), identical hyperparameters chosen under the baseline (a deliberately conservative setup that favors the baseline).

Activated params	Baseline	Block AttnRes ( $N{=}8$ )	Full AttnRes	mHC(-lite)
194M	1.931	1.909	1.899	1.906
241M	1.895	1.875	1.874	1.869
296M	1.829	1.809	1.804	1.807
436M	1.766	1.746	1.737	1.747
528M	1.719	1.693	1.692	1.694

Fitted power laws $\mathcal{L} = A\,C^{-\alpha}$ (C in PFLOP/s-days):

Baseline: $\mathcal{L} = 1.891\,C^{-0.057}$
Block AttnRes: $\mathcal{L} = 1.870\,C^{-0.058}$
Full AttnRes: $\mathcal{L} = 1.865\,C^{-0.057}$

The slopes are nearly identical; AttnRes just sits on a lower intercept across the whole compute range. At 5.6 PFLOP/s-days, Block AttnRes reaches 1.692 vs. the baseline’s 1.714 — equivalent to a $1.25\times$ compute advantage. The Full-vs-Block gap narrows with scale, down to 0.001 at 528M. And Block AttnRes matches mHC’s quality at a fraction of the per-layer I/O ( $5.5d$ vs. $34d$ ).
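The compute-advantage figure can be recovered from the fitted coefficients alone. The sketch below evaluates both fits at 5.6 PFLOP/s-days and then asks how much compute the baseline fit needs to reach the Block AttnRes loss. The function names are illustrative, and the result inherits the rounding of the published coefficients.

```python
# (A, alpha) for L = A * C^(-alpha), C in PFLOP/s-days, as listed above.
FITS = {
    "baseline": (1.891, 0.057),
    "block": (1.870, 0.058),
    "full": (1.865, 0.057),
}

def fitted_loss(name, compute):
    A, alpha = FITS[name]
    return A * compute ** (-alpha)

def compute_to_reach(name, target_loss):
    """Invert the power law: C = (A / L)^(1 / alpha)."""
    A, alpha = FITS[name]
    return (A / target_loss) ** (1.0 / alpha)

budget = 5.6
block_loss = fitted_loss("block", budget)          # about 1.692
baseline_loss = fitted_loss("baseline", budget)    # about 1.714
multiplier = compute_to_reach("baseline", block_loss) / budget
```

`multiplier` comes out close to 1.25, matching the stated compute advantage.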

Downstream: a 48B MoE on 1.4T tokens

The headline model is the full Kimi Linear 48B config — 27 transformer blocks (54 layers), 8-of-256 routed experts plus 1 shared, 48B total / 3B activated — with Block AttnRes at 6 layers/block (9 blocks + embedding = 10 depth sources). Trained on 1.4T tokens (1T WSD pretraining + ~400B mid-training anneal), then extended to 32K context. Because MLA layers here run NoPE, context extension needed no YaRN or temperature rescaling.

Block AttnRes matches or beats the baseline on all 15 benchmarks:

	Baseline	AttnRes	Δ
MMLU	73.5	74.6	+1.1
MMLU-Pro	52.2	52.2	—
GPQA-Diamond	36.9	44.4	+7.5
BBH	76.3	78.0	+1.7
ARC-Challenge	64.6	65.7	+1.1
HellaSwag	83.2	83.4	+0.2
TriviaQA	69.9	71.8	+1.9
GSM8K	81.7	82.4	+0.7
MGSM	64.9	66.1	+1.2
Math	53.5	57.1	+3.6
CMath	84.7	85.1	+0.4
HumanEval	59.1	62.2	+3.1
MBPP	72.0	73.9	+1.9
CMMLU	82.0	82.9	+0.9
C-Eval	79.6	82.5	+2.9

The gains concentrate on multi-step reasoning and code (GPQA +7.5, Math +3.6, HumanEval +3.1) — consistent with the hypothesis that better depth-wise information flow most helps compositional tasks, where deep layers want to selectively retrieve and build on specific earlier representations rather than a blurred average.

Ablations (436M model)

Variant	Val loss
Baseline (PreNorm)	1.766
DenseFormer (fixed, input-independent scalars)	1.767
Sliding-window over depth ( $W{=}8$ recent layers + embedding)	1.764
mHC ( $m$ streams, learned mixing)	1.747
Block AttnRes ( $S{=}4$ )	1.746
Full AttnRes	1.737
Full AttnRes + input-dependent query	1.731 (adds a $d\times d$ projection)

Several of these are pointed:

DenseFormer ≈ baseline (1.767). Giving every layer access to all earlier outputs but with fixed, input-independent weights buys nothing. The input-dependence is the active ingredient, not the cross-layer access alone.
Sliding-window over depth (1.764) is far worse than Full AttnRes (1.737). Restricting attention to the 8 nearest layers plus the embedding beats the baseline (1.766) only slightly. So selectively reaching distant layers matters far more than attending to many nearby ones, the opposite of a locality prior.
Block size sweep. $S=2,4,8$ all cluster near 1.746; $S=16,32$ degrade toward baseline. Hence the $N\approx 8$ default. As hardware loosens memory limits, finer blocks (toward Full AttnRes) are the natural upgrade path.
Softmax beats sigmoid (1.741). The competitive normalization of softmax forces sharper selection. Depth-wise multihead hurts (1.752 vs. 1.746): the optimal mixture is largely uniform across channels, so when a layer’s output is relevant, it is relevant as a whole. Input-independent mixing hurts (1.749), and removing RMSNorm hurts (1.743 for Full, 1.750 for Block).

Where AttnRes wants to spend capacity

A 25-config sweep at fixed compute and active params (varying $d_\text{model}/L_b$ and $H/L_b$) finds AttnRes beats the baseline in all 25 cells (by 0.019–0.063) and, more interesting, shifts the optimum toward deeper, narrower models: the baseline’s best is at $d_\text{model}/L_b \approx 60$, AttnRes’s at $\approx 45$. AttnRes can exploit additional depth more effectively, which is exactly what you’d predict from a mechanism that improves depth-wise information flow. (The paper is careful: this is a diagnostic, not a deployment recommendation, since deeper models cost more sequential latency at inference.)

Training dynamics: what actually changes

Comparing baseline vs. Block AttnRes on the 48B run over 1T tokens makes the mechanism legible:

Validation loss is consistently lower, and the gap widens during the LR decay phase — AttnRes converts the annealing into more gain.
Output magnitude. The baseline shows textbook PreNorm dilution: hidden-state magnitude grows monotonically with depth, forcing deep layers to emit ever-larger outputs. Block AttnRes confines that growth within a block — selective re-aggregation at each block boundary resets the accumulation, producing a bounded, periodic magnitude profile instead of a monotone ramp.
Gradient magnitude. With all residual weights pinned to 1, the baseline has no way to regulate gradient flow and dumps disproportionately large gradients into the earliest layers. AttnRes’s learned softmax weights make sources compete for probability mass, which spreads gradients far more uniformly across depth.

And the learned weight patterns $\alpha_{i\to l}$ are interpretable: strong diagonal (locality is preserved — each layer still leans on its immediate predecessor), but with clear off-diagonal concentrations (learned skip connections to specific distant layers), persistent non-trivial weight on the embedding throughout, and a layer-type split — pre-attention layers keep broad receptive fields while pre-MLP layers rely more sharply on recent representations. All of this structure transfers intact from Full to Block AttnRes, which is why block compression behaves like a regularizer rather than a lobotomy.

When to use which

	Full AttnRes	Block AttnRes
Best for	maximum quality; smaller models; future fast interconnects	practical large-scale training and serving today
Quality	slightly better (gap → 0.001 at 528M)	recovers most of Full’s gains at $N\approx 8$
Training overhead	$O(Ld)$ cross-stage comm — impractical at scale now	< 4% under pipeline parallelism
Inference I/O	$\approx 24d$ /layer	$\approx 5.5d$ /layer (vs. $3d$ standard, $34d$ for mHC)
Inference latency	practical but higher	< 2%

The honest summary: Block AttnRes with ~8 blocks is the recommended production variant, and Full AttnRes is the theoretical ceiling that better interconnect hardware will eventually make practical. The two are the same mechanism at different points on the $N$ interpolation.

Takeaways

The residual connection was the last fixed-weight aggregator in the transformer, and it didn’t have to be. AttnRes makes depth-wise mixing learned and input-dependent with astonishingly little machinery: one $d$-vector and one RMSNorm per layer.
The time–depth duality is more than an analogy; it’s a generative principle. It predicts the whole family: prior residual variants are depth-wise linear attention, AttnRes is depth-wise softmax attention, and the structured-matrix rank ( $1 \to m \to L$ ) tracks the move exactly. It also predicts new designs — linear-complexity depth attention is the obvious next step the paper flags.
The contribution is as much systems as it is modeling. Cross-stage caching, the two-phase online-softmax inference schedule, and sequence-sharded prefilling are what turn an $O(Ld)$ idea into a $<4\%$ / $<2\%$ drop-in. Without them, Full AttnRes is a single-GPU curiosity.
The wins are exactly where you’d hope. Multi-step reasoning and code — the compositional tasks where a deep layer genuinely wants to retrieve a specific earlier representation rather than a flat average — improve most. PreNorm dilution is mitigated visibly in the magnitude and gradient profiles, not just in the loss number.

If you’ve been reading the recent attention literature, the satisfying thing about this paper is that it closes a symmetry. We spent a decade making the sequence axis learned and content-aware while leaving the depth axis on the fixed recurrence it inherited from 2015. AttnRes is the argument that depth deserves the same treatment — and that, because depth is small, it’s nearly free to give.

References

Kimi Team (Moonshot AI). Attention Residuals. arXiv:2603.15031, 2026. (code)
Zhang, Y. et al. Kimi Linear: An Expressive, Efficient Attention Architecture. arXiv:2510.xxxxx, 2025.
He, K. et al. Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015.
Xiong, R. et al. On Layer Normalization in the Transformer Architecture. arXiv:2002.04745, 2020.
Srivastava, R., Greff, K., Schmidhuber, J. Highway Networks. arXiv:1505.00387, 2015.
Zhu, D. et al. Hyper-Connections. arXiv:2409.19606, 2024.
Pagliardini, M. et al. DenseFormer: Enhancing Information Flow in Transformers via Depth-Weighted Averaging. arXiv:2402.02622, 2024.
Sun, Y. et al. Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv:2407.04620, 2024.
Dao, T., Gu, A. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2). arXiv:2405.21060, 2024.
Milakov, M., Gimelshein, N. Online normalizer calculation for softmax. arXiv:1805.02867, 2018.