Posts
All the articles I've posted.
-
Hybrid Attention and MLA: The Tradeoff
A side-by-side look at Xiaomi MiMo's hybrid sliding-window/global attention and DeepSeek's Multi-head Latent Attention. Both answer the same question (how to make attention affordable at long context) with very different bets, and those bets shape everything from training infra to KV cache size.
-
Kimi K2.5: Joint Text–Vision Training and the Agent Swarm
A walkthrough of two ideas behind Kimi K2.5: how joint text–vision pre-training and RL make each modality help the other, and how Agent Swarm replaces sequential tool use with a learned parallel orchestrator.
-
Inside DeepSeek's Sparse Attention: From NSA to DSA
A deep dive into DeepSeek's two sparse attention designs — Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA) — covering the math, the hardware story, and why DSA in V3.2 looks so different from NSA.
-
AstroPaper 6.0
AstroPaper v6: a from-scratch rewrite on Astro v6, Tailwind v4, and a new config system.