AstroPaper

Archives

All the articles I've archived.

2026 12
June 3
  • Top Interview 150 — Solutions in Python

    Worked Python solutions to the LeetCode Top Interview 150, organized by topic with a short approach for each problem.

  • From AlexNet to World Models: The Evolution of Multi-Modal Neural Networks

    A ground-up tour of how neural networks learned to see, then to see-and-read, and finally to imagine. From AlexNet and CNNs, through CLIP and the vision-language models behind GPT-4V, to world models like Dreamer, V-JEPA 2, and LeWorldModel — with architectures, math, and benchmark numbers along the way.

  • Attention Residuals: Softmax Attention Over Depth

    A deep dive into the Kimi team's Attention Residuals (AttnRes) — replacing the fixed-weight residual connection with learned softmax attention over depth. Covers the time–depth duality, Full vs Block AttnRes, the structured-matrix view that unifies prior residual variants, the pipeline-parallel infra that makes it practical, and the scaling-law and 48B-MoE results.

May 9
  • GRPO and DAPO: A Deep Dive into RL for Reasoning LLMs

    An end-to-end walkthrough of Group Relative Policy Optimization (GRPO) and Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) — the two RL algorithms that drive open reasoning models in 2025–2026. Full math, every design choice motivated, and a head-to-head comparison.

  • From GRPO to GSPO: Group-Based Policy Optimization for LLMs

    A complete walkthrough of Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO) — the policy-gradient algorithms behind DeepSeek-R1 and Qwen3. Full math, the failure mode that motivated GSPO, the MoE story, and a side-by-side comparison.

  • GRPO and Dr.GRPO: The Math, the Biases, and the Fix

    An end-to-end derivation of Group Relative Policy Optimization (GRPO) from DeepSeekMath and the Dr.GRPO correction from Liu et al. Covers the full objective, the gradient, the two biases (length and question difficulty), the unbiased fix, and the practical recipe behind R1-Zero–style training.

  • Training Composer 2: How Cursor Builds a Coding Agent Model

    A structured walkthrough of Sasha Rush's Training Composer 2 workshop: why Cursor chose Kimi K2.5, how continued pretraining and long-horizon RL fit together, what CursorBench measures, and where Composer is headed.

  • Leetcode Problems

    LeetCode grinding.

  • Hybrid Attention and MLA: The Tradeoff

    A side-by-side dive into Xiaomi MiMo's hybrid sliding-window/global attention and DeepSeek's Multi-head Latent Attention. The two answer the same question — how to make attention affordable at long context — with very different bets, and those bets shape everything from training infra to KV cache size.

  • Kimi K2.5: Joint Text–Vision Training and the Agent Swarm

    A walkthrough of two ideas behind Kimi K2.5: how joint text–vision pre-training and RL make each modality help the other, and how Agent Swarm replaces sequential tool use with a learned parallel orchestrator.

  • Inside DeepSeek's Sparse Attention: From NSA to DSA

    A deep dive into DeepSeek's two sparse attention designs — Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA) — covering the math, the hardware story, and why DSA in V3.2 looks so different from NSA.

  • AstroPaper 6.0

    AstroPaper v6: a from-scratch rewrite on Astro v6, Tailwind v4, and a new config system.

2025 1
March 1
  • AstroPaper 5.0

    AstroPaper v5: keep the clean look, updates under the hood.

2024 4
September 1
July 1
January 2
2023 3
September 1
  • AstroPaper 3.0

    AstroPaper Version 3: Elevating Your Web Experience with Astro v3 and Seamless View Transitions

July 1
January 1
  • AstroPaper 2.0

    AstroPaper with the enhancements of Astro v2: type-safe Markdown content, bug fixes, and a better dev experience.

2022 8
December 1
September 4
July 1
June 1
March 1