Tag: transformers

All the articles with the tag "transformers".

Inside DSpark: DeepSeek's Confidence-Scheduled Speculative Decoding

28 Jun, 2026

A deep dive into DSpark, DeepSeek's new draft model for speculative decoding. We cover what it actually is — a semi-autoregressive drafter paired with a confidence-scheduled, load-aware verifier — how it differs from vanilla speculative decoding, Medusa, EAGLE-3 and parallel drafters like DFlash, and why it delivers 60–85% faster per-user generation inside the DeepSeek-V4 serving stack.

Inside GLM-5.2: IndexShare, KVShare, and the End-to-End TV Loss

21 Jun, 2026

A deep dive into GLM-5.2 — a 753B open-weight MoE that serves a 1M-token context. We walk the three innovations that make it cheap to run: IndexShare (cross-layer sparse-attention index reuse), KVShare + rejection sampling for speculative decoding, and a novel end-to-end TV loss that breaks the entropy bound on MTP acceptance. Plus the slime RL stack behind its long-horizon agentic skills.

From AlexNet to World Models: The Evolution of Multi-Modal Neural Networks

2 Jun, 2026

A ground-up tour of how neural networks learned to see, then to see-and-read, and finally to imagine. From AlexNet and CNNs, through CLIP and the vision-language models behind GPT-4V, to world models like Dreamer, V-JEPA 2, and LeWorldModel — with architectures, math, and benchmark numbers along the way.

Attention Residuals: Softmax Attention Over Depth

1 Jun, 2026

A deep dive into the Kimi team's Attention Residuals (AttnRes) — replacing the fixed-weight residual connection with learned softmax attention over depth. Covers the time–depth duality, Full vs Block AttnRes, the structured-matrix view that unifies prior residual variants, the pipeline-parallel infra that makes it practical, and the scaling-law and 48B-MoE results.
