Fast and Memory-Efficient Exact Attention with IO-Awareness

FlashAttention computes exact attention: its results are identical to those of standard/vanilla attention. This stands in contrast to approximate-attention algorithms that gain speed by trading away accuracy. Finally, IO-awareness means the improvements come from increasing IO efficiency rather than from reducing computation.

Motivation

First let’s recap the standard procedure of computing attention in the Transformer architecture.

It starts with the input tokens’ vectors, generating the matrices Q, K, V.

Then, by multiplying Q with the transpose of K, we obtain S.

Next, S undergoes row-wise softmax operation to get the attention matrix P.

Finally, P is multiplied by V to get the attention output O.

For simplicity, unless specified otherwise, the following discussion ignores the scaling factor of the dot product, multi-head attention, and dropout.

Let’s see how attention is computed on a physical GPU using PyTorch code.

Q, K, and V are stored in HBM (High Bandwidth Memory) with shape N \times d, where N is the sequence length and d is the feature dimension. The process is as follows (see the PyTorch sketch after the list):


  1. Load Q, K matrices from HBM to SRAM.

  2. Compute S as Q \cdot K^T.

  3. Write S back to HBM.

  4. Load S from HBM to SRAM.

  5. Compute P as Softmax of S.

  6. Write P back to HBM.

  7. Load P and V from HBM to SRAM.

  8. Compute O as P \cdot V.

  9. Write O back to HBM.

  10. Return O.
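A minimal PyTorch sketch of this procedure, for illustration only: scaling, masking, and dropout are omitted as noted above, and in eager PyTorch the intermediates S and P are materialized in GPU memory exactly as in the steps listed.

```python
import torch

N, d = 1024, 128
device = "cuda" if torch.cuda.is_available() else "cpu"     # on a GPU, these tensors live in HBM
Q, K, V = (torch.randn(N, d, device=device) for _ in range(3))

S = Q @ K.T                     # steps 1-3: N x N score matrix, written to memory
P = torch.softmax(S, dim=-1)    # steps 4-6: N x N attention matrix, a second quadratic intermediate
O = P @ V                       # steps 7-9: N x d output
```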


There are many temporary variable reads/writes, like matrices S and P, whose sizes grow quadratically with sequence length.

These intermediate results are necessary for gradient computation during backpropagation.

compute/memory bandwidths in attention calculation

Training speed is limited by one of two bottlenecks: Compute Bound or Memory Bound.

Compute Bound constraints arise from operations such as large matrix multiplications and multi-channel convolutions, which move relatively little data but are computationally intensive.

Memory Bound constraints occur when HBM reads and writes cannot keep up with the compute units, leaving computational resources idle.

Typical memory-bound operations include element-wise operations such as ReLU and Dropout, and reduction operations such as sum and Softmax.

Attention calculations are mostly Memory Bound.

Optimizations for Memory Bound scenarios involve fusing multiple operations (kernel fusion) so that a chain of operations reads from and writes to HBM only once instead of once per operation. Intermediate results needed for backpropagation are then recomputed rather than stored, again to save HBM accesses. Those interested can refer to my earlier videos on gradient checkpointing.
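As a quick illustration of the recompute-instead-of-store idea, here is a minimal gradient-checkpointing sketch using PyTorch's standard torch.utils.checkpoint utility; the layer sizes are arbitrary and unrelated to FlashAttention's kernels.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we choose not to keep in memory.
block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512))
x = torch.randn(8, 512, requires_grad=True)

# checkpoint() discards block's internal activations after the forward pass
# and recomputes them during backward, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```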

Memory in GPUs is hierarchical, with fast-access on-chip cache and slower-access off-chip HBM. To optimize IO speed, computations should access on-chip cache as much as possible while reducing off-chip HBM access. Refer to my previous video on GPU architecture for more information.

Earlier Attention improvements focused on reducing computation. FlashAttention focuses on reducing IO access and accelerating IO speed via on-chip cache. Its goal is to avoid Attention operations’ HBM read/writes by:

  1. Matrix partitioning and fusing all Attention operations without caching intermediate results in HBM.
  2. Recomputing intermediate results during backpropagation to mitigate the computational cost of not caching them.

These improvements speed up training by 2-4x and reduce memory usage from quadratic to linear in the sequence length; at a sequence length of 4096, FlashAttention uses roughly 20x less memory than standard PyTorch attention.

Consider the A100-40GB SXM, which offers 312 TFLOPS of FP16/BF16 compute and 1555 GB/s of memory bandwidth. Under mixed-precision training, the operational intensity threshold (ridge point) of this hardware is

\frac{\pi}{\beta} = \frac{312 \times 10^{12}}{1555 \times 10^{9}} \approx 201 \ \text{FLOPS/Byte}

Suppose we compute the attention score matrix S = QK^{T}; its operational intensity is

\frac{\pi_t}{\beta_t} = \frac{2N^2 d}{2(2Nd + N^2)} = \frac{N^2 d}{2Nd + N^2}

For more details refer to this blog. The bottleneck may vary for different values of N and d.

| N | d | operations/bytes | bottleneck |
|---|---|---|---|
| 256 | 128 | 64 | memory bound |
| 1024 | 128 | 102 | memory bound |
| 4096 | 128 | 120 | memory bound |
| 256 | 256 | 85 | memory bound |
| 1024 | 256 | 171 | memory bound |
| 4096 | 256 | 228 | compute bound |
| 256 | 512 | 102 | memory bound |
| 1024 | 512 | 256 | compute bound |
| 4096 | 512 | 410 | compute bound |
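As a rough check, a small Python sketch that reproduces the operations/bytes column above under the same assumptions (FP16, i.e. 2 bytes per element; HBM traffic counted as reading Q and K and writing S):

```python
def attention_intensity(N, d, bytes_per_elem=2):
    """Operational intensity of S = Q K^T in FLOPs per byte of HBM traffic."""
    flops = 2 * N * N * d                              # one multiply-accumulate per (row, col, k)
    hbm_bytes = bytes_per_elem * (2 * N * d + N * N)   # read Q and K, write S
    return flops / hbm_bytes

RIDGE = 312e12 / 1555e9   # ~201 FLOPs/byte for the A100-40GB SXM figures above

for d in (128, 256, 512):
    for N in (256, 1024, 4096):
        r = attention_intensity(N, d)
        print(f"N={N:5d}  d={d:3d}  {r:6.1f} FLOPs/byte  ->  "
              f"{'compute bound' if r > RIDGE else 'memory bound'}")
```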

Approach

  • Through blockwise computation and kernel fusion, avoid caching intermediate results in HBM.
  • Recompute the required intermediate results during the backward pass.

hardware prerequisites

  • GPU SRAM (Static Random-Access Memory): 19 TB/s (20 MB)
  • GPU HBM (High Bandwidth Memory): 1.5 TB/s (40 GB)
  • CPU DRAM: 12.8 GB/s (>1 TB)

Algorithm FlashAttention
  1. Set block sizes B_c = \left\lfloor \frac{M}{4d} \right\rfloor, B_r = \min \left(\left\lfloor \frac{M}{4d} \right\rfloor, d \right).
  2. Initialize \mathbf{O} = (0)_{N \times d} \in \mathbb{R}^{N \times d}, \ell = (0)_{N} \in \mathbb{R}^{N}, m = (-\infty)_{N} \in \mathbb{R}^{N} in HBM.
  3. Divide \mathbf{Q} into T_r = \left\lceil \frac{N}{B_r} \right\rceil blocks \mathbf{Q}_1, \ldots, \mathbf{Q}_{T_r} of size B_r \times d each, and divide \mathbf{K}, \mathbf{V} into T_c = \left\lceil \frac{N}{B_c} \right\rceil blocks \mathbf{K}_1, \ldots, \mathbf{K}_{T_c} and \mathbf{V}_1, \ldots, \mathbf{V}_{T_c}, of size B_c \times d each.
  4. Divide \mathbf{O} into T_r blocks \mathbf{O}_1, \ldots, \mathbf{O}_{T_r} of size B_r \times d each, divide \ell into T_r blocks \ell_1, \ldots, \ell_{T_r} of size B_r each, divide m into T_r blocks m_1, \ldots, m_{T_r} of size B_r each.
  5. for 1 \leq j \leq T_c do
  6. \quadLoad \mathbf{K}_j, \mathbf{V}_j from HBM to on-chip SRAM.
  7. \quadfor 1 \leq i \leq T_r do
  8. \quad\quadLoad \mathbf{Q}_i, \mathbf{O}_i, \ell_i, m_i from HBM to on-chip SRAM.
  9. \quad\quadOn chip, compute S_{ij} = \mathbf{Q}_i \mathbf{K}_j^{\top} \in \mathbb{R}^{B_r \times B_c}.
  10. \quad\quadOn chip, compute \tilde{m}_{ij} = \text{rowmax}(S_{ij}) \in \mathbb{R}^{B_r}, \tilde{P}_{ij} = \exp(S_{ij} - \tilde{m}_{ij}) \in \mathbb{R}^{B_r \times B_c} (pointwise), \tilde{\ell}_{ij} = \text{rowsum}(\tilde{P}_{ij}) \in \mathbb{R}^{B_r}.
  11. \quad\quadOn chip, compute m_i^{\text{new}} = \max(m_i, \tilde{m}_{ij}) \in \mathbb{R}^{B_r}, \ell_i^{\text{new}} = e^{m_i - m_i^{\text{new}}} \ell_i + e^{\tilde{m}_{ij} - m_i^{\text{new}}} \tilde{\ell}_{ij} \in \mathbb{R}^{B_r}.
  12. \quad\quadWrite \mathbf{O}_i \leftarrow \text{diag}(\ell_i^{\text{new}})^{-1} (\text{diag}(\ell_i) e^{m_i - m_i^{\text{new}}} \mathbf{O}_i + e^{\tilde{m}_{ij} - m_i^{\text{new}}} \tilde{P}_{ij} \mathbf{V}_j) to HBM.
  13. \quad\quadWrite \ell_i \leftarrow \ell_i^{\text{new}}, m_i \leftarrow m_i^{\text{new}} to HBM.
  14. \quadend for
  15. end for
  16. Return \mathbf{O}.
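Below is a pure-PyTorch sketch of this forward pass, for illustration only: the real FlashAttention runs as a fused CUDA kernel, and the default block sizes and test sizes here are arbitrary choices rather than the B_r, B_c derived from the SRAM size M.

```python
import torch

def flash_attention_forward(Q, K, V, B_r=64, B_c=64):
    """Blockwise exact attention following the loops above (no scaling, masking, or dropout)."""
    N, d = Q.shape
    O = torch.zeros(N, d)
    l = torch.zeros(N)                       # running softmax denominators, one per row
    m = torch.full((N,), float("-inf"))      # running row maxima

    for j in range(0, N, B_c):               # outer loop: K/V blocks (kept "on chip")
        Kj, Vj = K[j:j + B_c], V[j:j + B_c]
        for i in range(0, N, B_r):           # inner loop: Q/O blocks
            Qi, Oi = Q[i:i + B_r], O[i:i + B_r]
            li, mi = l[i:i + B_r], m[i:i + B_r]

            S_ij = Qi @ Kj.T                                 # B_r x B_c score block, never stored globally
            m_ij = S_ij.max(dim=1).values                    # block-local row max
            P_ij = torch.exp(S_ij - m_ij[:, None])
            l_ij = P_ij.sum(dim=1)

            m_new = torch.maximum(mi, m_ij)                  # merge running statistics
            l_new = torch.exp(mi - m_new) * li + torch.exp(m_ij - m_new) * l_ij

            # rescale the old partial output and add the new block's contribution
            O[i:i + B_r] = ((li * torch.exp(mi - m_new))[:, None] * Oi
                            + torch.exp(m_ij - m_new)[:, None] * (P_ij @ Vj)) / l_new[:, None]
            l[i:i + B_r], m[i:i + B_r] = l_new, m_new
    return O

Q, K, V = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax(Q @ K.T, dim=-1) @ V      # unscaled reference, matching the simplification above
assert torch.allclose(flash_attention_forward(Q, K, V), ref, atol=1e-4)
```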

Online Softmax

safe softmax

Vanilla softmax can be numerically unstable, so FlashAttention uses a safe version of softmax in which the maximum of the vector is subtracted before exponentiation.

\text{softmax}(x_i) = \frac{\exp\{x_i - \max(x)\}}{\sum_{j=1}^{n} \exp\{x_j - \max(x)\}}

3-pass softmax

For i=1, ..., N,

m_i \leftarrow \max(m_{i-1}, x_i)

For i=1, ..., N,

d_i \leftarrow d_{i-1} + \exp\{x_i - m_N\}

For i=1, ..., N,

a_i \leftarrow \frac{\exp\{x_i - m_N\}}{d_N}
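A small PyTorch sketch of these three passes, assuming m_0 = -\infty and d_0 = 0, checked against torch.softmax:

```python
import torch

def safe_softmax_3pass(x):
    """Safe softmax computed with the three sequential passes listed above."""
    N = x.shape[0]
    m = torch.tensor(float("-inf"))
    for i in range(N):                  # pass 1: m_i = max(m_{i-1}, x_i)
        m = torch.maximum(m, x[i])
    d = torch.tensor(0.0)
    for i in range(N):                  # pass 2: d_i = d_{i-1} + exp(x_i - m_N)
        d = d + torch.exp(x[i] - m)
    a = torch.empty_like(x)
    for i in range(N):                  # pass 3: a_i = exp(x_i - m_N) / d_N
        a[i] = torch.exp(x[i] - m) / d
    return a

x = torch.randn(16)
assert torch.allclose(safe_softmax_3pass(x), torch.softmax(x, dim=0), atol=1e-6)
```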

2-pass softmax
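This heading refers to what is commonly called online softmax: the first two passes can be fused by keeping a running maximum and rescaling the running sum whenever the maximum grows. A sketch of that recurrence (again assuming m_0 = -\infty and d_0 = 0):

For i=1, ..., N,

m_i \leftarrow \max(m_{i-1}, x_i), \quad d_i \leftarrow d_{i-1} \exp\{m_{i-1} - m_i\} + \exp\{x_i - m_i\}

For i=1, ..., N,

a_i \leftarrow \frac{\exp\{x_i - m_N\}}{d_N}

This is the same rescaling that FlashAttention applies blockwise in steps 10-12 of the algorithm above, which is why Softmax never needs to see a full row at once.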

readme

We’ll now explore how matrix partitioning and operator fusion reduce HBM accesses. Because Softmax is the tricky part to compute blockwise, we set it aside for now (it is discussed later) and simply assume the output is computed directly as S \times V.

The process starts by reading the first two rows of Q, the first three columns of K^{T}, and the corresponding three rows of V from HBM into SRAM. Multiplying the Q block by the K^{T} block yields a block of S, which is never written to HBM; it is used immediately with the V block. The result is not the final first two rows of O but an intermediate partial result that will be updated later.

Keeping the K and V blocks in SRAM, we then load Q's middle two rows from HBM, compute the corresponding partial result for O's middle rows, and repeat for O's final rows. Next, we load the following columns of K^{T} and the corresponding rows of V, iterate over the Q blocks again, and update O using the earlier intermediate results. This continues, block by block, until all of K and V have been processed and O holds the final output.
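A minimal sketch of this partitioned computation with Softmax skipped, so the output is simply (QK^{\top})V; the block sizes (two rows of Q, three rows of K/V) mirror the animation and are otherwise arbitrary:

```python
import torch

N, d, B_r, B_c = 8, 4, 2, 3    # tiny sizes; B_c = 3 mirrors the three K^T columns per step above
Q, K, V = (torch.randn(N, d) for _ in range(3))

O = torch.zeros(N, d)
for j in range(0, N, B_c):                 # K^T columns / V rows kept in "SRAM"
    Kj, Vj = K[j:j + B_c], V[j:j + B_c]
    for i in range(0, N, B_r):             # stream Q (and O) blocks
        O[i:i + B_r] += (Q[i:i + B_r] @ Kj.T) @ Vj   # the S block is used immediately, never stored

assert torch.allclose(O, (Q @ K.T) @ V, atol=1e-5)   # matches the unpartitioned product
```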

Through partitioning and fusion, we avoid storing the intermediate S in HBM, significantly reducing IO time. The remaining obstacle is Softmax: it is a row-wise operation whose normalization sum needs the whole row, so making the fusion work requires a way to compute Softmax over partitions.

In mixed-precision training (FP16), Softmax risks overflow when the exponents are large; Safe Softmax fixes this. It subtracts the largest value m from every term, so all exponents become non-positive and the exponentials stay representable in FP16.

Safe Softmax first finds the maximum m, transforms each x_i into e^{x_i - m}, and sums these terms for normalization. When a row is split into partitions x^{(1)} and x^{(2)}, the global maximum is m(x) = \max(m(x^{(1)}), m(x^{(2)})), and each partition's exponential sum is rescaled by the coefficient e^{m(x^{(1)}) - m(x)} or e^{m(x^{(2)}) - m(x)} so that it is expressed relative to the global maximum. Adding the rescaled partial sums then gives exactly the normalizer of the full-row Softmax.

This requires keeping a couple of extra running statistics per row (the current maximum m and the current sum \ell), but that storage is negligible, and the small amount of extra computation is a favorable trade for the reduction in IO.
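A small sketch of that merging rule: the per-partition (max, sum) statistics can be combined into the full-row statistics without ever seeing the two partitions together (the split point here is arbitrary):

```python
import torch

x = torch.randn(10)
x1, x2 = x[:6], x[6:]                      # two partitions of a single row

m1, m2 = x1.max(), x2.max()                # per-partition maxima
d1, d2 = torch.exp(x1 - m1).sum(), torch.exp(x2 - m2).sum()   # per-partition exponential sums

m = torch.maximum(m1, m2)                  # global maximum
d = torch.exp(m1 - m) * d1 + torch.exp(m2 - m) * d2           # rescale each sum, then add

assert torch.allclose(d, torch.exp(x - x.max()).sum())        # matches the full-row normalizer
```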

Pseudo-code overview: Q, K, V are stored in HBM with shape N \times d, and the SRAM has size M. The column block size is B_c = \lfloor M/4d \rfloor (the factor 4 accounts for the Q, K, V, O blocks that must fit on chip), and the row block size is B_r = \min(\lfloor M/4d \rfloor, d) to cap the size of the Q block. We initialize O, \ell, m in HBM and partition Q, K, V as well as O, \ell, m into blocks.

The outer loop runs over K/V blocks and the inner loop over Q blocks, matching our animations. For each K_j, V_j block loaded into SRAM, we loop over the Q_i, O_i, \ell_i, m_i blocks: compute S_{ij} = Q_i K_j^{\top}, then the block-local row max \tilde{m}_{ij}, exponentials \tilde{P}_{ij}, and row sums \tilde{\ell}_{ij}; update the running maximum and sum; rescale and update O_i using \text{diag}(\ell_i^{\text{new}})^{-1}; and write O_i, \ell_i, m_i back to HBM.

For backpropagation, only Softmax's per-row statistics m and \ell are saved, so the blocks of S and P can be cheaply recomputed during the partitioned backward pass, akin to gradient checkpointing.
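A tiny sketch of that recomputation, assuming the forward pass also kept the final per-row statistics \ell and m (which the simplified forward sketch above does not return):

```python
import torch

def recompute_P_block(Qi, Kj, m_i, l_i):
    """Rebuild a softmax block from Q, K and the saved row statistics, without the N x N matrices."""
    S_ij = Qi @ Kj.T
    return torch.exp(S_ij - m_i[:, None]) / l_i[:, None]   # equals the corresponding block of P
```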

Tip

FlashAttention increases computation slightly but drastically reduces HBM IO, markedly decreasing training time. FlashAttention-2 follows the same principles with further optimizations: it reduces non-matmul computation, swaps the Q and K/V loops (the outer loop runs over Q), improves parallelism, and exploits the block structure to skip computation on fully masked upper-triangular blocks in causal attention, reducing computation further.