Flash Attention

此内容尚不支持你的语言。

Motivation

大家好今天我们来讨论一下FlashAttention 它现在是训练大模型默认采用的技术从FlashAttention的论文题目就可以看出它有两大优势一,Fast 它可以加快模型训练的速度二,Memory Efficient 显存高效的它可以减少显存的占用并且它保证Exact Attention 也就是它和标准的Attention计算得到的结果是完全一致的并不像其他一些算法是以降低Attention的精度来提高训练速度的最后,With IO Awareness 说明了它是通过对IO感知的方式来进行训练的也就是整个算法是以改进IO效率来达到的这个论文题目写的确实不错严简易赅首先我们看一下Transformer架构里一个Attention的标准计算过程首先是输入的Token的向量然后生成Q,K,V矩阵然后通过Q乘以K的转制得到S 接着对S按行求Softmax 接着得到注意力矩阵P 最后P乘以V得到注意力输出O 这里我们为了讲解方便,对整个过程进行了简化没有对S做Scale,也没有多头注意力机制,也没有Dropout 那我们用PyTorch写的代码在实际显卡上的Attention是如何计算的呢? 首先,矩阵Q,K,V存储在HBM里它们的形状都是N乘以D N是序列长度,D是特征维度第一步,从HBM加载Q,K矩阵到SRAM 第二步,计算出S,它等于Q乘以K的转制第三步,将S写到HBM 第四步,将S加载到SRAM 第五步,计算出P,它等于Softmax S 第六步,将P写出到HBM 第七步,从HBM加载P和V到SRAM 第八步,计算O,它等于P乘以V 第九步,把O写出到HBM 第十步,返回O 可以看到中间有很多临时变量的读写比如S和P矩阵它们的大小都是随着序列长度的平方增长的比如,拉马森8B 支持的序列长度为8192 每个头的维度为128 中间的S,P,临时矩阵占用的显存非常大保留中间结果,比如S,P,会占用显存但是还是需要的因为反向传播时需要它们来计算梯度

First let’s recap the standard procedure of computing attention in the Transformer architecture. It starts with the input tokens’ vectors, generating the matrices $Q$ , $K$ , $V$ . Then, by multiplying $Q$ with the transpose of $K$ , we obtain $S$ . Next, $S$ undergoes row-wise softmax operation to get the attention matrix $P$ . Finally, $P$ is multiplied by $V$ to get the attention output $O$ . For the sake of simplification, unless specified otherwise, the following discussion does not consider scaled dot product, multi-head attention, or dropout.

Let’s see how attention is computed on a physical GPU using PyTorch code. $Q$ , $K$ , and $V$ matrices are stored in HBM (High Bandwidth Memory) with shapes $N \times d$ , where $N$ is sequence length and $d$ is feature dimension. The process is as follows:

Load $Q$ , $K$ matrices from HBM to SRAM.
Compute $S$ as $Q \cdot K^T$ .
Write $S$ back to HBM.
Load $S$ from HBM to SRAM.
Compute $P$ as Softmax of $S$ .
Write $P$ back to HBM.
Load $P$ and $V$ from HBM to SRAM.
Compute $O$ as $P \cdot V$ .
Write $O$ back to HBM.
Return $O$ .

There are many temporary variable reads/writes, like matrices $S$ and $P$ , whose sizes grow quadratically with sequence length. These intermediate results are necessary for gradient computation during backpropagation.

compute/memory bandwidths in attention calculation

在模型训练时制约训练速度有两种情况一种情况是Compute Bound 训练速度的瓶颈在于运算比如对于大的矩阵乘法还有多Channel的卷积操作这些操作都是需要的数据量不大但是计算很复杂还有另一种情况是Memory Bound 训练速度的瓶颈在于对对于HBM数据的读取速度从HBM读取数据的速度跟不上运算的速度算力在等待数据主要的操作包括两类一类是安慰的操作比如软路和Dropout 还有一类是规约操作比如SUM, Softmax等这些操作都是需要数据很多但是计算相对简单 Attention计算操作主要是Memory Bound的计算可以看到 Compute Bound的操作比如矩阵乘法只占用了非常少的时间但是Memory Bound的计算占据了很多的时间对于Memory Bound的优化主要通过融合多个操作也叫做Fusion 它节约了原来多个操作之间要存取HBM的时间让多个操作只要存取一次HBM 但是中间结果我们之前说过它在反向传播时是有用的但这里为了节省访问HBM的时间我们不保存中间结果在反向传播时重新计算这里不清楚的话可以看我之前讲梯度检查点的视频显存里的存储是分级的有芯片内的缓存还有芯片外的HBM显存芯片内缓存容量小但是访问快芯片外的HBM容量大但是访问慢所以对于优化IO速度应该尽可能的让计算访问芯片内的缓存而尽可能减少访问芯片外的HBM的显存这里不清楚的同学可以看我上一期讲的 GPU原理的视频之前很多对Attention改进的算法都着眼于减少计算量但是FlashAttention 着眼于减少IO访问量以及通过访问芯片内缓存而加快IO的速度所以FlashAttention的目标就是尽量避免Attention操作从HBM来读写它是通过以下两点来实现的一通过对矩阵分块并且融合Attention内的所有操作不缓存中间结果到HBM 来加快速度二通过在反向传播时重新计算中间结果来解决不缓存中间结果带来的7度计算问题通过这两点改进它达到了2到4倍的训练速度的提升而且将原来训练时对显存的占用随序列长度平方增长减小为随序列长度线性增长可以看到序列长度越长它对显存节省越多序列长度为4096时它比PyTorch的实现节省20倍的显存占用

Training speed constraints include Compute Bound and Memory Bound scenarios. Compute Bound constraints are due to operations like large matrix multiplications and multi-channel convolutions, which require minimal data but are computationally intensive. Memory Bound constraints occur when HBM data read speeds cannot keep up with computation speeds, leading to idle computational resources. Main operations include element-wise operations like ReLU and Dropout and reduction operations like SUM and Softmax. Attention calculations are mostly Memory Bound.

Optimizations for Memory Bound scenarios involve fusing multiple operations (Fusion) to reduce HBM access time, allowing multiple operations to access HBM only once. However, intermediate results needed for backpropagation are recomputed rather than stored to save HBM access time. Those interested can refer to my earlier videos on gradient checkpointing.

Memory in GPUs is hierarchical, with fast-access on-chip cache and slower-access off-chip HBM. To optimize IO speed, computations should access on-chip cache as much as possible while reducing off-chip HBM access. Refer to my previous video on GPU architecture for more information.

Earlier Attention improvements focused on reducing computation. FlashAttention focuses on reducing IO access and accelerating IO speed via on-chip cache. Its goal is to avoid Attention operations’ HBM read/writes by:

Matrix partitioning and fusing all Attention operations without caching intermediate results in HBM.
Recomputing intermediate results during backpropagation to mitigate the computational cost of not caching them.

These improvements have enhanced training speed 2-4 times and reduced memory usage from quadratic to linear growth with sequence length. At a sequence length of 4096, it saves 20 times the memory compared to PyTorch.

Consider A100-40GB SXM, which has a compute bandwidth of 312TFLOPS and memory bandwidth of 1555GB/s and mixed precision training, the operational intensity for the particular hardware is

\begin{align*} \frac{\pi}{\beta}=\frac{312*1e12}{1555*1e9}=201\text{ FLOPS/Bytes} \end{align*}

Suppose we are to calculate attention for $S=QK^{T}$ ,

\begin{align*} \frac{\pi_{t}}{\beta_{t}}=\frac{2N^{2}d}{2Nd+2Nd+2N^{2}}=\frac{N^{2}d}{2Nd+N^{2}} \end{align*}

For more details refer to this blog. The bottleneck may vary for different values of $N$ and $d$ .

\begin{array}{c|c|c|c} \text{N} & \text{d} & \text{operations/bytes} & \text{bottleneck} \\ \hline 256 & 128 & 64 & \text{memory bound} \\ 1024 & 128 & 102 & \text{memory bound} \\ 4096 & 128 & 120 & \text{memory bound} \\ 256 & 256 & 85 & \text{memory bound} \\ 1024 & 256 & 171 & \text{memory bound} \\ 4096 & 256 & \underline{228} & \underline{\text{compute bound}} \\ 256 & 512 & 102 & \text{memory bound} \\ 1024 & 512 & \underline{256} & \underline{\text{compute bound}} \\ 4096 & 512 & \underline{410} & \underline{\text{compute bound}} \\ \end{array}

Approach

Through block computation and fusion, reduce the cache of intermediate results onto HBM.
Recompute the required intermediate results during backward pass.

hardware prerequisites

GPU SRAM(Static Random-Access Memory): 19TB/s(20MB)
GPU HBM(High Bandwidth Memory): 1.5TB/s(40GB)
CPU DRAM: 12.8GB/s(>1TB)

Algorithm FlashAttention

Set block sizes $B_c = \left\lfloor \frac{M}{4d} \right\rfloor$ , $B_r = \min \left(\left\lfloor \frac{M}{4d} \right\rfloor, d \right)$ .
Initialize $\mathbf{O} = (0)_{N \times d} \in \mathbb{R}^{N \times d}$ , $\ell = (0)_{N} \in \mathbb{R}^{N}$ , $m = (-\infty)_{N} \in \mathbb{R}^{N}$ in HBM.
Divide $\mathbf{Q}$ into $T_r = \left\lceil \frac{N}{B_r} \right\rceil$ blocks $\mathbf{Q}_1, \ldots, \mathbf{Q}_{T_r}$ of size $B_r \times d$ each, and divide $\mathbf{K}, \mathbf{V}$ in to $T_c = \left\lceil \frac{N}{B_c} \right\rceil$ blocks $\mathbf{K}_1, \ldots, \mathbf{K}_{T_c}$ and $\mathbf{V}_1, \ldots, \mathbf{V}_{T_c}$ , of size $B_c \times d$ each.
Divide $\mathbf{O}$ into $T_r$ blocks $\mathbf{O}_1, \ldots, \mathbf{O}_{T_r}$ of size $B_r \times d$ each, divide $\ell$ into $T_r$ blocks $\ell_i, \ldots, \ell_{T_r}$ of size $B_r$ each, divide $m$ into $T_r$ blocks $m_1, \ldots, m_{T_r}$ of size $B_r$ each.
for $1 \leq j \leq T_c$ do
$\quad$ Load $\mathbf{K}_j, \mathbf{V}_j$ from HBM to on-chip SRAM.
$\quad$ for $1 \leq i \leq T_r$ do
$\quad\quad$ Load $\mathbf{Q}_i, \mathbf{O}_i, \ell_i, m_i$ from HBM to on-chip SRAM.
$\quad\quad$ On chip, compute $S_{ij} = \mathbf{Q}_i \mathbf{K}_j^{\top} \in \mathbb{R}^{B_r \times B_c}$ .
$\quad\quad$ On chip, compute $\tilde{m}_{ij} = \text{rowmax}(S_{ij}) \in \mathbb{R}^{B_r}$ , $\tilde{P}_{ij} = \exp(S_{ij} - \tilde{m}_{ij}) \in \mathbb{R}^{B_r \times B_c}$ (pointwise), $\tilde{\ell}_{ij} = \text{rowsum}(\tilde{P}_{ij}) \in \mathbb{R}^{B_r}$ .
$\quad\quad$ On chip, compute $m_i^{\text{new}} = \max(m_i, \tilde{m}_{ij}) \in \mathbb{R}^{B_r}$ , $\ell_i^{\text{new}} = e^{m_i - m_i^{\text{new}}} \ell_i + e^{\tilde{m}_{ij} - m_i^{\text{new}}} \tilde{\ell}_{ij} \in \mathbb{R}^{B_r}$ .
$\quad\quad$ Write $\mathbf{O}_i \leftarrow \text{diag}(\ell_i^{\text{new}})^{-1} (\text{diag}(\ell_i) e^{m_i - m_i^{\text{new}}} \mathbf{O}_i + e^{\tilde{m}_{ij} - m_i^{\text{new}}} \tilde{P}_{ij} \mathbf{V}_j)$ .
$\quad\quad$ Write $\ell_i \leftarrow \ell_i^{\text{new}}, m_i \leftarrow m_i^{\text{new}}$ to HBM.
$\quad$ end for
end for
Return $\mathbf{O}$ .

Online Softmax

safe softmax

Vanilla softmax could encounter numerical instability, hence a safe version of softmax is used in flash attention where the maximum of the vector is subtracted before exponentiation.

\begin{align*} \text{softmax}(x_{i})=\frac{\exp\{x_{i}-\max(x)\}}{\sum_{j=1}^{n} \exp\{x_{j}-\max(x)\}} \end{align*}

3-pass softmax

For $i=1, \cdots, N$ ,

\begin{align*} m_{i}:=\max(m_{i-1}, x_{i}) \end{align*}

For $i=1, \cdots, N$ ,

\begin{align*} d_{i}:=d_{i-1}+\exp\{x_{i}-m_{N}\} \end{align*}

For $i=1, \cdots, N$ ,

\begin{align*} a_{i}:=\exp\{x_{i}-m_{N}\}/d_{N} \end{align*}

2-pass softmax

图解大模型计算加速系列：Flash Attention V1，从硬件到计算逻辑

下面我们看一下如何通过矩阵分块和融合多个计算来减少对HBM的访问这里我们先跳过Softmax的操作因为它比较特殊分块计算比较麻烦后面来单独讨论我们先认为结果直接就是S乘以V 我们看一下整个过程开始时我们从HBM里读取Q的前两行 K转制的前三列 V的前三列然后传入到SRAM上对它们进行计算 Q乘以K的转制得到S 并不存入HBM 然后直接和V的分块进行计算这时我们得到的并不是最终的结果 O的前两行因为我们知道 O是对所有的V的一个加权平均目前只是对V的前三行进行加权平均后边我们还要对O进行更新这里我们用浅一点的颜色来表示这个值还只是一个中间结果接着K和V的分块还保留在SRAM里从HBM里读取O的中间 Q的中间两行然后经过同样的计算得到O的第三和第四行的中间结果然后继续保留K和V的分块在SRAM里从HBM里读取Q的最后两行经过同样的计算得到O的最后两行的中间结果接下来读取K转制的后三列 V的后三行 Q的前两行进行计算这里算出的O是对V的后三行值的加权平均再从HBM里读取O之前保存的中间结果也就是对V的前三行的加权平均值进行加核这就是O最终的结果了同样保持K和V的分块不变从HBM里读取下一个分块的Q 进行计算从HBM里读取之前计算的中间结果 O 加核更新后存入HBM 最后继续保持SRAM里的K和V分块不变从HBM里读取最后一个分块的Q 进行计算从HBM里读取之前计算的中间结果 O 加核更新后存入HBM 这时就完成了一个Attention的计算我们可以发现通过将矩阵分块以及将多部计算进行融合中途没有将中间计算结果 S存入HBM 大大减少了IO的时间这一切看起来都不错除了Softmax Softmax是暗行进行的只有一行所有的数据都计算完成后才能进行这里的求核计算所以我们想要让我们之前的矩阵分块对Attention多部进行融合计算得以进行的前提是必须解决Softmax的分块计算问题下面我们就来讨论Softmax的分块计算现在我们训练都是混合精度在FP16下进行如果X等于12 则E的X次方就大于FP16能表示的最大的数了为了解决这个数值溢出问题人们提出了一种叫做 Safe Softmax的算法首先找出从X1到Xn里最大的值m 给Softmax计算公式的分子和分母同时除以E的m次方 Softmax的结果不变进行化简就变成了后面的表达式 E的指数部分就都小于等于0了这时用FP16表示就不会有数值溢出的问题了我们再看一下 Safe Softmax的计算过程有一组X 通过mx求出X里的最大值通过px将X变化成E的xi减去mx 通过Lx对所有的px求和最终的Softmax值就是px除以Lx 接下来我们看一下如果有一组X 它有In个我们将它分成两个部分第一个部分是从1到n 第二个部分是从n加1到2n 然后对两个块分别计算它们的mx最大值 pxE的x次方减去mx Lxpx的求和然后我们看是否能求出对于原始In个X的正确的Softmax的值首先mx等于max mx1 mx2 我们需要两个临时变量保存两个分块里的最大值 mx1 mx2 就可以保证最后可以求出全局最大值mx了第二步我们把px1和px2拼接起来然后给它们都乘以一个系数因为mx是mx1和mx2里较大的值所以它肯定等于mx1 或者mx2 我们假设mx等于mx2 那么后面E的指数项就等于0 E的0次方等于1 后边就是px2不变这里很好理解因为px2分块计算的时候它的x减去的就是全局最大值所以此时不需要再进行调整那么对于px1的部分它在分块计算时它的x减去的是局部最大值不是全局最大值那么它和全局最大值比少减了多少就是这里的mx1减去mx 现在给它补回来这样乘以这个系数后 px1就被调整为减去全局最大值的正确结果了对于Lx的计算也是同样的道理把Lx1和Lx2调整后加起来得到全局的正确的加和值最后用px除以Lx 就得到了2n个x 正确的softmax的值了可以看到softmax 也可以通过分块来计算了只是我们需要额外保存几个变量 mx1 mx2 Lx1 Lx2 不过他们对于softmax 每行都各占用一个数字存储占用非常小另外就是在分块进行合并时需要额外的调整计算增加了计算量但是这些计算量相对于减少的IO时间都是非常划算的我们看一下伪代码的实现 QKV存储在HBM上它们的维度都是nxd GPU芯片上的sRAM大小为m 然后确定列分块大小为m除以4d 向上取整为什么是4 因为要存放QKVO 4个分块矩阵行分块大小为m除以4d 和d取小的值这里之所以要和d相比取小的值是为了控制Q矩阵分块最大就是一个d乘以d的方阵不要让Q分块矩阵的行太大从而让在sRAM里生成的中间矩阵计算结果太大而超出sRAM的大小在HBM里初始化OLM 然后对QKV进行分块对OLM进行分块然后进入循环外循环是KV 内循环是Q 这里和我们之前动画演示的例子是一致的首先从HBM里读取K和V的分块到sRAM 然后循环读取QOLM的分块到sRAM 计算分块的Q乘以K的转制然后计算分块的MPL 然后根据之前分块已经计算出来的最大值和当前分块的最大值进行比较找出新的最大值同时更新L的值接着更新O的值对角矩阵的逆矩阵等于原始矩阵对角元素的导数构成的矩阵乘以后面的O值相当于给每一行除以每一行的求和值这里给原来的OI 先乘以原来的DAG LI还原回去加上新的O的值再除以新的求和值更新O值到HBM 更新L M到HBM 最后返回O 接着我们看反向传播在前向传播时会保存softmax计算得到的 M和L的中间值这样在反向传播分块计算时就可以快速重新计算激活值了这可以看作是另一种形式的梯度检查点最后我们发现 FlashAttention的计算量比标准实现增加了一些但是对HBM的IO访问量大幅减少了训练的时间更是大幅缩减到原来的1⁶ FlashAttention2 大致思想和FlashAttention1类似增加了一些工程上的优化比如减少了非矩阵乘法的计算将Q改为外循环 KV改为内循环更进一步减少了对HBM的读写增加了并行度最后进一步利用分块计算的优势如果判断一个分块是原始矩阵的上三角部分也就是它是被mask掉的部分那么就不需要进行Attention的计算了从而更进一步减少了计算量

We’ll now explore how matrix partitioning and computing fusion reduces HBM access. We bypass Softmax initially for its complexity in partitioned computation, discussed later. Assume results directly as $S \times V$ .

The process starts by reading the first two rows of Q, three columns of $K^T$ , and three columns of V from HBM to SRAM for computation. $Q \times K^T$ yields $S$ , which isn’t stored in HBM, used immediately with V partitions, resulting not in final $O$ ‘s two rows but intermediate results updated later. Different colors denote intermediate results needing update.

Keep $K$ and V partitions in SRAM, load $Q$ ‘s middle two rows from HBM, compute similarly for $O$ ‘s middle rows, repeat for final rows. Load last columns of $K^T$ , last rows of V, previous Q rows, update $O$ using earlier intermediate results. Maintain $K$ , V partitions, load next Q chunk, compute, update $O$ .

Through partitioning and fusion, we avoid storing intermediate $S$ in HBM, significantly reducing IO time. Except for Softmax, row-wise computation requiring complete data for summation, crucial for fusion requires solving Softmax’s partition computation.

Softmax computation in mixed-precision training (FP16) risks overflow for large exponents, fixed by Safe Softmax. It adjusts all terms by the largest value $m$ to prevent overflow. Simplifies to expressions where exponents are non-positive, manageable in FP16.

Safe Softmax involves finding maximum $m$ , transforming $x_i \to e^{x_i-m}$ , and summing for normalization. For partitioned $n$ values, $mx = \max(mx1, mx2)$ . Adjust $px1, px2$ with coefficients, ensuring correct global maximum adjustments. Sum adjusted partitions for correct Softmax.

Though requiring extra variables per row, storage is minor. The balancing of computation with IO reduction is favorable.

Pseudo-code overview: QKV stored in $n \times d$ HBM, SRAM size $m$ . Column block size $\lceil \frac{m}{4d} \rceil$ . 4 for QKVO partitions. Row block size $\min(\frac{m}{4d}, d)$ to control Q block size. Initialize $O, L, M$ in HBM, partition QKV, $OL_M$ .

Outer loop KV, inner Q matches our animations. Load K, V partition to SRAM, loop Q, OL $m$ partitions, compute Q block, $QK^T$ , $MPL$ , update max values, $O$ using DAG $L$ ’s inverse matrix, update $O, L, M$ in HBM.

Backpropagation saves Softmax’s intermediate $M, L$ for quick recomputation during partitioned backpropagation, akin to gradient checkpointing.

FlashAttention increases computation slightly but drastically reduces HBM IO, markedly decreasing training time. FlashAttention2 follows similar principles, with further optimizations, reducing non-matrix computation, reversing Q/KV loops, enhancing parallelism, exploiting partition advantages, skipping computations for masked upper-triangle, further reducing computation.