```python
>>> a = torch.tensor([2048], dtype=torch.float16)
>>> b = torch.tensor([0.5], dtype=torch.float16)
>>> a + b
tensor([2048.], dtype=torch.float16)

>>> a = torch.tensor([2048], dtype=torch.float32)
>>> b = torch.tensor([0.5], dtype=torch.float32)
>>> a + b
tensor([2048.5000])
```
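The reason is that float16 has only a 10-bit mantissa, so around 2048 = 2^11 the gap between adjacent representable values is already 2, and an increment of 0.5 is simply rounded away. A quick check:

```python
import torch

eps = torch.finfo(torch.float16).eps   # machine epsilon of float16: 2**-10
print(eps)                             # 0.0009765625
print(2048 * eps)                      # 2.0 -- the spacing between float16 values near 2048
```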
using AMP in PyTorch
```python
use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        # Scales loss. Calls ``backward()`` on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()
        # ``scaler.step()`` first unscales the gradients of the optimizer's assigned parameters.
        # If these gradients do not contain ``inf``s or ``NaN``s, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)
        # Updates the scale for next iteration.
        scaler.update()
        opt.zero_grad()
```
Distributed Training
Model Parallelism
low GPU utilization
When dealing with a very large model that cannot fit on a single GPU,
we can partition the model across multiple GPUs, so that each GPU holds only a subset of the layers.
For a particular batch, the data is fed into GPU 0, which performs the forward pass over its layers
and passes the output to GPU 1; GPU 1 continues the forward pass, and so on.
The final GPU then performs the backward propagation, passes the gradients back to the previous GPU,
and the gradients flow all the way back to GPU 0.
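As a concrete illustration, here is a minimal sketch of naive model parallelism in PyTorch, assuming two GPUs and a toy two-part network (the class and layer sizes are placeholders, not from any particular library):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """A toy model split across two GPUs: part0 on cuda:0, part1 on cuda:1."""

    def __init__(self, in_size, hidden_size, out_size):
        super().__init__()
        self.part0 = nn.Sequential(nn.Linear(in_size, hidden_size), nn.ReLU()).to("cuda:0")
        self.part1 = nn.Linear(hidden_size, out_size).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))      # forward pass on GPU 0
        return self.part1(x.to("cuda:1"))   # activations move to GPU 1, which continues
```

During the backward pass, autograd routes the gradients back across the same device boundaries in reverse order.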
With this approach we can certainly train a larger model. However, note that at any given time,
only a single GPU is actually working, resulting in low GPU utilization.
To see this, suppose there are K GPUs, the forward pass takes $T_f$ time on each GPU,
and the backward propagation takes $T_b$ time on each GPU. The total available GPU time is
the number of GPUs multiplied by the wall-clock time per batch, i.e.,

$$K \cdot K \cdot (T_f + T_b),$$

while the actual GPU running time is

$$K \cdot (T_f + T_b).$$

The utilization is then

$$\frac{K \cdot (T_f + T_b)}{K \cdot K \cdot (T_f + T_b)} = \frac{1}{K}.$$

As K increases, i.e., as more GPUs are added, the utilization 1/K decreases towards 0.
redundant memory usage from intermediate results
To compute gradients during backward propagation, each GPU must keep the intermediate results (activations) from its forward pass in GPU memory.
Suppose the batch size is B, the number of layers is L, and each layer's width is dominated by d;
then for each GPU, the extra memory cost is

$$O\!\left(B \cdot \frac{L}{K} \cdot d\right).$$

As the model size and batch size grow, this extra memory cost may become significant.
Pipeline Parallelism
To mitigate the above issues with model parallelism, Google introduced GPipe,
which uses pipeline parallelism to improve GPU utilization and reduce memory usage.
The main idea of pipeline parallelism is essentially to introduce data parallelism on top of model parallelism.
During the forward pass, the pipeline divides each mini-batch of size B into M smaller micro-batches,
which are pipelined through K GPUs.
During the backward pass, gradients for each micro-batch are computed based on the same model parameters used for the forward pass.
At the end of each mini-batch, gradients from all M micro-batches are accumulated and applied to update the
model parameters across all accelerators.
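The gradient-accumulation half of this scheme can be sketched in plain PyTorch (leaving out the device-to-device pipelining itself), reusing the hypothetical `net`, `loss_fn`, `opt`, `input`, and `target` from the AMP example above:

```python
M = 4                        # number of micro-batches per mini-batch
opt.zero_grad()
for micro_input, micro_target in zip(input.chunk(M), target.chunk(M)):
    loss = loss_fn(net(micro_input), micro_target)
    (loss / M).backward()    # gradients from all M micro-batches accumulate in .grad
opt.step()                   # a single parameter update per mini-batch
```

Dividing the loss by M assumes the loss is an average over examples, so the accumulated gradient matches the gradient of the full mini-batch.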
If batch normalization is used in the network, the sufficient statistics of inputs during training
are computed over each micro-batch and over replicas if necessary.
GPipe also tracks the moving average of the sufficient statistics over the entire mini-batch, to be used during evaluation.
Layer normalization is not affected.
The utilization of pipeline parallelism is given by

$$\frac{K \cdot M \cdot (T_f + T_b)}{K \cdot (M + K - 1) \cdot (T_f + T_b)} = \frac{M}{M + K - 1}.$$

When M is noticeably larger than K, the utilization is close to 1.
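As a quick sanity check on the two utilization formulas, take K = 4 GPUs and M = 32 micro-batches:

```python
K, M = 4, 32
print(1 / K)            # 0.25   -- naive model parallelism
print(M / (M + K - 1))  # ~0.914 -- pipeline parallelism
```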
Warning
Pipeline parallelism is not that common in practice, because the actual performance gain
depends heavily on whether the model can be partitioned evenly across the GPUs.
Data Parallelism
GPUs are categorized into server GPUs and worker GPUs.
The full model is copied to multiple worker GPUs, and a batch of data is partitioned evenly into chunks, with one chunk assigned to each worker GPU.
Each worker GPU performs the forward and backward passes independently on its assigned chunk.
The gradients are then pushed to the server GPUs, which aggregate all the gradients
and send the result back to all the worker GPUs.
Note that:
there can be multiple GPUs under a single server/worker
sometimes, besides aggregating the gradients, the server is also responsible for updating the parameters before sending them back to the worker GPUs
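PyTorch's single-process `torch.nn.DataParallel` follows roughly this pattern, with the default device playing the role of the server GPU. A minimal sketch, reusing the hypothetical names from the AMP example above and assuming four GPUs:

```python
net = make_model(in_size, out_size, num_layers).to("cuda:0")
dp_net = torch.nn.DataParallel(net, device_ids=[0, 1, 2, 3], output_device=0)
opt = torch.optim.SGD(net.parameters(), lr=0.001)

output = dp_net(input)                   # the batch is split across the worker GPUs
loss = loss_fn(output, target.to("cuda:0"))
loss.backward()                          # per-replica gradients are reduced onto cuda:0
opt.step()                               # parameters are updated on the "server" GPU
opt.zero_grad()
```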
Distributed Data Parallelism
ring-allreduce
reduce-scatter
Consider a topology of GPUs in a ring, where each GPU only communicates with the two GPUs that are adjacent to it.
all-gather
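The two phases can be illustrated with a tiny pure-Python simulation (N = 4 GPUs, each gradient vector split into N chunks); real implementations such as NCCL do this with overlapping asynchronous sends and receives:

```python
N = 4
# GPU i starts with gradient vector [i+1, i+1, i+1, i+1], split into N chunks of size 1.
grads = [[float(i + 1)] * N for i in range(N)]

# reduce-scatter: N-1 steps; in step s, GPU i sends chunk (i - s) % N to its right
# neighbour, which adds it to its own copy. Afterwards GPU i holds the fully
# summed chunk (i + 1) % N.
for s in range(N - 1):
    for i in range(N):
        c = (i - s) % N
        grads[(i + 1) % N][c] += grads[i][c]

# all-gather: N-1 more steps; each GPU forwards a completed chunk to its right
# neighbour, which overwrites its own copy. Afterwards every GPU has every chunk.
for s in range(N - 1):
    for i in range(N):
        c = (i + 1 - s) % N
        grads[(i + 1) % N][c] = grads[i][c]

print(grads)   # every GPU now holds [10.0, 10.0, 10.0, 10.0] (1 + 2 + 3 + 4)
```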
communication analysis
Denote by Φ the number of parameters in the model, and N the number of GPUs.
The number of gradients is the same as the number of parameters, i.e., Φ.
For a single GPU, the communication overhead for sending is

$$2 \cdot (N - 1) \cdot \frac{\Phi}{N} \approx 2\Phi,$$

since the reduce-scatter and all-gather phases each take N − 1 steps, and each step sends a chunk of Φ/N gradient values.
Seemingly this communication volume is on par with the case in DP,
but do note that the actual running time can be quite different. DDP achieves a much better balance across workers at any given time,
while DP puts the heavy load on the server GPUs. As more and more GPUs are spread across distant machines,
the communication cost in DP will increase significantly.
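In PyTorch, DDP is provided by `torch.nn.parallel.DistributedDataParallel`. Below is a minimal single-machine sketch (one process per GPU, launched for example with `torchrun --nproc_per_node=4 train.py`), reusing the hypothetical `make_model`, `loss_fn`, `data`, and `targets` from the AMP example above:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")     # one process per GPU
rank = dist.get_rank()
torch.cuda.set_device(rank)

net = make_model(in_size, out_size, num_layers).to(rank)
ddp_net = DDP(net, device_ids=[rank])
opt = torch.optim.SGD(ddp_net.parameters(), lr=0.001)

for input, target in zip(data, targets):    # each rank iterates over its own shard of the data
    output = ddp_net(input.to(rank))
    loss = loss_fn(output, target.to(rank))
    loss.backward()                         # gradients are averaged across ranks via allreduce (typically ring-allreduce)
    opt.step()
    opt.zero_grad()
```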
ZeRO
ZeRO eliminates memory redundancies by partitioning the optimizer states, gradients, and model parameters across the GPUs, so that each GPU holds only a shard of each instead of a full replica.
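PyTorch offers a ZeRO-style implementation through FSDP (fully sharded data parallel). A minimal sketch, assuming the same process-group setup and hypothetical model-building helpers as the DDP example above:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Parameters, gradients, and optimizer states are sharded across the
# data-parallel ranks instead of being fully replicated on every GPU.
net = make_model(in_size, out_size, num_layers).to(rank)
fsdp_net = FSDP(net)
opt = torch.optim.SGD(fsdp_net.parameters(), lr=0.001)
```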