Automatic Mixed Precision

advantages of low precision numbers

  • speeds up calculations
  • decreases memory footprint
  • decreases data transfer bandwidth requirements

disadvantages of low precision numbers

  • smaller range of representation for numbers
  • swamping (small values being absorbed when added to much larger ones) and rounding errors

float16 vs float32

>>> a = torch.tensor([2048], dtype=torch.float16)
>>> b = torch.tensor([0.5], dtype=torch.float16)
>>> a + b
tensor([2048.], dtype=torch.float16)
>>> a = torch.tensor([2048], dtype=torch.float32)
>>> b = torch.tensor([0.5], dtype=torch.float32)
>>> a + b
tensor([2048.5000])
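
The 0.5 is swamped because the gap between adjacent float16 values around 2048 is 2. The limited range causes the other failure modes: values above roughly 65504 overflow to inf, and very small products underflow to zero, which is exactly what the gradient scaling below guards against. A quick sketch:

>>> torch.tensor([70000.], dtype=torch.float16)    # above the float16 maximum (~65504)
tensor([inf], dtype=torch.float16)
>>> g = torch.tensor([1e-5], dtype=torch.float16)  # a small gradient-sized value
>>> g * 1e-3                                       # underflows to zero in float16
tensor([0.], dtype=torch.float16)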

using AMP in PyTorch

import torch

use_amp = True
device = "cuda"   # device type used by autocast below

# make_model, loss_fn, data, targets, epochs, etc. are assumed to be defined elsewhere
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
 
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
 
        # Scales loss. Calls ``backward()`` on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()
 
        # ``scaler.step()`` first unscales the gradients of the optimizer's assigned parameters.
        # If these gradients do not contain ``inf``s or ``NaN``s, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)
        
        # Updates the scale for next iteration.
        scaler.update()
        opt.zero_grad()
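
As a side note, autocast can also be used on its own for inference or validation; the GradScaler is only needed when gradients are computed. A minimal sketch, assuming a val_input tensor:

with torch.no_grad(), torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
    val_output = net(val_input)   # forward pass runs in float16 where it is safe to do so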

Distributed Training

Model Parallelism

low GPU utilization

When dealing with a very large model that cannot fit on a single GPU, we can partition the model across multiple GPUs, say $K$ of them indexed $0, 1, \dots, K-1$, so that each GPU only holds a few of the layers.

For a particular batch, the data is fed into GPU $0$, which performs its part of the forward pass and passes the activations to GPU $1$, which continues the forward pass, and so on. After the final GPU, GPU $K-1$, finishes the forward pass, it performs the backward propagation, passes the gradients back to GPU $K-2$, and the gradients flow all the way back to GPU $0$.

With this approach we can certainly train a larger model. However, note that at any given time, only a single GPU is actually working, resulting in low GPU utilization.
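
As a concrete illustration, here is a minimal sketch of this kind of layer partitioning across two GPUs (the layer sizes are made up for illustration):

class TwoGPUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # the first part of the network lives on GPU 0, the rest on GPU 1
        self.part0 = torch.nn.Sequential(
            torch.nn.Linear(1024, 1024),
            torch.nn.ReLU(),
        ).to("cuda:0")
        self.part1 = torch.nn.Linear(1024, 10).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))
        # the intermediate activations are copied from GPU 0 to GPU 1 here
        return self.part1(x.to("cuda:1"))

While part1 is computing, part0 sits idle (and vice versa), which is exactly the low-utilization problem quantified below.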

To see this, suppose there are $K$ GPUs, the forward pass takes time $t_f$ on each GPU, and the backward propagation takes time $t_b$ on each GPU. Since only one GPU works at a time, processing one batch takes $K(t_f + t_b)$ of wall-clock time, so the total available GPU time is

$$K \cdot K(t_f + t_b) = K^2 (t_f + t_b),$$

while the actual GPU running time is

$$K(t_f + t_b).$$

The utilization is then

$$\frac{K(t_f + t_b)}{K^2(t_f + t_b)} = \frac{1}{K}.$$

When $K$ gets larger, i.e., the number of GPUs increases, the utilization decreases towards 0 (with $K = 4$, only 25% of the available GPU time does useful work).

redundant memory usage from intermediate results

During backward propagation, the gradients of the intermediate results from the forward pass are computed and stored in GPU memory. Suppose the batch size is $N$, the number of layers is $L$, and each layer has a dimension dominated by $d$; then for each GPU, which holds $L/K$ layers, the extra memory cost is $O\!\left(N \times \frac{L}{K} \times d\right)$. As the model size and batch size grow, this extra memory cost may become significant.

Pipeline Parallelism

To mitigate the above issues with model parallelism, Google introduced GPipe, which uses pipeline parallelism to improve GPU utilization and reduce memory usage.

The main idea of pipeline parallelism is essentially to introduce data parallelism on top of model parallelism. During the forward pass, the pipeline divides each mini-batch of size $N$ into $M$ smaller micro-batches, which are pipelined through the $K$ GPUs. During the backward pass, gradients for each micro-batch are computed based on the same model parameters used for the forward pass. At the end of each mini-batch, gradients from all micro-batches are accumulated and applied to update the model parameters across all accelerators.
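
The scheduling of micro-batches across GPUs is handled by the pipeline framework itself, but the micro-batching and gradient accumulation can be sketched in plain PyTorch (reusing net, opt, loss_fn, data and targets from the AMP snippet above; M = 4 is an arbitrary illustrative choice):

M = 4  # number of micro-batches per mini-batch

for input, target in zip(data, targets):
    opt.zero_grad()
    # split the mini-batch of size N into M micro-batches
    for micro_in, micro_tgt in zip(input.chunk(M), target.chunk(M)):
        output = net(micro_in)
        loss = loss_fn(output, micro_tgt) / M   # average the loss over micro-batches
        loss.backward()                          # gradients accumulate in .grad
    opt.step()                                   # one parameter update per mini-batch

Because gradients from all M micro-batches are accumulated before opt.step(), the update is mathematically the same as one step on the full mini-batch.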

If batch normalization is used in the network, the sufficient statistics of inputs during training are computed over each micro-batch and over replicas if necessary. GPipe also tracks the moving average of the sufficient statistics over the entire mini-batch for use during evaluation. Layer normalization is not affected.

The utilization of pipeline parallelism is given by

$$\frac{K(t_f + t_b)}{K(M + K - 1)\,\dfrac{t_f + t_b}{M}} = \frac{M}{M + K - 1},$$

since each micro-batch takes $(t_f + t_b)/M$ on a GPU and the pipeline bubble adds $K - 1$ extra micro-batch slots to the wall-clock time. When $M$ is noticeably larger than $K$, the utilization is close to 1 (e.g., $K = 4$ and $M = 32$ give $32/35 \approx 91\%$).

(figure: pipeline parallelism with micro-batches)

Warning

Pipeline parallelism is not that common in practice because the actual performance gain depends heavily on whether the model can be partitioned evenly across the GPUs.

Data Parallelism

GPUs are categorized into server GPUs and worker GPUs. The full model is copied to each worker GPU, and a batch of data is partitioned evenly into chunks, one per worker GPU. Each worker GPU performs the forward and backward propagation individually on its assigned chunk. The gradients are then pushed to the server GPUs, which aggregate all the gradients and pass the result back to all the worker GPUs. A toy sketch of this pattern is given after the notes below.

Note that:

  1. there can be multiple GPUs under a single server/worker
  2. sometimes, besides aggregating the gradients, the server is also responsible for updating the parameters before sending them back to the worker GPUs
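
A toy single-process sketch of the server/worker pattern, reusing net, opt, loss_fn, data and targets from the AMP snippet above (real implementations such as torch.nn.DataParallel or a parameter-server framework do this across actual devices):

import copy

num_workers = 2  # worker GPUs (here simulated in one process)
workers = [copy.deepcopy(net) for _ in range(num_workers)]  # full model copy per worker

for input, target in zip(data, targets):
    # partition the batch evenly and let each worker compute its own gradients
    grads = []
    for w, chunk_in, chunk_tgt in zip(workers, input.chunk(num_workers), target.chunk(num_workers)):
        w.zero_grad()
        loss_fn(w(chunk_in), chunk_tgt).backward()
        grads.append([p.grad for p in w.parameters()])

    # "server": average the gradients and update one canonical set of parameters
    for p, *worker_grads in zip(net.parameters(), *grads):
        p.grad = torch.stack(worker_grads).mean(dim=0)
    opt.step()
    opt.zero_grad()

    # push the updated parameters back to every worker
    for w in workers:
        w.load_state_dict(net.state_dict())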

Distributed Data Parallelism

ring-allreduce

reduce-scatter

Consider a topology of $K$ GPUs in a ring, where each GPU only communicates with the two GPUs that are adjacent to it.

(figure: ring reduce-scatter)

all-gather

(figure: ring all-gather)
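
The two phases can be simulated in plain Python to make the data movement concrete. In this toy version each "GPU" stores its gradient vector as $K$ chunks and, at every step, only passes a single chunk to its right-hand neighbour; after the reduce-scatter phase each GPU owns one fully summed chunk, and the all-gather phase circulates those completed chunks until every GPU has the full result:

import torch

K = 4  # number of GPUs in the ring (illustrative)

# each "GPU" k starts with its own gradient vector, stored as K chunks
grads = [list(torch.arange(8, dtype=torch.float32).add(k).chunk(K)) for k in range(K)]

# reduce-scatter: in each of the K-1 steps, GPU k sends one chunk to its right
# neighbour, which adds it to its own copy; afterwards GPU k holds the fully
# summed values of chunk (k + 1) % K
for step in range(K - 1):
    for k in range(K):
        dst = (k + 1) % K
        c = (k - step) % K                 # chunk being passed along the ring
        grads[dst][c] = grads[dst][c] + grads[k][c]

# all-gather: each GPU forwards the chunk it has just completed for another
# K-1 steps, so every GPU ends up with the full reduced vector
for step in range(K - 1):
    for k in range(K):
        dst = (k + 1) % K
        c = (k + 1 - step) % K
        grads[dst][c] = grads[k][c]

print(torch.cat(grads[0]))  # tensor([ 6., 10., 14., 18., 22., 26., 30., 34.])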

communication analysis

Denote by $\Psi$ the number of parameters in the model, and by $K$ the number of GPUs. The number of gradients is the same as the number of parameters, i.e., $\Psi$.

For a single GPU, the communication overhead for sending is

$$\underbrace{(K-1)\,\frac{\Psi}{K}}_{\text{reduce-scatter}} + \underbrace{(K-1)\,\frac{\Psi}{K}}_{\text{all-gather}} = 2(K-1)\,\frac{\Psi}{K} \approx 2\Psi.$$

Hence the total communication overhead across all $K$ GPUs is $2(K-1)\Psi \approx 2K\Psi$.

Info

Seemingly the communication volume is on par with the case in DP, but do note that the actual running time can be quite different. DDP achieves a much better balance across workers at any given time, while DP puts the heavy load on the server GPUs. When more and more GPUs are distributed across distant machines, the communication cost in DP increases significantly.

ZeRO

ZeRO eliminates the memory redundancy of data parallelism by partitioning the optimizer states, gradients, and model parameters across the GPUs, so that each GPU stores only a slice of each instead of a full replica.
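
For the optimizer-state part of this idea (roughly ZeRO stage 1), PyTorch provides torch.distributed.optim.ZeroRedundancyOptimizer, which shards the optimizer states across the DDP ranks. A minimal sketch, assuming torch.distributed is already initialized and ddp_model is a DistributedDataParallel-wrapped model:

from torch.distributed.optim import ZeroRedundancyOptimizer

# each rank keeps only the optimizer states (e.g. momentum buffers) for its own
# shard of the parameters instead of a full replica
opt = ZeroRedundancyOptimizer(
    ddp_model.parameters(),
    optimizer_class=torch.optim.SGD,
    lr=0.001,
    momentum=0.9,
)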
