```python
>>> a = torch.tensor([2048], dtype=torch.float16)
>>> b = torch.tensor([0.5], dtype=torch.float16)
>>> a + b
tensor([2048.], dtype=torch.float16)

>>> a = torch.tensor([2048], dtype=torch.float32)
>>> b = torch.tensor([0.5], dtype=torch.float32)
>>> a + b
tensor([2048.5000])
```
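The reason is that float16 has only a 10-bit mantissa, so around 2048 = 2^11 the gap between adjacent representable values is already 2, and an increment of 0.5 is simply rounded away. A quick check:

```python
import torch

eps = torch.finfo(torch.float16).eps   # machine epsilon of float16: 2**-10
print(eps)                             # 0.0009765625
print(2048 * eps)                      # 2.0 -- the spacing between float16 values near 2048
```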
using AMP in PyTorch
```python
use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        # Scales loss. Calls ``backward()`` on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()
        # ``scaler.step()`` first unscales the gradients of the optimizer's assigned parameters.
        # If these gradients do not contain ``inf``s or ``NaN``s, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)
        # Updates the scale for next iteration.
        scaler.update()
        opt.zero_grad()
```
Distributed Training
Model Parallelism
low GPU utilization
When dealing with a very large model that cannot fit on a single GPU,
we can partition the model across multiple GPUs, so that each GPU holds only a subset of the layers.
For a particular batch, the data is fed into GPU 0, which performs the forward pass over its layers
and passes the output to GPU 1; GPU 1 continues the forward pass, and so on.
The final GPU then performs the backward propagation, passes the gradients back to the previous GPU,
and the gradients flow all the way back to GPU 0.
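As a concrete illustration, here is a minimal sketch of naive model parallelism in PyTorch, assuming two GPUs and a toy two-part network (the class and layer sizes are placeholders, not from any particular library):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """A toy model split across two GPUs: part0 on cuda:0, part1 on cuda:1."""

    def __init__(self, in_size, hidden_size, out_size):
        super().__init__()
        self.part0 = nn.Sequential(nn.Linear(in_size, hidden_size), nn.ReLU()).to("cuda:0")
        self.part1 = nn.Linear(hidden_size, out_size).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))      # forward pass on GPU 0
        return self.part1(x.to("cuda:1"))   # activations move to GPU 1, which continues
```

During the backward pass, autograd routes the gradients back across the same device boundaries in reverse order.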
With this approach we can certainly train a larger model. However, note that at any given time,
only a single GPU is actually working, resulting in low GPU utilization.
To see this, suppose there are K GPUs, the forward pass takes $T_f$ time on each GPU,
and the backward propagation takes $T_b$ time on each GPU. The total available GPU time is
the number of GPUs multiplied by the wall-clock time per batch, i.e.,

$$K \cdot K \cdot (T_f + T_b),$$

while the actual GPU running time is

$$K \cdot (T_f + T_b).$$

The utilization is then

$$\frac{K \cdot (T_f + T_b)}{K \cdot K \cdot (T_f + T_b)} = \frac{1}{K}.$$

As K increases, i.e., as more GPUs are added, the utilization 1/K decreases towards 0.
redundant memory usage from intermediate results
To compute gradients during backward propagation, each GPU must keep the intermediate results (activations) from its forward pass in GPU memory.
Suppose the batch size is B, the number of layers is L, and each layer's width is dominated by d;
then for each GPU, the extra memory cost is

$$O\!\left(B \cdot \frac{L}{K} \cdot d\right).$$

As the model size and batch size grow, this extra memory cost may become significant.
Pipeline Parallelism
To mitigate the above issues with model parallelism, Google introduced GPipe,
which uses pipeline parallelism to improve GPU utilization and reduce memory usage.
The main idea of pipeline parallelism is essentially to introduce data parallelism on top of model parallelism.
During the forward pass, the pipeline divides each mini-batch of size B into M smaller micro-batches,
which are pipelined through K GPUs.
During the backward pass, gradients for each micro-batch are computed based on the same model parameters used for the forward pass.
At the end of each mini-batch, gradients from all M micro-batches are accumulated and applied to update the
model parameters across all accelerators.
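The gradient-accumulation half of this scheme can be sketched in plain PyTorch (leaving out the device-to-device pipelining itself), reusing the hypothetical `net`, `loss_fn`, `opt`, `input`, and `target` from the AMP example above:

```python
M = 4                        # number of micro-batches per mini-batch
opt.zero_grad()
for micro_input, micro_target in zip(input.chunk(M), target.chunk(M)):
    loss = loss_fn(net(micro_input), micro_target)
    (loss / M).backward()    # gradients from all M micro-batches accumulate in .grad
opt.step()                   # a single parameter update per mini-batch
```

Dividing the loss by M assumes the loss is an average over examples, so the accumulated gradient matches the gradient of the full mini-batch.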
If batch normalization is used in the network, the sufficient statistics of inputs during training
are computed over each micro-batch and over replicas if necessary.
GPipe also tracks the moving average of the sufficient statistics over the entire mini-batch, to be used during evaluation.
Layer normalization is not affected.
The utilization of pipeline parallelism is given by

$$\frac{K \cdot M \cdot (T_f + T_b)}{K \cdot (M + K - 1) \cdot (T_f + T_b)} = \frac{M}{M + K - 1}.$$

When M is noticeably larger than K, the utilization is close to 1.
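As a quick sanity check on the two utilization formulas, take K = 4 GPUs and M = 32 micro-batches:

```python
K, M = 4, 32
print(1 / K)            # 0.25   -- naive model parallelism
print(M / (M + K - 1))  # ~0.914 -- pipeline parallelism
```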
Warning
Pipeline parallelism is not that common in practice, because the actual performance gain
depends heavily on whether the model can be partitioned evenly across the GPUs.
Data Parallelism
GPUs are categorized into server GPUs and worker GPUs.
The full model is copied to multiple worker GPUs, and a batch of data is partitioned evenly into chunks, with one chunk assigned to each worker GPU.
Each worker GPU performs the forward and backward passes independently on its assigned chunk.
The gradients are then pushed to the server GPUs, which aggregate all the gradients
and send the result back to all the worker GPUs.
Note that:
there can be multiple GPUs under a single server/worker
sometimes, besides aggregating the gradients, the server is also responsible for updating the parameters before sending them back to the worker GPUs
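PyTorch's single-process `torch.nn.DataParallel` follows roughly this pattern, with the default device playing the role of the server GPU. A minimal sketch, reusing the hypothetical names from the AMP example above and assuming four GPUs:

```python
net = make_model(in_size, out_size, num_layers).to("cuda:0")
dp_net = torch.nn.DataParallel(net, device_ids=[0, 1, 2, 3], output_device=0)
opt = torch.optim.SGD(net.parameters(), lr=0.001)

output = dp_net(input)                   # the batch is split across the worker GPUs
loss = loss_fn(output, target.to("cuda:0"))
loss.backward()                          # per-replica gradients are reduced onto cuda:0
opt.step()                               # parameters are updated on the "server" GPU
opt.zero_grad()
```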
Distributed Data Parallelism
ring-allreduce
reduce-scatter
Consider a topology of GPUs in a ring, where each GPU only communicates with the two GPUs that are adjacent to it.
all-gather
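The two phases can be illustrated with a tiny pure-Python simulation (N = 4 GPUs, each gradient vector split into N chunks); real implementations such as NCCL do this with overlapping asynchronous sends and receives:

```python
N = 4
# GPU i starts with gradient vector [i+1, i+1, i+1, i+1], split into N chunks of size 1.
grads = [[float(i + 1)] * N for i in range(N)]

# reduce-scatter: N-1 steps; in step s, GPU i sends chunk (i - s) % N to its right
# neighbour, which adds it to its own copy. Afterwards GPU i holds the fully
# summed chunk (i + 1) % N.
for s in range(N - 1):
    for i in range(N):
        c = (i - s) % N
        grads[(i + 1) % N][c] += grads[i][c]

# all-gather: N-1 more steps; each GPU forwards a completed chunk to its right
# neighbour, which overwrites its own copy. Afterwards every GPU has every chunk.
for s in range(N - 1):
    for i in range(N):
        c = (i + 1 - s) % N
        grads[(i + 1) % N][c] = grads[i][c]

print(grads)   # every GPU now holds [10.0, 10.0, 10.0, 10.0] (1 + 2 + 3 + 4)
```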
communication analysis
Denote by Φ the number of parameters in the model, and N the number of GPUs.
The number of gradients is the same as the number of parameters, i.e., Φ.
For a single GPU, the communication overhead for sending is

$$2 \cdot (N - 1) \cdot \frac{\Phi}{N} \approx 2\Phi,$$

since the reduce-scatter and all-gather phases each take N − 1 steps, and each step sends a chunk of Φ/N gradient values.
Seemingly this communication volume is on par with the case in DP,
but do note that the actual running time can be quite different. DDP achieves a much better balance across workers at any given time,
while DP puts the heavy load on the server GPUs. As more and more GPUs are spread across distant machines,
the communication cost in DP will increase significantly.
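In PyTorch, DDP is provided by `torch.nn.parallel.DistributedDataParallel`. Below is a minimal single-machine sketch (one process per GPU, launched for example with `torchrun --nproc_per_node=4 train.py`), reusing the hypothetical `make_model`, `loss_fn`, `data`, and `targets` from the AMP example above:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")     # one process per GPU
rank = dist.get_rank()
torch.cuda.set_device(rank)

net = make_model(in_size, out_size, num_layers).to(rank)
ddp_net = DDP(net, device_ids=[rank])
opt = torch.optim.SGD(ddp_net.parameters(), lr=0.001)

for input, target in zip(data, targets):    # each rank iterates over its own shard of the data
    output = ddp_net(input.to(rank))
    loss = loss_fn(output, target.to(rank))
    loss.backward()                         # gradients are averaged across ranks via allreduce (typically ring-allreduce)
    opt.step()
    opt.zero_grad()
```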
ZeRO
ZeRO eliminates memory redundancies by partitioning the optimizer states, gradients, and model parameters across the GPUs, so that each GPU holds only a shard of each instead of a full replica.
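PyTorch offers a ZeRO-style implementation through FSDP (fully sharded data parallel). A minimal sketch, assuming the same process-group setup and hypothetical model-building helpers as the DDP example above:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Parameters, gradients, and optimizer states are sharded across the
# data-parallel ranks instead of being fully replicated on every GPU.
net = make_model(in_size, out_size, num_layers).to(rank)
fsdp_net = FSDP(net)
opt = torch.optim.SGD(fsdp_net.parameters(), lr=0.001)
```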