PyTorch

Useful Snippets

torch.no_grad() and model.eval()

In PyTorch, torch.no_grad() and model.eval() are both used when evaluating a model, but they serve different purposes:

torch.no_grad():

Function: This is a context manager that disables gradient calculation.
Purpose: It is used to reduce memory consumption and speed up computations during inference, as gradients are not needed for evaluation.
How it works: Inside the with torch.no_grad() block, PyTorch will not track operations for backpropagation, which saves memory and computation time.
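
A quick way to see the effect is to compare a result computed inside and outside the block. This is a minimal sketch; the tensor names are illustrative and not part of the notes above.

```python
import torch

w = torch.randn(3, requires_grad=True)

# Outside no_grad: the result is attached to the autograd graph.
y_tracked = (w * 2).sum()

# Inside no_grad: no graph is recorded, so the result cannot be backpropagated.
with torch.no_grad():
    y_untracked = (w * 2).sum()

print(y_tracked.requires_grad)    # True
print(y_untracked.requires_grad)  # False
print(y_untracked.grad_fn)        # None
```

Note that torch.no_grad() only affects autograd; it does not change how layers such as dropout behave.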

model.eval():

Function: This method sets the model to evaluation mode.
Purpose: It changes the behavior of certain layers that behave differently during training and evaluation, such as dropout and batch normalization.
How it works:
- Dropout: During training, dropout randomly deactivates neurons to prevent overfitting. In evaluation mode, dropout is turned off, allowing all neurons to participate.
- Batch Normalization: During training, batch normalization uses batch statistics to normalize the activations. In evaluation mode, it uses running statistics calculated during training.
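
The two behaviours described above can be checked on standalone layers. The sketch below is illustrative only: the layer sizes, seed, and variable names are assumptions chosen for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Dropout: active in training mode, identity in evaluation mode.
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # some entries zeroed, survivors scaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)   # returned unchanged

# BatchNorm: running statistics are updated in training mode only.
bn = nn.BatchNorm1d(4)
batch = torch.randn(8, 4) * 3 + 5

bn.train()
bn(batch)  # normalizes with batch statistics and updates the running statistics
mean_after_train = bn.running_mean.clone()

bn.eval()
bn(batch)  # normalizes with the running statistics and leaves them untouched
mean_after_eval = bn.running_mean.clone()

print(torch.equal(eval_out, x))                         # True
print(torch.equal(mean_after_train, mean_after_eval))   # True
```

model.eval() by itself does not turn off gradient tracking, which is why it is usually paired with torch.no_grad().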

when to use each

torch.no_grad():

Use this whenever you are performing inference and do not need to calculate gradients.

model.eval():

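Use this before validation, testing, or inference whenever the model contains layers such as dropout or batch normalization; otherwise the reported metrics will not reflect the model’s real behavior. Call model.train() to switch back before resuming training.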

best practice

Combine both: It is common to use both model.eval() and torch.no_grad() together during evaluation:

model.eval()
with torch.no_grad():
    # Perform evaluation
    ...
 
# OR use `torch.no_grad()` as a decorator
@torch.no_grad()
def some_method(self):
    ...
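
Putting the two together, an evaluation helper might look like the sketch below. The function name `evaluate`, the toy model, and the random data are all hypothetical; restoring the previous mode afterwards is a design choice, not something PyTorch requires.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def evaluate(model: nn.Module, inputs: torch.Tensor, targets: torch.Tensor) -> float:
    """Return classification accuracy (hypothetical helper for illustration)."""
    was_training = model.training
    model.eval()  # disable dropout, use running batch-norm statistics
    try:
        preds = model(inputs).argmax(dim=1)
        return (preds == targets).float().mean().item()
    finally:
        model.train(was_training)  # restore whatever mode the caller had


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 3))
inputs = torch.randn(16, 4)
targets = torch.randint(0, 3, (16,))
accuracy = evaluate(model, inputs, targets)
print(accuracy)
```

Because the decorator wraps the whole function, no gradients are recorded for any forward pass made inside it.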

register_buffer

Tip

If your model has tensors that should be saved and restored in the state_dict but not trained by the optimizer, register them as buffers.

Buffers are not returned by model.parameters(), so the optimizer never gets a chance to update them. In addition, buffers, like parameters, are moved together with the parent model when you call a device method such as model.cuda() or model.to(device).

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.my_tensor = torch.randn(1)                    # plain attribute: not tracked by the module
        self.register_buffer('my_buffer', torch.randn(1))  # saved in state_dict, not trained
        self.my_param = nn.Parameter(torch.randn(1))       # saved in state_dict and trained
 
model = MyModel()
print(model.my_tensor)
# tensor([-1.4624])
print(model.state_dict())
# OrderedDict([('my_param', tensor([-1.7173])), ('my_buffer', tensor([0.7523]))])
 
model.cuda()  # requires a CUDA-capable GPU
print(model.my_tensor)
# tensor([-1.4624])
print(model.state_dict())
# OrderedDict([('my_param', tensor([-1.7173], device='cuda:0')), ('my_buffer', tensor([0.7523], device='cuda:0'))])

As shown above in the console output, model.my_tensor is still on the CPU, where it was created, while all parameters and buffers were pushed to the GPU after calling model.cuda().
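
The same distinction can be checked without a GPU. The sketch below uses a hypothetical `BufferDemo` module and relies on model.double(), which, like model.cuda(), is applied to parameters and buffers only. It also shows the `persistent=False` option, which keeps a buffer out of the state_dict.

```python
import torch
import torch.nn as nn


class BufferDemo(nn.Module):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.my_tensor = torch.zeros(1)                                     # plain attribute
        self.register_buffer("my_buffer", torch.zeros(1))                   # saved, not trained
        self.register_buffer("scratch", torch.zeros(1), persistent=False)   # neither saved nor trained
        self.my_param = nn.Parameter(torch.zeros(1))                        # saved and trained


model = BufferDemo()
param_names = [name for name, _ in model.named_parameters()]
buffer_names = [name for name, _ in model.named_buffers()]
state_keys = list(model.state_dict().keys())

print(param_names)   # ['my_param']
print(buffer_names)  # ['my_buffer', 'scratch']
print(state_keys)    # ['my_param', 'my_buffer']

model.double()  # converts parameters and buffers, but not plain tensor attributes
print(model.my_buffer.dtype)  # torch.float64
print(model.my_tensor.dtype)  # torch.float32
```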

automatic mixed precision

# torch.cuda.amp.autocast() is the older spelling, deprecated in recent releases
with torch.autocast(device_type='cuda'):
    outputs = model(inputs)  # Operations here use mixed precision
    loss = compute_loss(outputs)

Automatic mixed precision (AMP) is a PyTorch feature that speeds up training by running eligible operations in a lower-precision dtype (e.g., float16 or bfloat16) while keeping the rest in float32.

This reduces memory usage and improves computational speed, particularly on GPUs, usually without significantly affecting model accuracy.

Key Points:

Precision Types: AMP uses mixed precision, typically combining float16 (for faster computation and less memory) and float32 (for operations where reduced precision could lead to errors).
Usage: It is often enabled through a context manager, torch.amp.autocast, which scopes the precision of computations within the block.
Benefits: AMP can lead to faster training times and reduced memory usage on supported hardware like NVIDIA GPUs.

Example:

from contextlib import nullcontext

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# 'float32', 'bfloat16', or 'float16'; float16 additionally needs a GradScaler in the training loop
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
# No autocast on CPU; on GPU, wrap the forward pass in `with ctx:`
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

Functions

einsum

"""
https://pytorch.org/docs/stable/generated/torch.einsum.html
"""
>>> # trace
>>> torch.einsum('ii', torch.randn(4, 4))
tensor(-1.2104)
 
>>> # diagonal
>>> torch.einsum('ii->i', torch.randn(4, 4))
tensor([-0.1034,  0.7952, -0.2433,  0.4545])
 
>>> # outer product
>>> x = torch.randn(5)
>>> y = torch.randn(4)
>>> torch.einsum('i,j->ij', x, y)
tensor([[ 0.1156, -0.2897, -0.3918,  0.4963],
        [-0.3744,  0.9381,  1.2685, -1.6070],
        [ 0.7208, -1.8058, -2.4419,  3.0936],
        [ 0.1713, -0.4291, -0.5802,  0.7350],
        [ 0.5704, -1.4290, -1.9323,  2.4480]])
 
>>> # batch matrix multiplication
>>> As = torch.randn(3, 2, 5)
>>> Bs = torch.randn(3, 5, 4)
>>> torch.einsum('bij,bjk->bik', As, Bs)
tensor([[[-1.0564, -1.5904,  3.2023,  3.1271],
        [-1.6706, -0.8097, -0.8025, -2.1183]],
 
        [[ 4.2239,  0.3107, -0.5756, -0.2354],
        [-1.4558, -0.3460,  1.5087, -0.8530]],
 
        [[ 2.8153,  1.8787, -4.3839, -1.2112],
        [ 0.3728, -2.1131,  0.0921,  0.8305]]])
 
>>> # with sublist format and ellipsis
>>> torch.einsum(As, [..., 0, 1], Bs, [..., 1, 2], [..., 0, 2])
tensor([[[-1.0564, -1.5904,  3.2023,  3.1271],
        [-1.6706, -0.8097, -0.8025, -2.1183]],
 
        [[ 4.2239,  0.3107, -0.5756, -0.2354],
        [-1.4558, -0.3460,  1.5087, -0.8530]],
 
        [[ 2.8153,  1.8787, -4.3839, -1.2112],
        [ 0.3728, -2.1131,  0.0921,  0.8305]]])
 
>>> # batch permute
>>> A = torch.randn(2, 3, 4, 5)
>>> torch.einsum('...ij->...ji', A).shape
torch.Size([2, 3, 5, 4])
 
>>> # equivalent to torch.nn.functional.bilinear
>>> A = torch.randn(3, 5, 4)
>>> l = torch.randn(2, 5)
>>> r = torch.randn(2, 4)
>>> torch.einsum('bn,anm,bm->ba', l, A, r)
tensor([[-0.3430, -5.2405,  0.4494],
        [ 0.3311,  5.5201, -3.0356]])
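
As a sanity check, each einsum expression above can be compared against the dedicated PyTorch function it corresponds to. This is a small sketch with arbitrary random inputs; the tolerances are assumptions chosen to absorb float32 rounding.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
M = torch.randn(4, 4)
x, y = torch.randn(5), torch.randn(4)
As, Bs = torch.randn(3, 2, 5), torch.randn(3, 5, 4)
A, l, r = torch.randn(3, 5, 4), torch.randn(2, 5), torch.randn(2, 4)

checks = {
    "trace": torch.allclose(torch.einsum("ii", M), torch.trace(M), atol=1e-6),
    "diagonal": torch.equal(torch.einsum("ii->i", M), torch.diagonal(M)),
    "outer": torch.allclose(torch.einsum("i,j->ij", x, y), torch.outer(x, y), atol=1e-6),
    "bmm": torch.allclose(torch.einsum("bij,bjk->bik", As, Bs), torch.bmm(As, Bs), atol=1e-5),
    "permute": torch.equal(torch.einsum("...ij->...ji", As), As.transpose(-1, -2)),
    "bilinear": torch.allclose(
        torch.einsum("bn,anm,bm->ba", l, A, r), F.bilinear(l, r, A), atol=1e-4
    ),
}
print(checks)
```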

Misc.

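torch.nn.Linear stores its weight with shape (out_features, in_features), i.e. transposed relative to the order of the constructor arguments, and computes y = x @ W.T + b. torch.nn.Embedding, by contrast, stores its weight as (num_embeddings, embedding_dim), matching the argument order.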

a = torch.nn.Linear(4,5)
b = torch.nn.Embedding(4,5)
print(a)
print(b)
print(a.weight.shape)
print(b.weight.shape)
 
# yields output
# Linear(in_features=4, out_features=5, bias=True)
# Embedding(4, 5)
# torch.Size([5, 4])
# torch.Size([4, 5])
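
The sketch below checks both conventions numerically: the linear layer's output against a manual `x @ W.T + b`, and the embedding's output against plain row indexing. The input values are arbitrary.

```python
import torch

torch.manual_seed(0)
linear = torch.nn.Linear(4, 5)
embedding = torch.nn.Embedding(4, 5)

x = torch.randn(2, 4)
manual_linear = x @ linear.weight.T + linear.bias  # (2, 4) @ (4, 5) + (5,)

idx = torch.tensor([0, 3])
manual_lookup = embedding.weight[idx]  # an embedding lookup is row indexing

print(torch.allclose(linear(x), manual_linear, atol=1e-6))  # True
print(torch.equal(embedding(idx), manual_lookup))           # True
```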

📚 Jiaqi's Knowledge Repository
