PyTorch
Useful Snippets
torch.no_grad() and model.eval()
In PyTorch, torch.no_grad() and model.eval() are both used when evaluating a model, but they serve different purposes:
torch.no_grad():
- Function: This is a context manager that disables gradient calculation.
- Purpose: It is used to reduce memory consumption and speed up computations during inference, as gradients are not needed for evaluation.
- How it works: Inside a with torch.no_grad() block, PyTorch will not track operations for backpropagation, which saves memory and computation time.
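A minimal sketch of the effect (the tensors are illustrative):

```python
import torch

x = torch.randn(3, requires_grad=True)

# Outside the block, operations on x are tracked for backprop.
y = x * 2
print(y.requires_grad)  # True

# Inside the block, no autograd graph is built.
with torch.no_grad():
    z = x * 2
print(z.requires_grad)  # False
```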
model.eval():
- Function: This method sets the model to evaluation mode.
- Purpose: It changes the behavior of certain layers that behave differently during training and evaluation, such as dropout and batch normalization.
- How it works:
- Dropout: During training, dropout randomly deactivates neurons to prevent overfitting. In evaluation mode, dropout is turned off, allowing all neurons to participate.
- Batch Normalization: During training, batch normalization uses batch statistics to normalize the activations. In evaluation mode, it uses running statistics calculated during training.
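A small sketch of the dropout behavior (the toy model is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(1, 4)

model.train()
# In training mode, dropout randomly zeroes activations,
# so repeated forward passes on the same input generally differ.
out_a = model(x)
out_b = model(x)

model.eval()
# In evaluation mode, dropout is a no-op, so outputs are deterministic.
out_c = model(x)
out_d = model(x)
print(torch.equal(out_c, out_d))  # True
```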
when to use each
- torch.no_grad(): Use this whenever you are performing inference and do not need to calculate gradients.
- model.eval(): Use this to put your model in evaluation mode, which is essential when evaluating your model’s performance.
best practice
Combine both: It is common to use both model.eval() and torch.no_grad() together during evaluation:
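A minimal evaluation loop along these lines (the toy model, data, and accuracy metric are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy classifier and validation data, purely for illustration.
model = nn.Sequential(nn.Linear(10, 2), nn.Dropout(p=0.5))
val_data = TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,)))
val_loader = DataLoader(val_data, batch_size=8)

model.eval()               # dropout/batchnorm switch to eval behavior
with torch.no_grad():      # no autograd graph -> less memory, faster
    correct = 0
    for inputs, targets in val_loader:
        outputs = model(inputs)
        correct += (outputs.argmax(dim=1) == targets).sum().item()
print(f"accuracy: {correct / len(val_data):.2f}")
```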
register_buffer
Buffers won’t be returned by model.parameters(), so the optimizer never gets a chance to update them. Unlike plain tensor attributes, though, all buffers (along with all parameters) are pushed to the target device when .cuda() or .to() is called on the parent model.
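A minimal sketch (MyModel and the attribute names are illustrative; the .cuda() call assumes a CUDA device is available):

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.my_tensor = torch.randn(1)                     # plain attribute
        self.register_buffer("my_buffer", torch.randn(1))   # registered buffer
        self.my_param = nn.Parameter(torch.randn(1))        # parameter

model = MyModel()
# Only the parameter is visible to the optimizer.
print([name for name, _ in model.named_parameters()])  # ['my_param']

model.cuda()  # requires a CUDA device
print(model.my_tensor.device)  # cpu   -- the plain attribute is left behind
print(model.my_buffer.device)  # cuda:0
print(model.my_param.device)   # cuda:0
```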
As shown in the console output above, model.my_tensor is still on the CPU, where it was created, while all parameters and buffers were pushed to the GPU after calling model.cuda().
automatic mixed precision
AMP is a feature in PyTorch designed to optimize training performance by using lower precision (e.g., float16) where possible. This reduces memory usage and improves computational speed, particularly on GPUs, without significantly affecting model accuracy.
Key Points:
- Precision Types: AMP uses mixed precision, typically combining float16 (for faster computation and less memory) and float32 (for operations where reduced precision could lead to errors).
- Usage: It is often enabled through a context manager, torch.amp.autocast, which scopes the precision of computations within the block.
- Benefits: AMP can lead to faster training times and reduced memory usage on supported hardware like NVIDIA GPUs.
Example:
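A minimal training-loop sketch (assumes a CUDA device and a recent PyTorch with the torch.amp.GradScaler entry point; the toy model and data are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.amp.GradScaler("cuda")   # scales the loss to avoid float16 gradient underflow

for _ in range(10):
    inputs = torch.randn(8, 10, device="cuda")
    targets = torch.randn(8, 1, device="cuda")

    optimizer.zero_grad()
    with torch.amp.autocast("cuda"):    # ops inside run in float16 where it is safe
        loss = nn.functional.mse_loss(model(inputs), targets)

    scaler.scale(loss).backward()       # backward on the scaled loss
    scaler.step(optimizer)              # unscales gradients, then steps
    scaler.update()                     # adjusts the scale factor for the next step
```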
Functions
einsum
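A few common einsum patterns, as a sketch (the tensors are illustrative):

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(4, 5)
v = torch.randn(3)

# Matrix multiplication: the shared index j is summed over.
mm = torch.einsum("ij,jk->ik", a, b)      # same as a @ b

# Transpose: permute the output indices.
t = torch.einsum("ij->ji", a)             # same as a.T

# Inner product: a repeated index with no output index is summed.
dot = torch.einsum("i,i->", v, v)         # same as v @ v

# Batch matrix multiplication over a leading batch dimension.
x = torch.randn(2, 3, 4)
y = torch.randn(2, 4, 5)
bmm = torch.einsum("bij,bjk->bik", x, y)  # same as torch.bmm(x, y)
```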
Misc.
The weight matrix for torch.nn.Linear is stored in transposed form, with shape (out_features, in_features), for computational efficiency: the forward pass computes y = x @ W.T + b.
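A small sketch illustrating this (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=3, out_features=5)
print(linear.weight.shape)  # torch.Size([5, 3]) -- (out_features, in_features)

x = torch.randn(2, 3)
# The forward pass applies the transposed weight: y = x @ W.T + b
y_manual = x @ linear.weight.T + linear.bias
print(torch.allclose(linear(x), y_manual))  # True
```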