Backpropagation
Tip
Dive deep (pun intended) into neural nets with some exciting learnings! Try to become a backprop ninja with Karpathy's courses.
Derive Backprop Gradients step by step
dprobs = dlogprobs / probs
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
dcounts = dprobs * counts_sum_inv
dcounts_sum = -dcounts_sum_inv * counts_sum ** -2
dcounts += dcounts_sum.broadcast_to(counts.shape)
dnorm_logits = dcounts * norm_logits.exp() # norm_logits.exp() is actually counts
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1 # try F.one_hot
dlogits += dlogit_maxes * tmp
dh = dlogits @ W2.T
dW2 = h.T @ dlogits
db2 = dlogits.sum(0, keepdim=False)
# dhpreact = dh * (1 - torch.tanh(hpreact) ** 2)
# dhpreact = (1.0 - h ** 2) * dh # figure out later
dhpreact = hpreact.grad.clone()
dbngain = (dhpreact * bnraw).sum(0, keepdim=True)
dbnbias = dhpreact.sum(0, keepdim=True)
dbnraw = dhpreact * bngain
dbnvar_inv = (dbnraw * bndiff).sum(0, keepdim=True)
dbndiff = dbnraw * bnvar_inv
# dbnvar = dbnvar_inv * (-0.5) * bnvar_inv ** 3
dbnvar = dbnvar_inv * (-0.5) * (bnvar + 1e-5) ** -1.5
dbndiff2 = 1.0 / (n-1) * torch.ones_like(bndiff2) * dbnvar
dbndiff += 2 * bndiff * dbndiff2
dbnmeani = -dbndiff.sum(0, keepdim=True)
dhprebn = dbndiff.clone()
dhprebn += dbnmeani * 1.0 / n * torch.ones_like(hprebn)
dembcat = dhprebn @ W1.T
dW1 = embcat.T @ dhprebn
db1 = dhprebn.sum(0, keepdim=False)
demb = dembcat.view(emb.shape)
dC = torch.zeros_like(C)
for k in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        ix = Xb[k,j]
        dC[ix] += demb[k,j]
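Each of these manual gradients can be checked against PyTorch's autograd. Below is a small comparison helper along the lines of the one used in the course notebook (a sketch; it assumes the intermediate tensors had `retain_grad()` called before `loss.backward()`):

```python
def cmp(name, dt, t):
    # compare a manually derived gradient dt against autograd's t.grad
    exact = torch.all(dt == t.grad).item()          # bitwise identical?
    approx = torch.allclose(dt, t.grad)             # equal up to float tolerance?
    maxdiff = (dt - t.grad).abs().max().item()      # largest absolute difference
    print(f'{name:15s} | exact: {exact} | approx: {approx} | maxdiff: {maxdiff}')

cmp('logprobs', dlogprobs, logprobs)
cmp('probs', dprobs, probs)
cmp('logits', dlogits, logits)
cmp('W2', dW2, W2)
```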
dlogprobs
Notation-wise, `dlogprobs` stands for the gradient of the loss with respect to `logprobs` (and similarly for the other `d`-prefixed tensors). To start we have

$$\text{loss} = -\frac{1}{n}\sum_{i=1}^{n}\text{logprobs}[i,\ \text{Yb}[i]],$$

which easily yields the gradient as follows.
dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1. / n
dprobs
Continuing the process, we have

$$\text{logprobs} = \log(\text{probs}).$$

Hence
dprobs = dlogprobs / probs
dcounts_sum_inv
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
Note that `probs = counts * counts_sum_inv` is in fact an elementwise product with broadcasting. For simplicity of notation, denote `counts_sum_inv` by `csi`, so that

$$\text{probs}[i,j] = \text{counts}[i,j]\cdot\text{csi}[i].$$
>>> counts.shape, counts_sum_inv.shape
(torch.Size([32, 27]), torch.Size([32, 1]))
Hence

$$\text{dcsi}[i] = \frac{\partial\,\text{loss}}{\partial\,\text{csi}[i]} = \sum_{j}\text{dprobs}[i,j]\cdot\text{counts}[i,j],$$

and
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
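The `.sum(1, keepdim=True)` mirrors the broadcasting in the forward pass: `counts_sum_inv` has shape `[32, 1]` and is reused across all 27 columns, so its gradient accumulates over that dimension. A tiny autograd experiment (illustrative shapes only, not the actual tensors) confirms the pattern:

```python
import torch

a = torch.randn(4, 3)                       # plays the role of counts
b = torch.randn(4, 1, requires_grad=True)   # plays the role of counts_sum_inv
(a * b).sum().backward()                    # upstream gradient of all ones
# b is reused across columns, so its gradient sums over dim 1
print(torch.allclose(b.grad, a.sum(1, keepdim=True)))  # True
```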
dcounts
Note that from `probs = counts * counts_sum_inv` one also has

$$\frac{\partial\,\text{probs}[i,j]}{\partial\,\text{counts}[i,j]} = \text{csi}[i],$$

which leads to the contribution through `probs`:

dcounts = dprobs * counts_sum_inv

However, other than contributing to the loss through `probs`, `counts` also does so through `counts_sum` and then `counts_sum_inv`. The complete gradient `dcounts` therefore remains to be determined.
dcounts_sum
Recall that `counts_sum_inv = counts_sum**-1`, where `counts_sum = counts.sum(1, keepdim=True)`. The former leads to

$$\text{dcounts\_sum}[i] = -\,\text{dcounts\_sum\_inv}[i]\cdot\text{counts\_sum}[i]^{-2},$$

and then, since every element in row $i$ of `counts` contributes equally to `counts_sum[i]`, each `dcounts[i,j]` receives an extra contribution of `dcounts_sum[i]`:

dcounts_sum = -dcounts_sum_inv * counts_sum ** -2
dcounts += dcounts_sum.broadcast_to(counts.shape)
dnorm_logits
dnorm_logits = dcounts * norm_logits.exp() # norm_logits.exp() is actually counts
dlogit_maxes
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
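Since subtracting the row-wise maximum does not change `probs` (see the note in the cross-entropy section below), the loss does not actually depend on `logit_maxes`, so this gradient should come out as numerically zero:

```python
# mathematically zero; in practice only a tiny float round-off residue remains
print(dlogit_maxes.abs().max())
```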
dlogits
Since `norm_logits = logits - logit_maxes`, `logits` receives gradient directly through `norm_logits` and, at the position of each row-wise maximum, through `logit_maxes` as well.

dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1 # try F.one_hot
dlogits += dlogit_maxes * tmp
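As the `# try F.one_hot` comment suggests, the indicator matrix can also be built with `torch.nn.functional.one_hot` (a sketch, equivalent up to dtype promotion):

```python
import torch.nn.functional as F

dlogits = dnorm_logits.clone()
# one-hot rows mark where each row-wise max was taken; dlogit_maxes flows only there
dlogits += dlogit_maxes * F.one_hot(logits.max(1).indices, num_classes=logits.shape[1])
```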
Backprop through cross_entropy but All in One Go
Basically, what happens in the forward pass can be described by the following pseudocode:
logprobs = log(softmax(logits, 1))
loss = -mean(logprobs[range(n), Yb])
To discuss the derivative of each single element, fix a row $i$ and denote $\text{logits}[i,j]$ by $l_j$ and the label $\text{Yb}[i]$ by $y$. For simplicity of notation, denote the softmax output $\frac{e^{l_j}}{\sum_k e^{l_k}}$ by $p_j$.
Note
Keep in mind that the maximum of the logits (in each row) is supposed to be subtracted in the numerator. It is omitted here because it does not affect the gradient of the loss.
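The invariance to this shift is easy to verify numerically (a self-contained sketch with arbitrary shapes):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)
shifted = logits - logits.max(1, keepdim=True).values
# subtracting a per-row constant does not change the softmax output
print(torch.allclose(F.softmax(logits, 1), F.softmax(shifted, 1)))  # True
```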
Now apply the chain rule to the per-row loss $-\log p_y$. If $j = y$,

$$\frac{\partial}{\partial l_j}\left(-\log\frac{e^{l_y}}{\sum_k e^{l_k}}\right) = -1 + \frac{e^{l_y}}{\sum_k e^{l_k}},$$

which yields

$$\frac{\partial}{\partial l_j}\left(-\log p_y\right) = p_y - 1.$$

If $j \neq y$,

$$\frac{\partial}{\partial l_j}\left(-\log\frac{e^{l_y}}{\sum_k e^{l_k}}\right) = \frac{e^{l_j}}{\sum_k e^{l_k}},$$

which yields

$$\frac{\partial}{\partial l_j}\left(-\log p_y\right) = p_j.$$

While the two cases are discussed separately, they share a common part: the softmax $p_j$. Since the overall loss is the mean over the $n$ rows, every entry is additionally scaled by $1/n$, which gives exactly the code below.
dlogits = F.softmax(logits, 1)
dlogits[range(n), Yb] -= 1
dlogits /= n
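A quick way to convince yourself of this fused gradient is to compare it against autograd on `F.cross_entropy` (a self-contained sketch with made-up shapes):

```python
import torch
import torch.nn.functional as F

n, vocab = 32, 27
logits = torch.randn(n, vocab, requires_grad=True)
Yb = torch.randint(0, vocab, (n,))

F.cross_entropy(logits, Yb).backward()   # reduction='mean' by default

dlogits = F.softmax(logits, 1).detach()  # the "all in one go" gradient
dlogits[range(n), Yb] -= 1
dlogits /= n

print(torch.allclose(dlogits, logits.grad))  # True
```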
Finish calculating the remaining derivatives. Read more
A typical training loop looks like this. One thing worth mentioning is the line `optimizer.zero_grad()`.
Gradients in PyTorch are accumulated by default. Without zeroing them, gradients from multiple backpropagation steps would add up, leading to incorrect model updates.
In other words, zeroing gradients ensures that each backward pass calculates gradients based only on the current mini-batch.
# Training loop
losses = []
for epoch in range(num_epochs):
    epoch_loss = 0
    for batch_X, batch_y in dataloader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    # Record average loss for the epoch
    avg_loss = epoch_loss / len(dataloader)
    losses.append(avg_loss)
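To see the accumulation behavior mentioned above, note that calling `backward()` repeatedly without zeroing simply adds into `.grad` (a minimal illustration):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
for _ in range(3):
    (2 * w).backward()   # each pass contributes d(2w)/dw = 2
print(w.grad)            # tensor(6.) -- three backward passes accumulated
```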
These steps are part of each training loop iteration and help in adjusting the model's parameters to improve its predictions.
- `optimizer.zero_grad()`:
  - Purpose: Clears old gradients.
  - What Happens: Resets the gradients of all model parameters to zero. This prevents the accumulation of gradients from multiple backpropagations, ensuring each mini-batch is computed independently.
- `loss.backward()`:
  - Purpose: Computes gradients.
  - What Happens: Performs backpropagation to compute the gradient of the loss with respect to each parameter (weight and bias). This is where the network learns, by calculating how much each parameter needs to change to minimize the loss.
- `optimizer.step()`:
  - Purpose: Updates parameters.
  - What Happens: Updates the model parameters based on the calculated gradients. This step uses the gradients computed during `loss.backward()` to adjust the weights in an attempt to minimize the loss (see the SGD sketch after this list).
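For plain SGD, `optimizer.step()` amounts to roughly the following manual update (a sketch that ignores momentum and weight decay; `model` is the model from the loop above and `lr` is a hypothetical learning rate):

```python
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad   # descend along the gradient computed by loss.backward()
```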
Loss Functions
hinge loss
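For reference, one common form of the multiclass hinge loss (as used by the SVM classifier discussed below), for an example with class scores $s_j$ and correct class $y_i$, is

$$L_i = \sum_{j\neq y_i}\max\left(0,\ s_j - s_{y_i} + \Delta\right),$$

where $\Delta$ is the margin (often set to 1).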
cross-entropy loss
The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is defined as:

$$H(p, q) = -\sum_{x} p(x)\log q(x).$$
Possibly confusing naming conventions
To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn't make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand.
focal loss
readme
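For reference, the focal loss (from the RetinaNet paper) down-weights well-classified examples relative to cross-entropy:

$$\text{FL}(p_t) = -\left(1 - p_t\right)^{\gamma}\log(p_t),$$

where $p_t$ is the predicted probability of the true class and $\gamma \ge 0$ is a focusing parameter; an $\alpha$-balanced variant multiplies by a weight $\alpha_t$.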
ROC-AUC
Basic Concepts
Consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive or negative. There are four possible outcomes from a binary classifier:
- True Positive (TP): the prediction is positive and the actual value is positive.
- True Negative (TN): the prediction is negative and the actual value is negative.
- False Positive (FP): the prediction is positive and the actual value is negative.
- False Negative (FN): the prediction is negative and the actual value is positive.
Precision measures the accuracy of positive predictions. It is the ratio of true positive predictions to the total positive predictions (true positives + false positives). High precision indicates that fewer false positives are present.
Recall (also known as sensitivity or true positive rate) measures the ability of a model to identify all relevant instances. It is the ratio of true positive predictions to the actual number of positive cases (true positives + false negatives). High recall means the model successfully captured most of the actual positives.
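In terms of the counts above,

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$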
The F1 score is the harmonic mean of precision and recall:

$$F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}.$$
Calculation and Interpretation
Note
The area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming positive ranks higher than negative). 1
some probabilistic interpretation
The ROC curve can be parameterized by $x = \text{FPR}(T)$ and $y = \text{TPR}(T)$.

Given a threshold parameter $T$, an instance is identified as positive if its score $X$ satisfies $X > T$ and negative otherwise. $X$ follows a probability density function $f_1(x)$ if the instance actually belongs to the positive class, and $f_0(x)$ otherwise. Therefore, the true positive rate is given by $\text{TPR}(T) = \int_T^{\infty} f_1(x)\,dx$ and the false positive rate is given by $\text{FPR}(T) = \int_T^{\infty} f_0(x)\,dx$. According to the maps $T \mapsto \text{FPR}(T)$ and $T \mapsto \text{TPR}(T)$, the area under the curve is given by:

$$\text{AUC} = \int_0^1 \text{TPR}\big(\text{FPR}^{-1}(x)\big)\,dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \mathbb{1}[x' > x]\, f_1(x')\, f_0(x)\,dx\,dx' = P(X_1 > X_0),$$

where $X_1$ is the score of a randomly chosen positive instance and $X_0$ the score of a randomly chosen negative instance.
Read more on ROC's Wiki page and some CSDN blog post.
Suppose the numbers of positive and negative samples in a batch are $M$ and $N$, respectively. Then

$$\text{AUC} = \frac{\sum_{i\in\text{pos}}\sum_{j\in\text{neg}} \mathbb{1}(p_i > p_j)}{M\cdot N},$$

where $\mathbb{1}(\cdot)$ is a characteristic function:

$$\mathbb{1}(p_i > p_j) = \begin{cases} 1 & p_i > p_j,\\ 0.5 & p_i = p_j,\\ 0 & p_i < p_j. \end{cases}$$

The time complexity is $O(M\cdot N)$. Denote by $r_i$ the rank of the $i$-th sample, i.e., $r_i = 1$ for the sample with the smallest prediction score and $r_i = M+N$ for the sample with the largest prediction score.

For each positive sample, its rank counts the samples ranked at or below it, which is almost its contribution towards the count of pairs where a positive sample is picked over a negative sample (because of a higher score), except that it also counts itself and the positive samples below it.

Therefore, one can first sum up all ranks from positive samples, then subtract those cases where two positive samples are paired. Note that for the highest ranked positive sample, there are $M-1$ positive samples below it, which together with itself corresponds to a subtraction of $M$; for the next one the subtraction is $M-1$, and so on. A simple deduction leads to a final subtraction of

$$M + (M-1) + \cdots + 1 = \frac{M(M+1)}{2}.$$

Finally we have

$$\text{AUC} = \frac{\sum_{i\in\text{pos}} r_i - \frac{M(M+1)}{2}}{M\cdot N}.$$

The time complexity is $O\big((M+N)\log(M+N)\big)$, dominated by sorting, which is noticeably better than $O(M\cdot N)$ when $M$ and $N$ are large.
Now we calculate AUC with an example (see also the code sketch after this list):
- Calculate AUC with the original formula:
- Calculate AUC with the quicker formula:
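Since the worked numbers are not reproduced here, the following self-contained sketch (with made-up scores and labels) computes the AUC both ways and shows that they agree:

```python
import torch

def auc_pairwise(scores, labels):
    """AUC via the original O(M*N) pairwise formula; ties count as 0.5."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos) * len(neg))

def auc_by_rank(scores, labels):
    """AUC via the O((M+N)log(M+N)) rank formula (assumes no tied scores)."""
    order = scores.argsort()                         # ascending scores
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(1, len(scores) + 1)  # rank 1 = smallest score
    M = int((labels == 1).sum())
    N = int((labels == 0).sum())
    return (ranks[labels == 1].sum().item() - M * (M + 1) / 2) / (M * N)

# made-up illustrative batch
scores = torch.tensor([0.8, 0.3, 0.35, 0.1, 0.9, 0.4])
labels = torch.tensor([1,   0,   1,    0,   1,   0])
print(auc_pairwise(scores, labels), auc_by_rank(scores, labels))  # both 8/9
```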
CUDA
Warning
FileNotFoundError: [Errno 2] No such file or directory: ':/usr/local/cuda-11.8:/usr/local/cuda-11.8/bin/nvcc'
solution
export CUDA_HOME=/usr/local/cuda

or change `export CUDA_HOME=/usr/local/cuda-11.8` in `~/.bashrc` to `export CUDA_HOME=/usr/local/cuda`.
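After reloading the shell, a quick check from Python (assuming PyTorch is installed) shows which toolkit path extensions will pick up:

```python
import os
from torch.utils import cpp_extension

print(os.environ.get("CUDA_HOME"))   # expect /usr/local/cuda after the fix
print(cpp_extension.CUDA_HOME)       # the path torch uses to locate nvcc
```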
For more detail, refer to the GitHub issue.
Footnotes
- Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874. ↩