Backpropagation

Tip

Dive deep (pun intended) into neural nets and try to become a backprop ninja with Karpathy’s courses.

Derive Backprop Gradients step by step

dprobs = dlogprobs / probs
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
dcounts = dprobs * counts_sum_inv
dcounts_sum = -dcounts_sum_inv * counts_sum ** -2
dcounts += dcounts_sum.broadcast_to(counts.shape)
dnorm_logits = dcounts * norm_logits.exp() # norm_logits.exp() is actually counts
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1 # try F.one_hot
dlogits += dlogit_maxes * tmp
dh = dlogits @ W2.T
dW2 = h.T @ dlogits
db2 = dlogits.sum(0, keepdim=False)
# dhpreact = dh * (1 - torch.tanh(hpreact) ** 2)
# dhpreact = (1.0 - h ** 2) * dh # figure out later
dhpreact = hpreact.grad.clone()
dbngain = (dhpreact * bnraw).sum(0, keepdim=True)
dbnbias = dhpreact.sum(0, keepdim=True)
dbnraw = dhpreact * bngain
dbnvar_inv = (dbnraw * bndiff).sum(0, keepdim=True)
dbndiff = dbnraw * bnvar_inv
# dbnvar = dbnvar_inv * (-0.5) * bnvar_inv ** 3
dbnvar = dbnvar_inv * (-0.5) * (bnvar + 1e-5) ** -1.5
dbndiff2 = 1.0 / (n-1) * torch.ones_like(bndiff2) * dbnvar
dbndiff += 2 * bndiff * dbndiff2
dbnmeani = -dbndiff.sum(0, keepdim=True)
dhprebn = dbndiff.clone()
dhprebn += dbnmeani * 1.0 / n * torch.ones_like(hprebn)
dembcat = dhprebn @ W1.T
dW1 = embcat.T @ dhprebn
db1 = dhprebn.sum(0, keepdim=False)
demb = dembcat.view(emb.shape)
dC = torch.zeros_like(C)
for k in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        ix = Xb[k, j]
        dC[ix] += demb[k, j]
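The manual gradients above can be checked against PyTorch’s autograd. Below is a minimal sketch of a comparison helper in the spirit of Karpathy’s exercises; the helper name cmp and the printed format are assumptions, and the commented calls assume the tensors from the listing above with loss.backward() already run.

import torch

def cmp(s, dt, t):
    # Compare a manually derived gradient dt against t.grad from autograd.
    ex = torch.all(dt == t.grad).item()          # exact match
    app = torch.allclose(dt, t.grad)             # match up to float tolerance
    maxdiff = (dt - t.grad).abs().max().item()   # worst-case deviation
    print(f'{s:15s} | exact: {ex} | approximate: {app} | maxdiff: {maxdiff}')

# e.g., after running the forward pass and loss.backward():
# cmp('logprobs', dlogprobs, logprobs)
# cmp('probs', dprobs, probs)
# cmp('logits', dlogits, logits)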

dlogprobs

Notation-wise, dlogprobs stands for the gradient of the loss with respect to logprobs. To start, we have the following

$$\text{loss}=-\frac{1}{n}\sum_{i=1}^{n}\text{logprobs}_{i,\,(Y_b)_i}$$

which easily yields the gradient as follows.

$$\left(\frac{\partial\,\text{loss}}{\partial\,\text{logprobs}}\right)_{i,j}=\begin{cases}-\frac{1}{n}, & j=(Y_b)_i\\ 0, & \text{otherwise}\end{cases}$$
dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1. / n
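As a quick sanity check of this gradient, each row has exactly one nonzero entry and all entries sum to $-1$. A small standalone illustration (the batch size and vocabulary size below are assumptions for demonstration):

import torch

n = 4                              # assumed batch size
logprobs = torch.randn(n, 27)      # stand-in for the real log-probabilities
Yb = torch.randint(0, 27, (n,))    # stand-in targets

dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1. / n

assert (dlogprobs != 0).sum() == n                       # one nonzero per row
assert torch.isclose(dlogprobs.sum(), torch.tensor(-1.))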

dprobs

Continuing the process, we have

$$\text{logprobs}_{i,j}=\log\left(\text{probs}_{i,j}\right)$$

Hence

$$\left(\frac{\partial\,\text{loss}}{\partial\,\text{probs}}\right)_{i,j}=\left(\frac{\partial\,\text{loss}}{\partial\,\text{logprobs}}\right)_{i,j}\cdot\left(\frac{\partial\,\text{logprobs}}{\partial\,\text{probs}}\right)_{i,j}=\left(\frac{\partial\,\text{loss}}{\partial\,\text{logprobs}}\right)_{i,j}\cdot\frac{1}{\text{probs}_{i,j}}$$
dprobs = dlogprobs / probs

dcounts_sum_inv


For simplicity of notation, denote counts_sum_inv by $\text{csi}$. Note that probs = counts * counts_sum_inv is, elementwise,

$$\text{probs}_{i,j}=\text{counts}_{i,j}\cdot\text{csi}_{i}$$

>>> counts.shape, counts_sum_inv.shape
(torch.Size([32, 27]), torch.Size([32, 1]))

Hence

$$\left(\frac{\partial\,\text{probs}}{\partial\,\text{csi}}\right)_{i,j}=\text{counts}_{i,j}$$

and

$$\left(\frac{\partial\,\text{loss}}{\partial\,\text{csi}}\right)_{i}=\sum_j\left(\frac{\partial\,\text{loss}}{\partial\,\text{probs}}\right)_{i,j}\cdot\left(\frac{\partial\,\text{probs}}{\partial\,\text{csi}}\right)_{i,j}=\sum_j\left(\frac{\partial\,\text{loss}}{\partial\,\text{probs}}\right)_{i,j}\cdot\text{counts}_{i,j}$$
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
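The .sum(1, keepdim=True) appears because counts_sum_inv was broadcast across each row in the forward pass, and a broadcast in the forward pass turns into a sum over the broadcast dimension in the backward pass. A small autograd check of this rule (the tensor shapes are illustrative assumptions):

import torch

a = torch.randn(32, 27)
b = torch.randn(32, 1, requires_grad=True)   # broadcast along dim 1, like counts_sum_inv

out = (a * b).sum()
out.backward()

# Manual gradient: sum the per-element contributions over the broadcast dim.
db_manual = a.sum(1, keepdim=True)
assert torch.allclose(b.grad, db_manual)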

dcounts

Note that from probs = counts * counts_sum_inv one also has

probs = counts * counts_sum_inv

which leads to

$$\left(\frac{\partial\,\text{probs}}{\partial\,\text{counts}}\right)_{i,j}=\text{csi}_{i}$$

However, besides contributing to the loss through probs, counts also contributes through counts_sum and then counts_sum_inv, so the complete gradient dcounts remains to be determined (hence the dcounts += ... line in the listing above).

dcounts_sum

counts_sum_inv = counts_sum**-1

leads to

$$\left(\frac{\partial\,\text{csi}}{\partial\,\text{cs}}\right)_{i}=-\text{cs}_{i}^{-2}$$

where $\text{cs}$ denotes counts_sum,

and then

$$\left(\frac{\partial\,\text{loss}}{\partial\,\text{cs}}\right)_{i}=-\left(\frac{\partial\,\text{loss}}{\partial\,\text{csi}}\right)_{i}\cdot\text{cs}_{i}^{-2}$$

dnorm_logits

dnorm_logits = dcounts * norm_logits.exp() # norm_logits.exp() is actually counts

dlogit_maxes

dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)

dlogits

dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1 # try F.one_hot
dlogits += dlogit_maxes * tmp

Backprop through cross_entropy but All in One Go

Basically what happens in the forward pass can be described by the following pseudocode

logprobs = log(softmax(logits, 1))
loss = -mean(logprobs[range(n), Yb])

To discuss the derivative of each single element, fix a row $i$ and denote $(Y_b)_i$ by $y$. For simplicity of notation, denote logits by $\text{lg}$.

$$\text{loss}=-\frac{1}{n}\sum_{i=1}^{n}\text{logprobs}_{i,y}=-\frac{1}{n}\sum_{i=1}^{n}\log\left(\text{probs}_{i,y}\right)=-\frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{\exp\{\text{lg}_{i,y}\}}{\sum_k\exp\{\text{lg}_{i,k}\}}\right)$$

Note

Keep in mind that the maximum of the logits (in each row) is supposed to be subtracted in the numerator. It is omitted here because it does not affect the gradient of the loss.

Now apply the chain rule to derive the derivatives. If $j \neq y$,

$$\left(\frac{\partial\,\text{loss}}{\partial\,\text{lg}}\right)_{i,j}=-\frac{1}{n}\cdot\frac{\sum_k\exp\{\text{lg}_{i,k}\}}{\exp\{\text{lg}_{i,y}\}}\cdot(-1)\cdot\frac{\exp\{\text{lg}_{i,y}\}}{\left(\sum_k\exp\{\text{lg}_{i,k}\}\right)^2}\cdot\exp\{\text{lg}_{i,j}\}$$

which yields

$$\left(\frac{\partial\,\text{loss}}{\partial\,\text{lg}}\right)_{i,j}=\frac{1}{n}\cdot\frac{\exp\{\text{lg}_{i,j}\}}{\sum_k\exp\{\text{lg}_{i,k}\}}=\frac{1}{n}\cdot\text{softmax}(\text{lg}_{i,\cdot})_j$$

If 𝑗=𝑦,

$$\left(\frac{\partial\,\text{loss}}{\partial\,\text{lg}}\right)_{i,y}=-\frac{1}{n}\cdot\frac{\sum_k\exp\{\text{lg}_{i,k}\}}{\exp\{\text{lg}_{i,y}\}}\cdot\frac{\exp\{\text{lg}_{i,y}\}\cdot\sum_k\exp\{\text{lg}_{i,k}\}-\exp^2\{\text{lg}_{i,y}\}}{\left(\sum_k\exp\{\text{lg}_{i,k}\}\right)^2}$$

which yields

$$\left(\frac{\partial\,\text{loss}}{\partial\,\text{lg}}\right)_{i,y}=-\frac{1}{n}+\frac{1}{n}\cdot\frac{\exp\{\text{lg}_{i,y}\}}{\sum_k\exp\{\text{lg}_{i,k}\}}=\frac{1}{n}\cdot\left(\text{softmax}(\text{lg}_{i,\cdot})_y-1\right)$$

While the two cases are discussed separately, they share the common softmax term, which allows a compact implementation:

dlogits = F.softmax(logits, 1)
dlogits[range(n), Yb] -= 1
dlogits /= n
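This closed form can be verified against autograd through F.cross_entropy, which fuses the same forward pass (the batch and vocabulary sizes below are assumptions):

import torch
import torch.nn.functional as F

n = 32
logits = torch.randn(n, 27, requires_grad=True)
Yb = torch.randint(0, 27, (n,))

loss = F.cross_entropy(logits, Yb)   # fused forward pass, mean reduction
loss.backward()

with torch.no_grad():
    dlogits = F.softmax(logits, 1)
    dlogits[range(n), Yb] -= 1
    dlogits /= n

assert torch.allclose(logits.grad, dlogits, atol=1e-6)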

Finish calculating the remaining derivatives. Read more

A typical training loop looks like this. One thing worth mentioning is the line optimizer.zero_grad().

Gradients in PyTorch are accumulated by default. Without zeroing them, gradients from multiple backpropagation steps would add up, leading to incorrect model updates.

In other words, zeroing gradients ensures that each backward pass calculates gradients based only on the current mini-batch.

# Training loop
losses = []
for epoch in range(num_epochs):
    epoch_loss = 0
    for batch_X, batch_y in dataloader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    # Record average loss for the epoch
    avg_loss = epoch_loss / len(dataloader)
    losses.append(avg_loss)
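The loop above assumes that model, criterion, optimizer, dataloader, and num_epochs are already defined. One possible minimal setup to pair with it, where all names, sizes, and hyperparameters are illustrative assumptions:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data; the shapes are arbitrary.
X = torch.randn(256, 10)
y = torch.randn(256, 1)
dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
num_epochs = 10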

These steps are part of each training loop iteration and help in adjusting the model’s parameters to improve its predictions.

  1. optimizer.zero_grad():

    • Purpose: Clears old gradients.
    • What Happens: Resets the gradients of all model parameters to zero. This prevents the accumulation of gradients from multiple backpropagations, ensuring each mini-batch is computed independently.
  2. loss.backward():

    • Purpose: Computes gradients.
    • What Happens: Performs backpropagation to compute the gradient of the loss with respect to each parameter (weight and bias). This is where the network learns, by calculating how much each parameter needs to change to minimize the loss.
  3. optimizer.step():

    • Purpose: Updates parameters.
    • What Happens: Updates the model parameters based on the calculated gradients. This step uses the gradients computed during loss.backward() to adjust the weights in an attempt to minimize the loss.
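The accumulation behavior can be observed directly on a single parameter; a small sketch with arbitrarily chosen values:

import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])

(w * x).sum().backward()
print(w.grad)        # tensor([3.])

# Without zeroing, a second backward pass adds onto the existing gradient.
(w * x).sum().backward()
print(w.grad)        # tensor([6.])

w.grad.zero_()       # what optimizer.zero_grad() does for each parameter
(w * x).sum().backward()
print(w.grad)        # tensor([3.]) again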

Loss Functions

hinge loss

cross-entropy loss

The cross-entropy between a “true” distribution p and an estimated distribution q is defined as:

$$H(p,q)=-\sum_x p(x)\log q(x)$$
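With a one-hot “true” distribution, the sum collapses to the negative log-probability assigned to the correct class. A quick numeric check (the distributions are made up for illustration):

import math

p = [0.0, 1.0, 0.0]   # "true" distribution (one-hot)
q = [0.2, 0.7, 0.1]   # estimated distribution

H = -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
print(H)              # -log(0.7), roughly 0.357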

Possibly confusing naming conventions

To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn’t make sense to talk about the “softmax loss”, since softmax is just the squashing function, but it is a relatively commonly used shorthand.

focal loss

readme

ROC-AUC

Basic Concepts

Consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive or negative. There are four possible outcomes from a binary classifier:

  • True Positive (TP): the prediction is positive and the actual value is positive.
  • True Negative (TN): the prediction is negative and the actual value is negative.
  • False Positive (FP): the prediction is positive and the actual value is negative.
  • False Negative (FN): the prediction is negative and the actual value is positive.

Precision measures the accuracy of positive predictions. It is the ratio of true positive predictions to the total positive predictions (true positives + false positives). High precision indicates that fewer false positives are present.

$$\text{Precision}=\frac{TP}{TP+FP}$$

Recall (also known as sensitivity or true positive rate) measures the ability of a model to identify all relevant instances. It is the ratio of true positive predictions to the actual number of positive cases (true positives + false negatives). High recall means the model successfully captured most of the actual positives.

$$\text{Recall}=\frac{TP}{TP+FN}$$

The F1 score is the harmonic mean of precision and recall:

$$F_1=\frac{2}{\frac{1}{\text{Precision}}+\frac{1}{\text{Recall}}}$$
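A tiny worked computation of the three metrics (the confusion-matrix counts are made up for illustration):

# Assumed confusion-matrix counts.
TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)               # 0.8
recall = TP / (TP + FN)                  # about 0.667
f1 = 2 / (1 / precision + 1 / recall)    # harmonic mean, about 0.727
print(precision, recall, f1)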

Calculation and Interpretation

Note

The area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming positive ranks higher than negative). 1

some probabilistic interpretation

The ROC curve $\{(x,y)\}$ can be parameterized by $y=\mathrm{TPR}(T)$ and $x=\mathrm{FPR}(T)$.

Given a threshold parameter $T$, an instance is classified as positive if $X>T$ and negative otherwise. $X$ follows a probability density function $f_1(x)$ if the instance actually belongs to the positive class, and $f_0(x)$ otherwise. Therefore, the true positive rate is given by $\mathrm{TPR}(T)=\int_T^{\infty}f_1(x)\,dx$ and the false positive rate by $\mathrm{FPR}(T)=\int_T^{\infty}f_0(x)\,dx$. Using the maps $T\mapsto y$ via $\mathrm{TPR}$ and $T\mapsto x$ via $\mathrm{FPR}$, the area under the curve is given by:

$$\begin{aligned}
\mathrm{AUC}&=\int_{0}^{1}y(x)\,dx=\int_{0}^{1}\mathrm{TPR}\left(\mathrm{FPR}^{-1}(x)\right)dx=\int_{\infty}^{-\infty}\mathrm{TPR}(T)\,\mathrm{FPR}'(T)\,dT=\int_{-\infty}^{\infty}\mathrm{TPR}(T)\,f_0(T)\,dT\\
&=\int_{-\infty}^{\infty}\left(\int_{T}^{\infty}f_1(x)\,dx\right)f_0(T)\,dT=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\mathbb{1}\{x\ge T\}\,f_1(x)\,f_0(T)\,dx\,dT=P(X_1\ge X_0)
\end{aligned}$$
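The identity can be sanity-checked numerically. Assuming, purely for illustration, Gaussian score densities $f_1=\mathcal{N}(1,1)$ and $f_0=\mathcal{N}(0,1)$, the closed form $P(X_1>X_0)=\Phi\!\left((\mu_1-\mu_0)/\sqrt{\sigma_1^2+\sigma_0^2}\right)$ can be compared with an empirical estimate over random pairs:

import torch
from torch.distributions import Normal

torch.manual_seed(0)
m = 100_000
x1 = torch.randn(m) + 1.0          # positive-class scores ~ N(1, 1)
x0 = torch.randn(m)                # negative-class scores ~ N(0, 1)

# Empirical P(X1 > X0) over independent random pairs.
p_emp = (x1 > x0).float().mean()

# Closed form: Phi((mu1 - mu0) / sqrt(var1 + var0)) = Phi(1/sqrt(2)).
p_true = Normal(0.0, 1.0).cdf(torch.tensor(1.0 / 2 ** 0.5))

print(p_emp.item(), p_true.item())   # both roughly 0.76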

Read more on ROC’s Wiki page and some CSDN blog post

Suppose the numbers of positive and negative samples in a batch are 𝑀 and 𝑁, respectively.

$$\mathrm{AUC}=\frac{\sum_{\text{pos},\,\text{neg}}\mathbb{I}(P_{\text{pos}},P_{\text{neg}})}{M\cdot N}$$

where 𝕀 is a characteristic function:

$$\mathbb{I}(p,q)=\begin{cases}1, & p>q\\ 0.5, & p=q\\ 0, & p<q\end{cases}$$
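A direct $O(MN)$ implementation of this pairwise formula might look as follows (a plain-Python sketch; labels and scores are assumed to be parallel lists with labels in {0, 1}):

def auc_pairwise(labels, scores):
    # Count pairs per the characteristic function above: O(M*N).
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    total = 0.0
    for p in pos:
        for q in neg:
            total += 1.0 if p > q else (0.5 if p == q else 0.0)
    return total / (len(pos) * len(neg))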

The time complexity is $O(MN)$. Denote by $r_i$ the rank of the $i$-th sample, i.e., for the sample $x_j$ with the smallest prediction score, $r_j=0$, and for the sample $x_k$ with the largest prediction score, $r_k=M+N-1$.

For each positive sample, its rank equals the number of samples scored below it, so summing the ranks of all positive samples counts every pair in which a positive sample is picked over another sample (because of its higher score).

Therefore, one can first sum up all ranks of positive samples, then subtract the pairs in which both samples are positive. For the highest-ranked positive sample there are $M-1$ positive samples below it, contributing a subtraction of $M-1$; a simple deduction shows the total subtraction is

$$(M-1)+\cdots+1+0=\frac{(M-1)M}{2}.$$

Finally we have

$$\mathrm{AUC}=\frac{\sum_{i\in\text{Positive}}r_i-\frac{(M-1)M}{2}}{M\cdot N}$$

The time complexity is $O((M+N)\log(M+N))$, dominated by the sort, which is noticeably better than $O(MN)$ when $M$ and $N$ are large.

Now we calculate with an example:

| real label | pred score | rank |
| --- | --- | --- |
| 1 | 0.9 | 5 |
| 1 | 0.7 | 4 |
| 0 | 0.6 | 3 |
| 1 | 0.55 | 2 |
| 0 | 0.2 | 1 |
| 0 | 0.1 | 0 |

  • Calculate AUC with the original formula (the positives scored 0.9 and 0.7 each outrank all 3 negatives, while the one scored 0.55 outranks 2):
$$\mathrm{AUC}=\frac{3+3+2}{3\cdot 3}=\frac{8}{9}.$$
  • Calculate AUC with the quicker formula:
$$\mathrm{AUC}=\frac{(5+4+2)-\frac{(3-1)\cdot 3}{2}}{3\cdot 3}=\frac{8}{9}.$$
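A rank-based implementation of the quicker formula, checked against the table above (a sketch that ignores tied scores, which the example does not have):

def auc_by_ranks(labels, scores):
    # One sort dominates the cost: O((M+N) log (M+N)); rank 0 = smallest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = {i: r for r, i in enumerate(order)}
    M = sum(labels)
    N = len(labels) - M
    pos_rank_sum = sum(ranks[i] for i, l in enumerate(labels) if l == 1)
    return (pos_rank_sum - (M - 1) * M / 2) / (M * N)

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.55, 0.2, 0.1]
print(auc_by_ranks(labels, scores))   # 0.888..., i.e., 8/9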

CUDA

Warning

FileNotFoundError: [Errno 2] No such file or directory: ':/usr/local/cuda-11.8:/usr/local/cuda-11.8/bin/nvcc'

solution

export CUDA_HOME=/usr/local/cuda

or change export CUDA_HOME=/usr/local/cuda-11.8 in ~/.bashrc to export CUDA_HOME=/usr/local/cuda.

For more detail, refer to the GitHub issue.

Footnotes

  1. Fawcett, Tom (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.