Backpropagation

Tip

Dive deep (pun intended) into neural nets and try to become a backprop ninja with Karpathy's courses.

Derive Backprop Gradients step by step

dprobs = dlogprobs / probs
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
dcounts = dprobs * counts_sum_inv
dcounts_sum = -dcounts_sum_inv * counts_sum ** -2
dcounts += dcounts_sum.broadcast_to(counts.shape)
dnorm_logits = dcounts * norm_logits.exp() # norm_logits.exp() is actually counts
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1 # try F.one_hot
dlogits += dlogit_maxes * tmp
dh = dlogits @ W2.T
dW2 = h.T @ dlogits
db2 = dlogits.sum(0, keepdim=False)
# dhpreact = dh * (1 - torch.tanh(hpreact) ** 2)
# dhpreact = (1.0 - h ** 2) * dh # analytic tanh backward, only approximately equal in float32
dhpreact = hpreact.grad.clone() # use autograd's value so downstream checks match exactly
dbngain = (dhpreact * bnraw).sum(0, keepdim=True)
dbnbias = dhpreact.sum(0, keepdim=True)
dbnraw = dhpreact * bngain
dbnvar_inv = (dbnraw * bndiff).sum(0, keepdim=True)
dbndiff = dbnraw * bnvar_inv
# dbnvar = dbnvar_inv * (-0.5) * bnvar_inv ** 3 # equivalent, since bnvar_inv = (bnvar + 1e-5) ** -0.5
dbnvar = dbnvar_inv * (-0.5) * (bnvar + 1e-5) ** -1.5
dbndiff2 = 1.0 / (n - 1) * torch.ones_like(bndiff2) * dbnvar
dbndiff += 2 * bndiff * dbndiff2
dbnmeani = -dbndiff.sum(0, keepdim=True)
dhprebn = dbndiff.clone()
dhprebn += dbnmeani * 1.0 / n * torch.ones_like(hprebn)
dembcat = dhprebn @ W1.T
dW1 = embcat.T @ dhprebn
db1 = dhprebn.sum(0, keepdim=False)
demb = dembcat.view(emb.shape)
dC = torch.zeros_like(C)
for k in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        ix = Xb[k, j]
        dC[ix] += demb[k, j]
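Each manual gradient above can be checked against PyTorch's autograd. Below is a minimal sketch of a comparison helper in the spirit of the lecture's cmp utility; it assumes loss.backward() has already run and that the intermediate tensors were registered with retain_grad() so their .grad fields are populated.

def cmp(name, dt, t):
    # dt is the manually derived gradient; t.grad is autograd's result
    exact = torch.all(dt == t.grad).item()
    approx = torch.allclose(dt, t.grad)
    maxdiff = (dt - t.grad).abs().max().item()
    print(f'{name:15s} | exact: {exact} | approx: {approx} | maxdiff: {maxdiff}')

cmp('dprobs', dprobs, probs)
cmp('dlogits', dlogits, logits)
cmp('dW2', dW2, W2)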

dlogprobs

Notation-wise, dx stands for the gradient of the loss with respect to x. To start we have the following

$$\text{loss} = -\frac{1}{n}\sum_{i=1}^{n} \text{logprobs}[i,\, Yb[i]]$$

which easily yields the gradient as follows: $-1/n$ at the $n$ label positions that were picked out, and $0$ everywhere else.

dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1. / n

dprobs

Continuing the process, we have logprobs = probs.log(), so

$$\frac{\partial\,\text{logprobs}_{ij}}{\partial\,\text{probs}_{ij}} = \frac{1}{\text{probs}_{ij}}$$

Hence, by the chain rule:

dprobs = dlogprobs / probs

dcounts_sum_inv

dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)

Note that probs = counts * counts_sum_inv is in fact a broadcasted product, as the shapes show. For simplicity of notation, denote counts_sum_inv by csi.

>>> counts.shape, counts_sum_inv.shape
(torch.Size([32, 27]), torch.Size([32, 1]))

Hence

$$\text{probs}_{ij} = \text{counts}_{ij}\cdot \text{csi}_i$$

and, since each $\text{csi}_i$ fans out to every column of row $i$, its gradient collects contributions from the entire row:

$$\text{dcsi}_i = \sum_j \text{dprobs}_{ij}\cdot \text{counts}_{ij}$$

dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
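The sum in this backward step mirrors the broadcast in the forward step: each csi_i fans out across a whole row, so its gradient gathers contributions from that whole row. A self-contained sketch (hypothetical shapes, not the model's tensors) verifies the rule against autograd:

import torch

counts = torch.rand(3, 4)
csi = torch.rand(3, 1, requires_grad=True)
probs = counts * csi               # csi is broadcast along dim 1
probs.sum().backward()             # upstream gradient dprobs is all ones here
manual = (torch.ones_like(probs) * counts).sum(1, keepdim=True)
print(torch.allclose(csi.grad, manual))  # True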

dcounts

Note that from

probs = counts * counts_sum_inv

one also has

$$\frac{\partial\,\text{probs}_{ij}}{\partial\,\text{counts}_{ij}} = \text{csi}_i$$

which leads to the first contribution, dcounts = dprobs * counts_sum_inv.

However, other than contributing to the loss through probs, counts also does so through counts_sum and then counts_sum_inv. The complete gradient dcounts remains to be determined until that second path is added.

dcounts_sum

The forward line

counts_sum_inv = counts_sum**-1

leads to

$$\text{dcounts\_sum}_i = -\,\text{counts\_sum}_i^{-2}\cdot \text{dcounts\_sum\_inv}_i$$

and then, since counts_sum = counts.sum(1, keepdim=True) means every element of row $i$ contributes equally to $\text{counts\_sum}_i$, the second path adds

dcounts += dcounts_sum.broadcast_to(counts.shape)

dnorm_logits

Since counts = norm_logits.exp() and the derivative of the exponential is the exponential itself:

dnorm_logits = dcounts * norm_logits.exp() # norm_logits.exp() is actually counts

dlogit_maxes

From norm_logits = logits - logit_maxes, where logit_maxes (shape (32, 1)) is broadcast across each row, the gradient picks up a minus sign and sums over the row:

dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)

dlogits

From the same line norm_logits = logits - logit_maxes, logits receives dnorm_logits directly; in addition, the gradient flowing through logit_maxes lands on each row's argmax position:

dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1 # try F.one_hot
dlogits += dlogit_maxes * tmp

Backprop through cross_entropy but All in One Go

Basically, what happens in the forward pass can be described by the following pseudocode:

logprobs = log(softmax(logits, 1))
loss = -mean(logprobs[range(n), Yb])

To discuss the derivative of each single element, fix a row $i$, denote $\text{logits}_{ij}$ by $l_j$, and write $y = Yb[i]$ for the correct label. For simplicity of notation, denote the softmax output by

$$p_j = \frac{e^{l_j}}{\sum_k e^{l_k}}$$

so the per-row loss is $-\log p_y$.

Note

Keep in mind that the maximum of logits (in each row) is supposed to be subtracted inside the exponentials for numerical stability. Here it is omitted because it affects neither the softmax output nor the gradient towards the loss.
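A quick sketch with hypothetical logits confirms that the shift changes nothing:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10) * 5
shifted = logits - logits.max(1, keepdim=True).values
print(torch.allclose(F.softmax(logits, 1), F.softmax(shifted, 1)))  # True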

Now apply the chain rule to derive the derivatives. If $j = y$,

$$\frac{\partial \text{loss}_i}{\partial l_y} = -\frac{1}{p_y}\cdot\frac{\partial p_y}{\partial l_y} = -\frac{1}{p_y}\, p_y (1 - p_y)$$

which yields

$$\frac{\partial \text{loss}_i}{\partial l_y} = p_y - 1$$

If $j \neq y$,

$$\frac{\partial \text{loss}_i}{\partial l_j} = -\frac{1}{p_y}\cdot\frac{\partial p_y}{\partial l_j} = -\frac{1}{p_y}\,(-p_y\, p_j)$$

which yields

$$\frac{\partial \text{loss}_i}{\partial l_j} = p_j$$

While the two cases are discussed separately, they share a common part: the softmax output itself, with 1 subtracted only at the label position. Averaging over the batch contributes the final division by n:

dlogits = F.softmax(logits, 1)
dlogits[range(n), Yb] -= 1
dlogits /= n
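As a sanity check, this fused backward can be compared against autograd through F.cross_entropy, which computes the same mean-reduced loss; the batch below is hypothetical:

import torch
import torch.nn.functional as F

n, vocab = 32, 27
logits = torch.randn(n, vocab, requires_grad=True)
Yb = torch.randint(0, vocab, (n,))

loss = F.cross_entropy(logits, Yb)
loss.backward()

with torch.no_grad():
    dlogits = F.softmax(logits, 1)
    dlogits[range(n), Yb] -= 1
    dlogits /= n
print(torch.allclose(dlogits, logits.grad))  # True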

Finish calculating the rest of the derivatives in the same fashion.

A typical training loop looks like this. One thing worth mentioning is the line optimizer.zero_grad().

Gradients in PyTorch are accumulated by default. Without zeroing them, gradients from multiple backpropagation steps would add up, leading to incorrect model updates.

In other words, zeroing gradients ensures that each backward pass calculates gradients based only on the current mini-batch.

# Training loop (assumes model, criterion, optimizer, dataloader,
# and num_epochs are defined elsewhere)
losses = []
for epoch in range(num_epochs):
    epoch_loss = 0
    for batch_X, batch_y in dataloader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    # Record average loss for the epoch
    avg_loss = epoch_loss / len(dataloader)
    losses.append(avg_loss)

These steps are part of each training loop iteration and help in adjusting the model’s parameters to improve its predictions.

  1. optimizer.zero_grad():

    • Purpose: Clears old gradients.
    • What Happens: Resets the gradients of all model parameters to zero. This prevents the accumulation of gradients from multiple backpropagations, ensuring each mini-batch is computed independently.
  2. loss.backward():

    • Purpose: Computes gradients.
    • What Happens: Performs backpropagation to compute the gradient of the loss with respect to each parameter (weight and bias). This is where the network learns, by calculating how much each parameter needs to change to minimize the loss.
  3. optimizer.step():

    • Purpose: Updates parameters.
    • What Happens: Updates the model parameters based on the calculated gradients. This step uses the gradients computed during loss.backward() to adjust the weights in an attempt to minimize the loss.
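A minimal sketch with a hypothetical scalar parameter shows the accumulation behavior that optimizer.zero_grad() guards against:

import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 2).backward()
print(w.grad)   # tensor(2.)

(w * 2).backward()
print(w.grad)   # tensor(4.)  <- accumulated, not replaced

w.grad.zero_()  # what optimizer.zero_grad() does for every parameter
(w * 2).backward()
print(w.grad)   # tensor(2.)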

Loss Functions

hinge loss
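For reference, the multiclass hinge loss used by the SVM classifier mentioned below (CS231n's formulation, with scores $s$, correct class $y_i$, and margin $\Delta$, commonly 1) is:

$$L_i = \sum_{j\neq y_i} \max(0,\, s_j - s_{y_i} + \Delta)$$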

cross-entropy loss

The cross-entropy between a “true” distribution $p$ and an estimated distribution $q$ is defined as:

$$H(p, q) = -\sum_{x} p(x)\,\log q(x)$$

Possibly confusing naming conventions

To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn’t make sense to talk about the “softmax loss”, since softmax is just the squashing function, but it is a relatively commonly used shorthand.

focal loss

readme
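For reference, focal loss (Lin et al., 2017) scales cross-entropy by a modulating factor that down-weights well-classified examples:

$$\mathrm{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)$$

where $p_t$ is the predicted probability of the true class, $\gamma \ge 0$ focuses training on hard examples ($\gamma = 0$ recovers weighted cross-entropy), and $\alpha_t$ is an optional class-balancing weight.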

ROC-AUC

Basic Concepts

Consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive or negative. There are four possible outcomes from a binary classifier:

  • True Positive (TP): the prediction is positive and the actual value is positive.
  • True Negative (TN): the prediction is negative and the actual value is negative.
  • False Positive (FP): the prediction is positive and the actual value is negative.
  • False Negative (FN): the prediction is negative and the actual value is positive.

Precision measures the accuracy of positive predictions. It is the ratio of true positive predictions to the total positive predictions (true positives + false positives). High precision indicates that fewer false positives are present.

Recall (also known as sensitivity or true positive rate) measures the ability of a model to identify all relevant instances. It is the ratio of true positive predictions to the actual number of positive cases (true positives + false negatives). High recall means the model successfully captured most of the actual positives.

F1 score is the harmonic mean of the two:

$$F_1 = \frac{2\cdot \text{precision}\cdot \text{recall}}{\text{precision} + \text{recall}}$$
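A tiny sketch with hypothetical confusion-matrix counts makes the definitions concrete:

# made-up counts from a binary classifier
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)                          # 0.8
recall = tp / (tp + fn)                             # ~0.667
f1 = 2 * precision * recall / (precision + recall)  # ~0.727
print(precision, recall, f1)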

Calculation and Interpretation

Note

The area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming positive ranks higher than negative). 1

some probabilistic interpretation

The ROC curve can be parameterized by $x(T) = \mathrm{FPR}(T)$ and $y(T) = \mathrm{TPR}(T)$.

Given a threshold parameter $T$, the instance is identified as positive if $X > T$ and negative otherwise. $X$ follows a probability density function $f_1(x)$ if the instance actually belongs to the class positive, and $f_0(x)$ otherwise. Therefore, the true positive rate is given by $\mathrm{TPR}(T) = \int_T^{\infty} f_1(x)\,dx$ and the false positive rate is given by $\mathrm{FPR}(T) = \int_T^{\infty} f_0(x)\,dx$. According to the maps $T \mapsto x(T)$ and $T \mapsto y(T)$, the area under the curve is given by:

$$\mathrm{AUC} = \int_{\infty}^{-\infty} \mathrm{TPR}(T)\,\mathrm{FPR}'(T)\,dT = \iint \mathbf{1}[x' > x]\, f_1(x')\, f_0(x)\,dx'\,dx = P(X_1 > X_0)$$

where $X_1$ is the score of a randomly drawn positive instance and $X_0$ that of a negative one.

Read more on ROC’s Wiki page and some CSDN blog post

Suppose the numbers of positive and negative samples in a batch are $M$ and $N$, respectively. With prediction scores $s$,

$$\mathrm{AUC} = \frac{\sum_{i\in \text{pos}}\sum_{j\in \text{neg}} \mathbf{1}(s_i > s_j)}{M\cdot N}$$

where $\mathbf{1}$ is a characteristic function:

$$\mathbf{1}(s_i > s_j) = \begin{cases} 1, & s_i > s_j \\ 0, & \text{otherwise} \end{cases}$$

The time complexity is $O(MN)$. Denote by $r_i$ the rank score of the $i$-th sample, i.e., $r_i = 1$ for the sample with the smallest prediction score, and $r_i = M + N$ for the sample with the largest prediction score.

For each positive sample, its rank, minus the count of positive samples at or below it, is precisely its contribution towards the count of pairs where the positive sample is picked over a negative sample (because of a higher score).

Therefore, one can first sum up all ranks from positive samples, then subtract the cases where two positive samples are paired (counting each sample with itself as well). Note that for the highest-ranked positive sample, there are $M - 1$ positive samples below it, which together with the sample itself corresponds to a subtraction of $M$; the next one contributes $M - 1$, and so on. A simple deduction leads to the final subtraction of

$$M + (M-1) + \cdots + 1 = \frac{M(M+1)}{2}$$

Finally we have

$$\mathrm{AUC} = \frac{\sum_{i\in \text{pos}} r_i - \frac{M(M+1)}{2}}{M\cdot N}$$

The time complexity is $O\big((M+N)\log(M+N)\big)$, dominated by sorting, which is noticeably better than $O(MN)$ when $M$ and $N$ are large.

Now we can sanity-check the two formulas against each other on a small batch:

  • calculate AUC with the original pairwise formula;
  • calculate AUC with the quicker rank-based formula;

as in the sketch below.
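This is a minimal sketch with made-up scores (not the post's original numbers; ties between scores are ignored) computing the AUC both ways:

import torch

scores = torch.tensor([0.8, 0.6, 0.4, 0.3, 0.2])
labels = torch.tensor([1, 0, 1, 1, 0])   # M = 3 positives, N = 2 negatives

pos, neg = scores[labels == 1], scores[labels == 0]
# original formula: compare every (positive, negative) pair, O(M*N)
auc_pairwise = (pos.unsqueeze(1) > neg.unsqueeze(0)).float().mean()

# quicker formula: rank-based, O((M+N) log(M+N))
M, N = len(pos), len(neg)
ranks = scores.argsort().argsort() + 1   # rank 1 = smallest score
pos_rank_sum = ranks[labels == 1].sum()
auc_ranks = (pos_rank_sum - M * (M + 1) / 2) / (M * N)

print(auc_pairwise.item(), auc_ranks.item())  # both 0.666...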

CUDA

Warning

FileNotFoundError: [Errno 2] No such file or directory: ':/usr/local/cuda-11.8:/usr/local/cuda-11.8/bin/nvcc'

solution

export CUDA_HOME=/usr/local/cuda

or change export CUDA_HOME=/usr/local/cuda-11.8 in ~/.bashrc to export CUDA_HOME=/usr/local/cuda.

For more detail, refer to the related GitHub issue.

Footnotes

  1. Fawcett, Tom (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874. ↩