Dive deep (pun intended) into neural nets with some exciting learnings! Try to become a backprop ninja with Karpathy's course.
Derive Backprop Gradients step by step
dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1.0 / n
dprobs = dlogprobs / probs
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
dcounts = dprobs * counts_sum_inv
dcounts_sum = -dcounts_sum_inv * counts_sum ** -2
dcounts += dcounts_sum.broadcast_to(counts.shape)
dnorm_logits = dcounts * norm_logits.exp()  # norm_logits.exp() is actually counts
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1  # try F.one_hot
dlogits += dlogit_maxes * tmp
dh = dlogits @ W2.T
dW2 = h.T @ dlogits
db2 = dlogits.sum(0, keepdim=False)
# dhpreact = dh * (1 - torch.tanh(hpreact) ** 2)
# dhpreact = (1.0 - h ** 2) * dh  # figure out later
dhpreact = hpreact.grad.clone()
dbngain = (dhpreact * bnraw).sum(0, keepdim=True)
dbnbias = dhpreact.sum(0, keepdim=True)
dbnraw = dhpreact * bngain
dbnvar_inv = (dbnraw * bndiff).sum(0, keepdim=True)
dbndiff = dbnraw * bnvar_inv
# dbnvar = dbnvar_inv * (-0.5) * bnvar_inv ** 3
dbnvar = dbnvar_inv * (-0.5) * (bnvar + 1e-5) ** -1.5
dbndiff2 = 1.0 / (n - 1) * torch.ones_like(bndiff2) * dbnvar
dbndiff += 2 * bndiff * dbndiff2
dbnmeani = -dbndiff.sum(0, keepdim=True)
dhprebn = dbndiff.clone()
dhprebn += dbnmeani * 1.0 / n * torch.ones_like(hprebn)
dembcat = dhprebn @ W1.T
dW1 = embcat.T @ dhprebn
db1 = dhprebn.sum(0, keepdim=False)
demb = dembcat.view(emb.shape)
dC = torch.zeros_like(C)
for k in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        ix = Xb[k, j]
        dC[ix] += demb[k, j]
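Before walking through each step, it may help to sanity-check the matrix-multiplication rule used above (`dh = dlogits @ W2.T`). The following pure-Python sketch uses made-up toy matrices, not the notebook's tensors; the names `h`, `W2`, `dlogits` only mirror the walkthrough.

```python
# Toy finite-difference check of dh = dlogits @ W2.T (hypothetical numbers).
h = [[0.5, -1.0], [2.0, 0.3]]                    # (2, 2)
W2 = [[1.0, -0.5, 0.2], [0.7, 0.1, -1.1]]        # (2, 3)
dlogits = [[0.1, 0.0, -0.2], [0.3, -0.1, 0.4]]   # upstream grad, (2, 3)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def loss(hm):
    # a stand-in loss that is linear in logits with coefficients dlogits
    lg = matmul(hm, W2)
    return sum(dlogits[i][j] * lg[i][j] for i in range(2) for j in range(3))

dh = matmul(dlogits, transpose(W2))  # the rule being checked

eps = 1e-6
max_err = 0.0
for i in range(2):
    for j in range(2):
        bumped = [row[:] for row in h]
        bumped[i][j] += eps
        fd = (loss(bumped) - loss(h)) / eps
        max_err = max(max_err, abs(fd - dh[i][j]))
```

The same probe applied to `W2` instead of `h` would verify `dW2 = h.T @ dlogits`.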
dlogprobs
Notation-wise, `dlogprobs` stands for the gradient of the loss with respect to `logprobs`. To start, we have the following

$$\text{loss} = -\frac{1}{n}\sum_{i=1}^{n} \text{logprobs}_{i,\,(Y_b)_i}$$

which easily yields the gradient as follows.

$$\left(\frac{d\,\text{loss}}{d\,\text{logprobs}}\right)_{i,j} = \begin{cases}-\dfrac{1}{n}, & j = (Y_b)_i \\ 0, & \text{otherwise}\end{cases}$$
dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1.0 / n
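A quick finite-difference probe (with hypothetical toy values, not the notebook's tensors) confirms the pattern: entries picked out by `Yb` get $-1/n$, everything else gets $0$.

```python
# Toy check of dlogprobs: n = 2 rows, 3 classes, hypothetical targets Yb.
logprobs = [[-1.0, -2.0, -0.5],
            [-0.3, -1.2, -0.9]]
Yb = [0, 2]
n = len(logprobs)

def loss(lp):
    # loss = -mean of the target log-probabilities, as in the forward pass
    return -sum(lp[i][Yb[i]] for i in range(n)) / n

eps = 1e-6
grad = [[0.0] * 3 for _ in range(n)]
for i in range(n):
    for j in range(3):
        bumped = [row[:] for row in logprobs]
        bumped[i][j] += eps
        grad[i][j] = (loss(bumped) - loss(logprobs)) / eps

# grad[i][Yb[i]] is -1/n = -0.5 here; all other entries are 0
```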
dprobs
Continuing the process, we have

$$\text{logprobs}_{i,j} = \log(\text{probs}_{i,j})$$

Hence

$$\left(\frac{d\,\text{loss}}{d\,\text{probs}}\right)_{i,j} = \left(\frac{d\,\text{loss}}{d\,\text{logprobs}}\right)_{i,j}\cdot\left(\frac{d\,\text{logprobs}}{d\,\text{probs}}\right)_{i,j} = \left(\frac{d\,\text{loss}}{d\,\text{logprobs}}\right)_{i,j}\cdot\frac{1}{\text{probs}_{i,j}}$$
dprobs = dlogprobs / probs
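The $1/\text{probs}$ factor is just the derivative of $\log$; a one-line numeric check at a hypothetical toy value:

```python
import math

# d(log p)/dp should equal 1/p; probe at a made-up p.
p = 0.37
eps = 1e-7
fd = (math.log(p + eps) - math.log(p - eps)) / (2 * eps)  # central difference
analytic = 1.0 / p
```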
dcounts_sum_inv
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
Note that `probs = counts * counts_sum_inv` is in fact

$$\text{probs}_{i,j} = \text{counts}_{i,j}\cdot \text{csi}_{i}$$

where, for simplicity of notation, `counts_sum_inv` is denoted by `csi`.
Beware of the broadcasting by checking the shapes.

>>> counts.shape, counts_sum_inv.shape
(torch.Size([32, 27]), torch.Size([32, 1]))
Hence

$$\left(\frac{d\,\text{probs}}{d\,\text{csi}}\right)_{i,j} = \text{counts}_{i,j}$$

and, summing the contributions of all columns in row $i$,

$$\left(\frac{d\,\text{loss}}{d\,\text{csi}}\right)_{i} = \sum_{j}\left(\frac{d\,\text{loss}}{d\,\text{probs}}\right)_{i,j}\cdot \text{counts}_{i,j}$$
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
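To see why the row-wise sum shows up: $\text{csi}_i$ scales every column of row $i$, so nudging it collects a contribution from each column. A sketch of one row, with made-up numbers rather than the notebook's tensors:

```python
# One row of counts, a pretend upstream gradient dloss/dprobs for that row,
# and the scalar csi that scales the whole row. All numbers are hypothetical.
counts_row = [2.0, 1.0, 3.0]
dprobs_row = [0.1, -0.2, 0.4]
csi = 0.25

def downstream(csi_val):
    # loss contribution, linear in probs with coefficients dprobs_row
    return sum(d * c * csi_val for d, c in zip(dprobs_row, counts_row))

eps = 1e-6
fd = (downstream(csi + eps) - downstream(csi)) / eps
analytic = sum(d * c for d, c in zip(dprobs_row, counts_row))  # (dprobs*counts).sum(1)
```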
dcounts
Note that from `probs = counts * counts_sum_inv` one also has

$$\left(\frac{d\,\text{probs}}{d\,\text{counts}}\right)_{i,j} = \text{csi}_{i}$$

However, other than contributing to the loss through `probs`, `counts` also does so through `counts_sum` and then `counts_sum_inv`. The complete gradient `dcounts` therefore remains to be determined until that second path is accounted for (hence the `+=` in the code).
counts_sum_inv = counts_sum ** -1
dcounts_sum
Denoting `counts_sum` by `cs`, the relation above leads to

$$\left(\frac{d\,\text{csi}}{d\,\text{cs}}\right)_{i} = -\text{cs}_{i}^{-2}$$

and then

$$\left(\frac{d\,\text{loss}}{d\,\text{cs}}\right)_{i} = -\left(\frac{d\,\text{loss}}{d\,\text{csi}}\right)_{i}\cdot \text{cs}_{i}^{-2}$$
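The $-\text{cs}^{-2}$ factor is the ordinary power rule for $x^{-1}$; a quick numeric probe at a hypothetical toy value:

```python
# d(cs**-1)/d(cs) should equal -cs**-2; probe at a made-up cs.
cs = 4.0
eps = 1e-6
fd = ((cs + eps) ** -1 - (cs - eps) ** -1) / (2 * eps)  # central difference
analytic = -cs ** -2
```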
dnorm_logits
dnorm_logits = dcounts * norm_logits.exp()  # norm_logits.exp() is actually counts
dlogit_maxes
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
dlogits
dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1  # try F.one_hot
dlogits += dlogit_maxes * tmp
Backprop through cross_entropy, but all in one go
Basically, what happens in the forward pass can be described by the following pseudocode:

logprobs = log(softmax(logits, 1))
loss = -mean(logprobs[range(n), Yb])
To discuss the derivative of each single element, for every $i$ denote $(Y_b)_i$ by $y$. For simplicity of notation, denote $\text{logits}$ by $lg$.
l oss β = β n 1 β i = 1 β n β l o g p ro b s i , y β = β n 1 β i = 1 β n β log ( p ro b s i , y β ) = β n 1 β i = 1 β n β log ( β k β exp { l g i , k β } exp { l g i , y β } β ) β
Keep in mind that the maximum of the logits (in each row) is supposed to be subtracted in the numerator. Here it is omitted because it does not affect the gradient of the loss.
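That omission is safe because softmax is invariant to shifting a row by a constant; a small numeric demonstration with a made-up logits row:

```python
import math

def naive_softmax(xs):
    # deliberately no max subtraction, to show the shift changes nothing
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

row = [1.0, -2.0, 3.0]              # hypothetical logits row
shifted = [x - max(row) for x in row]
a = naive_softmax(row)
b = naive_softmax(shifted)
# a and b agree elementwise: the shifted and unshifted rows define
# the same probabilities, hence the same loss
```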
Now apply the chain rule to derive the derivatives. If $j \neq y$,

$$\left(\frac{d\,\text{loss}}{d\,lg}\right)_{i,j} = -\frac{1}{n}\cdot\frac{\sum_{k}\exp\{lg_{i,k}\}}{\exp\{lg_{i,y}\}}\cdot(-1)\cdot\frac{\exp\{lg_{i,y}\}}{\left(\sum_{k}\exp\{lg_{i,k}\}\right)^{2}}\cdot\exp\{lg_{i,j}\}$$

which yields

$$\left(\frac{d\,\text{loss}}{d\,lg}\right)_{i,j} = \frac{1}{n}\cdot\frac{\exp\{lg_{i,j}\}}{\sum_{k}\exp\{lg_{i,k}\}} = \frac{1}{n}\cdot\text{softmax}(lg_{i,\cdot})_{j}$$
If $j = y$,

$$\left(\frac{d\,\text{loss}}{d\,lg}\right)_{i,j} = -\frac{1}{n}\cdot\frac{\sum_{k}\exp\{lg_{i,k}\}}{\exp\{lg_{i,y}\}}\cdot\frac{\exp\{lg_{i,y}\}\cdot\sum_{k}\exp\{lg_{i,k}\} - \exp\{lg_{i,y}\}^{2}}{\left(\sum_{k}\exp\{lg_{i,k}\}\right)^{2}}$$

which yields

$$\left(\frac{d\,\text{loss}}{d\,lg}\right)_{i,j} = -\frac{1}{n} + \frac{1}{n}\cdot\frac{\exp\{lg_{i,y}\}}{\sum_{k}\exp\{lg_{i,k}\}} = \frac{1}{n}\cdot\left(\text{softmax}(lg_{i,\cdot})_{y} - 1\right)$$
While the two cases are discussed separately, they share a common softmax factor, which is what makes the fused implementation so compact.
dlogits = F.softmax(logits, 1)
dlogits[range(n), Yb] -= 1
dlogits /= n
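A finite-difference check of this fused formula, in pure Python with toy numbers (not the notebook's tensors):

```python
import math

# Check dlogits = (softmax(logits) - one_hot(Yb)) / n against finite
# differences of the cross-entropy loss. All values are hypothetical.
logits = [[0.5, -1.0, 2.0],
          [1.5, 0.0, -0.5]]
Yb = [2, 0]
n = len(logits)

def softmax_row(xs):
    m = max(xs)                       # subtract the row max, as discussed above
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def loss(lg):
    return -sum(math.log(softmax_row(lg[i])[Yb[i]]) for i in range(n)) / n

# analytic gradient from the fused formula
dlogits = []
for i in range(n):
    row = softmax_row(logits[i])
    row[Yb[i]] -= 1.0
    dlogits.append([v / n for v in row])

# compare against finite differences, entry by entry
eps = 1e-6
max_err = 0.0
for i in range(n):
    for j in range(3):
        bumped = [r[:] for r in logits]
        bumped[i][j] += eps
        fd = (loss(bumped) - loss(logits)) / eps
        max_err = max(max_err, abs(fd - dlogits[i][j]))
```

Note that each row of `dlogits` sums to zero: softmax sums to one, and subtracting the one-hot target removes exactly that one.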
Finish calculating the rest of the derivatives.