Before transformers
Problems with RNNs
- computation becomes very slow for long sequences (tokens are processed sequentially)
- vanishing/exploding gradients
- difficult to attend to information from distant positions in the sequence
Attention mechanism
can Q, K, V have different shapes?
Q and K must share the same feature dimension (d_k) so their dot product is defined, and K and V must have the same number of rows (one key per value). V's feature dimension (d_v) can theoretically differ. In practice, all three are usually set to the same shape.
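A minimal shape check of these constraints (the sizes below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n_q, n_kv = 4, 6        # number of query tokens vs. key/value tokens
d_k, d_v = 8, 16        # query/key depth vs. value depth

Q = rng.standard_normal((n_q, d_k))    # queries
K = rng.standard_normal((n_kv, d_k))   # keys: last dim must match Q's
V = rng.standard_normal((n_kv, d_v))   # values: row count must match K's

scores = Q @ K.T        # (n_q, n_kv): one score per query-key pair
out = scores @ V        # (n_q, d_v): V's depth may differ from d_k

print(scores.shape, out.shape)  # (4, 6) (4, 16)
```

The output keeps one row per query, with V's depth, which is why d_v is free to differ.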
The attention mechanism allows the model to attend to every token in the sequence, with a different amount of focus on each token.
scaled dot-product attention
Before applying softmax to the dot-product scores, they should be scaled by a factor of 1/√d_k to avoid vanishing gradients and slow training (large dot products push softmax into regions where its gradient is near zero): Attention(Q, K, V) = softmax(QKᵀ/√d_k)V.
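A sketch of scaled dot-product attention in numpy (shapes are arbitrary, for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # scale by 1/sqrt(d_k) before softmax
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Each row of `weights` is a probability distribution over the key positions, so the output is a convex combination of the value vectors.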
self-attention
self-attention is permutation equivariant: without positional encodings, permuting the input tokens simply permutes the outputs in the same way, so positional information must be added explicitly.
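This can be checked numerically with a bare-bones self-attention (Q = K = V = X, no learned projections, a simplification for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Q = K = V = X: no projection matrices, to keep the demo minimal
    d_k = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d_k), axis=-1) @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
perm = rng.permutation(5)

# Permuting the input rows permutes the output rows identically
assert np.allclose(self_attention(X[perm]), self_attention(X)[perm])
```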
mask interactions between two tokens by setting the attention scores to −∞ before the softmax layer.
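A sketch of a causal mask applied this way (sizes are arbitrary): entries set to −∞ become exactly 0 after softmax, so each token only attends to itself and earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d_k = 4, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_k))

scores = X @ X.T / np.sqrt(d_k)
# Causal mask: token i may not attend to tokens j > i
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf            # -inf -> exp(-inf) = 0 after softmax
weights = softmax(scores, axis=-1)

print(np.round(weights, 2))       # strictly upper-triangular entries are all 0
```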
cross-attention
In self-attention, we work with a single input sequence, while in cross-attention we mix or combine two different input sequences. In the vanilla transformer architecture, these are the sequence returned by the last/top encoder layer on the left and the input sequence being processed by the decoder part on the right. Note that the queries usually come from the decoder, while the keys and values typically come from the encoder.
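A shape-level sketch of cross-attention (lengths and depth are hypothetical): the encoder and decoder sequences may have different lengths, and the output keeps one row per decoder position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_enc, n_dec, d = 7, 3, 8        # hypothetical encoder/decoder lengths, depth

enc_out = rng.standard_normal((n_enc, d))  # output of the last encoder layer
dec_in = rng.standard_normal((n_dec, d))   # decoder-side sequence

Q = dec_in                       # queries come from the decoder
K, V = enc_out, enc_out          # keys and values come from the encoder

out = softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V
print(out.shape)                 # (3, 8): one output per decoder position
```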
causal/masked attention
Calculating Transformer Parameters
Reference