Before transformers

Problems with RNNs

  • computation becomes very slow for long sequences, since tokens must be processed one after another
  • vanishing/exploding gradients
  • difficult to attend to information from distant positions in the sequence

Attention mechanism

can Q, K, V have different shapes?

Q and K must have the same feature dimension d_k so their dot products are defined (and K and V must have the same number of rows, one key per value), while V can in principle have a different feature dimension d_v; the attention output then has dimension d_v. In practice, all three are usually given the same dimension.
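
A quick shape check makes this concrete. The NumPy sketch below uses made-up sizes (n_q = 3, n_kv = 5, d_k = 8, d_v = 16) to show that the dot product only requires Q and K to share d_k, while the output dimension follows V; scaling and softmax are omitted here and covered in the next section.

```python
import numpy as np

n_q, n_kv, d_k, d_v = 3, 5, 8, 16   # illustrative sizes, not from the notes

Q = np.random.randn(n_q, d_k)   # queries: (n_q, d_k)
K = np.random.randn(n_kv, d_k)  # keys:    (n_kv, d_k) -- must share d_k with Q
V = np.random.randn(n_kv, d_v)  # values:  (n_kv, d_v) -- d_v may differ

scores = Q @ K.T                # (n_q, n_kv): only possible because Q and K share d_k
output = scores @ V             # (n_q, d_v):  output dimension follows V

print(scores.shape)  # (3, 5)
print(output.shape)  # (3, 16)
```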

The attention mechanism allows the model to attend to every token in the sequence, with a different amount of focus on each token.

scaled dot-product attention

Before applying softmax to the dot-product attention scores, they should be scaled by a factor of 1/sqrt(d_k) to keep the softmax from saturating, which would otherwise cause vanishing gradients and slow training.
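
A minimal NumPy sketch of scaled dot-product attention (the function names, the helper softmax, and the optional mask argument are my own choices, not from the notes):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # scale scores by 1/sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # masked positions get zero weight
    weights = softmax(scores, axis=-1)        # one attention distribution per query
    return weights @ V                        # weighted sum of values

Q = np.random.randn(4, 8)
K = np.random.randn(6, 8)
V = np.random.randn(6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```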

self-attention

self-attention (without positional encodings) is permutation equivariant: permuting the input tokens simply permutes the output rows in the same way, which is why positional information has to be added explicitly.

interactions between two tokens can be masked by setting the corresponding attention scores to -inf before the softmax layer, so those positions receive zero weight after softmax (see the sketch below).
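
For instance (NumPy sketch with arbitrary scores, using scipy.special.softmax):

```python
import numpy as np
from scipy.special import softmax

scores = np.random.randn(4, 4)   # raw self-attention scores for 4 tokens
scores[0, 2] = -np.inf           # block token 0 from attending to token 2

weights = softmax(scores, axis=-1)
print(weights[0, 2])             # 0.0 -- the masked pair gets zero attention weight
```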

cross-attention

In self-attention, Q, K, and V all come from the same input sequence. In cross-attention, we are mixing or combining two different input sequences. In the case of the vanilla transformer architecture, these are the sequence returned by the last/top encoder layer (the left half of the diagram) and the input sequence being processed by the decoder (the right half).

Note that the queries usually come from the decoder, while the keys and values typically come from the encoder.
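
A hedged NumPy sketch of this flow, with illustrative shapes and randomly initialized projection matrices (the names enc_out, dec_in, W_q, W_k, W_v are mine, not from the notes):

```python
import numpy as np
from scipy.special import softmax

d_model = 8
enc_out = np.random.randn(10, d_model)   # encoder output: 10 source tokens
dec_in  = np.random.randn(6, d_model)    # decoder states: 6 target tokens

# Projection matrices (randomly initialized here purely for illustration).
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = dec_in @ W_q       # queries come from the decoder
K = enc_out @ W_k      # keys come from the encoder
V = enc_out @ W_v      # values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)  # (6, 10): each target token attends to the source
out = weights @ V                                       # (6, 8): one output vector per decoder token
print(out.shape)
```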

causal/masked attention
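
A sketch of building a causal (look-ahead) mask with NumPy, assuming the -inf masking trick from the self-attention section: each token may only attend to itself and earlier positions.

```python
import numpy as np
from scipy.special import softmax

n, d_k = 5, 8
X = np.random.randn(n, d_k)
scores = X @ X.T / np.sqrt(d_k)                           # self-attention scores (5, 5)

causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future tokens
scores[causal_mask] = -np.inf                             # block attention to future positions

weights = softmax(scores, axis=-1)
print(np.round(weights, 2))                               # upper triangle is all zeros
```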

Calculating Transformer Parameters

Reference