Absolute Positional Encoding

Relative Positional Encoding

ALiBi

RoPE

NoPE

Ready for some math?



Consider an image or feature map $X \in \mathbb{R}^{d \times n}$, where $n$ denotes the spatial dimension and $d$ denotes the number of features. Let $\pi$ denote a permutation of $n$ elements. A transformation $T: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is called a spatial permutation if $T(X) = X P_\pi$, where $P_\pi \in \mathbb{R}^{n \times n}$ denotes the permutation matrix associated with $\pi$, defined as $P_\pi = [e_{\pi(1)}, e_{\pi(2)}, \cdots, e_{\pi(n)}]$ with $e_i$ being the one-hot vector of length $n$ whose $i$-th element is 1.
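As a quick illustration, here is a minimal NumPy sketch of this definition; the shapes, random seed, and variable names are my own choices rather than anything from the text. It builds $P_\pi$ column by column and confirms that $T(X) = XP_\pi$ simply reorders the columns (spatial positions) of $X$.

```python
# Sketch only: constructing P_pi and applying the spatial permutation T(X) = X P_pi.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6                       # number of features, spatial dimension
X = rng.normal(size=(d, n))       # feature map X in R^{d x n}

pi = rng.permutation(n)           # a permutation of n elements
P = np.eye(n)[:, pi]              # P_pi = [e_pi(1), e_pi(2), ..., e_pi(n)]

# T(X) = X P_pi reorders the columns (spatial positions) of X.
assert np.allclose(X @ P, X[:, pi])
# P_pi is orthogonal: P_pi^T P_pi = I.
assert np.allclose(P.T @ P, np.eye(n))
```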

Definition

An operator $A: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is spatial permutation equivariant if $T_\pi(A(X)) = A(T_\pi(X))$ for any $X$ and any spatial permutation $T_\pi$. In addition, an operator $A: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is spatial permutation invariant if $A(T_\pi(X)) = A(X)$ for any $X$ and any spatial permutation $T_\pi$.
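To make the two definitions concrete, the following sketch checks them on two toy operators of my own choosing (they do not appear in the text): an elementwise nonlinearity, which is spatial permutation equivariant, and spatial mean pooling broadcast back to every position, which is spatial permutation invariant.

```python
# Sketch only: toy operators illustrating equivariance vs. invariance.
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 6
X = rng.normal(size=(d, n))
pi = rng.permutation(n)
P = np.eye(n)[:, pi]
T = lambda Z: Z @ P                                # spatial permutation T_pi

relu = lambda Z: np.maximum(Z, 0.0)                # acts on each position independently
pool = lambda Z: np.repeat(Z.mean(axis=1, keepdims=True), Z.shape[1], axis=1)

assert np.allclose(relu(T(X)), T(relu(X)))         # equivariant: A(T(X)) = T(A(X))
assert np.allclose(pool(T(X)), pool(X))            # invariant:   A(T(X)) = A(X)
```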

Claim

A self-attention operator $A_s$ is permutation equivariant, while an attention operator with a learned query $A_Q$ is permutation invariant. In particular, denoting by $X$ the input matrix and by $T_\pi$ any spatial permutation, we have

ðī𝑠(𝑇𝜋(𝑋))=𝑇𝜋(ðī𝑠(𝑋)),

and

ðī𝑄(𝑇𝜋(𝑋))=ðī𝑄(𝑋).

When applying a spatial permutation $T_\pi$ to the input $X$ of a self-attention operator $A_s$, we have

ðī𝑠(𝑇𝜋(𝑋))=𝑊ð‘Ģ𝑇𝜋(𝑋)⋅softmax((𝑊𝑘𝑇𝜋(𝑋))𝑇⋅𝑊ð‘Ģ𝑇𝜋(𝑋))=𝑊ð‘Ģ𝑋𝑃𝜋⋅softmax((𝑊𝑘𝑋𝑃𝜋)𝑇⋅𝑊𝑞𝑋𝑃𝜋)=𝑊ð‘Ģ𝑋𝑃𝜋⋅softmax𝑃𝑇(𝑊𝑘𝑋)𝑇𝜋⋅𝑊𝑞𝑋𝑃𝜋)=𝑊ð‘Ģ𝑋𝑃𝜋𝑃𝑇𝜋⋅softmax((𝑊𝑘𝑋)𝑇⋅𝑊𝑞𝑋)𝑃𝜋=𝑊ð‘Ģ𝑋⋅softmax((𝑊𝑘𝑋)𝑇⋅𝑊𝑞𝑋)𝑃𝜋=𝑇𝜋(ðī𝑠(𝑋)).

Note that $P_\pi^T P_\pi = I$ since $P_\pi$ is an orthogonal matrix. It is also easy to verify that

$$\mathrm{softmax}(P_\pi^T M P_\pi) = P_\pi^T \, \mathrm{softmax}(M) \, P_\pi$$

for any matrix $M$; the same argument gives $\mathrm{softmax}(P_\pi^T M) = P_\pi^T \, \mathrm{softmax}(M)$ when the softmax is applied column-wise, i.e. over keys. Hence $A_s$ is spatial permutation equivariant. Similarly, when applying $T_\pi$ to the input of an attention operator $A_Q$ with a learned query $Q$, which is independent of the input $X$, we have

ðī𝑄(𝑇𝜋(𝑋))=𝑊ð‘Ģ𝑇𝜋(𝑋)⋅softmax((𝑊𝑘𝑇𝜋(𝑋))𝑇⋅𝑄)=𝑊ð‘Ģ𝑋(𝑃𝜋𝑃𝑇𝜋)⋅softmax((𝑊𝑘𝑋)𝑇⋅𝑄)=𝑊ð‘Ģ𝑋⋅softmax((𝑊𝑘𝑋)𝑇⋅𝑄)=ðī𝑄(𝑋).

Hence $A_Q$ is spatial permutation invariant.
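The same kind of numerical check covers the remaining two ingredients: the softmax identity used above and the invariance of $A_Q$. As before, the column-wise softmax, the shapes, and the randomly drawn $Q$ (standing in for a learned query) are illustrative assumptions on my part.

```python
# Sketch only: checking softmax(P^T M P) = P^T softmax(M) P and A_Q(X P_pi) = A_Q(X).
import numpy as np

def softmax_cols(M):
    E = np.exp(M - M.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(3)
d, d_h, n, n_q = 4, 3, 6, 2
X = rng.normal(size=(d, n))
Wk = rng.normal(size=(d_h, d))
Wv = rng.normal(size=(d_h, d))
Q = rng.normal(size=(d_h, n_q))          # stands in for a learned query, independent of X

pi = rng.permutation(n)
P = np.eye(n)[:, pi]
M = rng.normal(size=(n, n))

# softmax(P_pi^T M P_pi) = P_pi^T softmax(M) P_pi for any matrix M.
assert np.allclose(softmax_cols(P.T @ M @ P), P.T @ softmax_cols(M) @ P)

def A_Q(Z):
    # A_Q(Z) = W_v Z softmax((W_k Z)^T Q)
    return (Wv @ Z) @ softmax_cols((Wk @ Z).T @ Q)

assert np.allclose(A_Q(X @ P), A_Q(X))   # spatial permutation invariance
```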
