Absolute Positional Encoding

Relative Positional Encoding

ALiBi

RoPE

NoPE

Ready for some math?

A self-attention operator is permutation equivariant, while an attention operator with a learned query is permutation invariant.



Consider an image or feature map $X \in \mathbb{R}^{N \times d}$, where $N$ denotes the spatial dimension and $d$ denotes the number of features. Let $\pi$ denote a permutation of $N$ elements. A transformation $T: \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d}$ is called a spatial permutation if $T(X) = P_\pi X$, where $P_\pi$ denotes the permutation matrix associated with $\pi$, defined as

$$P_\pi = \begin{bmatrix} e_{\pi(1)} & e_{\pi(2)} & \cdots & e_{\pi(N)} \end{bmatrix}^\top,$$

with $e_i$ being a one-hot vector of length $N$ whose $i$-th element is 1.
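As a quick illustration, here is a minimal NumPy sketch (the variable names are ours, not from the original text) that builds $P_\pi$ from a permutation $\pi$ and checks that $T(X) = P_\pi X$ simply reorders the rows of $X$:

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 5, 3                      # spatial dimension N, number of features d
X = rng.normal(size=(N, d))      # feature map X in R^{N x d}
pi = rng.permutation(N)          # a permutation of N elements

# P_pi stacks the one-hot rows e_{pi(1)}^T, ..., e_{pi(N)}^T.
P = np.eye(N)[pi]

# Left-multiplying by P_pi reorders the rows of X according to pi.
assert np.allclose(P @ X, X[pi])
```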

An operator $A$ is spatial permutation equivariant if

$$A(T(X)) = T(A(X))$$

for any $X$ and any spatial permutation $T$. In addition, an operator $A$ is spatial permutation invariant if

$$A(T(X)) = A(X)$$

for any $X$ and any spatial permutation $T$.
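These definitions also suggest a simple (if only falsifying) numerical test: feed an operator random inputs and random spatial permutations and compare both sides. The helper names below are illustrative, not from the original text.

```python
import numpy as np

def _random_case(N, d, rng):
    X = rng.normal(size=(N, d))
    P = np.eye(N)[rng.permutation(N)]        # a random spatial permutation matrix
    return X, P

def is_equivariant(A, N=6, d=4, trials=10, seed=0):
    """Spot-check A(P X) == P A(X) on random inputs and permutations."""
    rng = np.random.default_rng(seed)
    return all(np.allclose(A(P @ X), P @ A(X))
               for X, P in (_random_case(N, d, rng) for _ in range(trials)))

def is_invariant(A, N=6, d=4, trials=10, seed=0):
    """Spot-check A(P X) == A(X) on random inputs and permutations."""
    rng = np.random.default_rng(seed)
    return all(np.allclose(A(P @ X), A(X))
               for X, P in (_random_case(N, d, rng) for _ in range(trials)))
```

Such spot checks can only refute a property; the argument below establishes it in general.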

A self-attention operator $A_s$ is permutation equivariant while an attention operator with a learned query $A_\mathbf{Q}$ is permutation invariant. In particular, denoting by $X$ the input matrix and by $T$ any spatial permutation, we have

$$A_s(T(X)) = T(A_s(X))$$

and

$$A_\mathbf{Q}(T(X)) = A_\mathbf{Q}(X).$$

When applying a spatial permutation $T(X) = P_\pi X$ to the input $X$ of a self-attention operator

$$A_s(X) = \operatorname{softmax}\!\left(X W_Q (X W_K)^\top\right) X W_V,$$

we have

$$A_s(T(X)) = \operatorname{softmax}\!\left(P_\pi X W_Q W_K^\top X^\top P_\pi^\top\right) P_\pi X W_V.$$

Note that $P_\pi^\top P_\pi = I$ since $P_\pi$ is an orthogonal matrix. It is also easy to verify that $\operatorname{softmax}(P_\pi M P_\pi^\top) = P_\pi \operatorname{softmax}(M)\, P_\pi^\top$ for any matrix $M$, because the row-wise softmax commutes with a simultaneous permutation of rows and columns. Hence

$$A_s(T(X)) = P_\pi \operatorname{softmax}\!\left(X W_Q W_K^\top X^\top\right) P_\pi^\top P_\pi X W_V = P_\pi A_s(X) = T(A_s(X)),$$

so $A_s$ is spatial permutation equivariant.
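The chain of equalities can be sanity-checked numerically. The sketch below uses the notation from the reconstruction above ($W_Q$, $W_K$, $W_V$ are random stand-ins for learned weights, an assumption for illustration) and verifies $A_s(P_\pi X) = P_\pi A_s(X)$ up to floating-point error:

```python
import numpy as np

def softmax(M, axis=-1):
    M = M - M.max(axis=axis, keepdims=True)   # shift for numerical stability
    E = np.exp(M)
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
N, d = 6, 4
X = rng.normal(size=(N, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

def A_s(X):
    # Self-attention: row-wise softmax of X W_Q (X W_K)^T, then values X W_V.
    return softmax(X @ W_Q @ (X @ W_K).T) @ X @ W_V

P = np.eye(N)[rng.permutation(N)]
assert np.allclose(A_s(P @ X), P @ A_s(X))    # A_s(T(X)) = T(A_s(X))
```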

Similarly, when applying $T$ to the input of an attention operator with a learned query $\mathbf{Q}$, which is independent of the input $X$,

$$A_\mathbf{Q}(X) = \operatorname{softmax}\!\left(\mathbf{Q} (X W_K)^\top\right) X W_V,$$

we have

$$A_\mathbf{Q}(T(X)) = \operatorname{softmax}\!\left(\mathbf{Q} W_K^\top X^\top P_\pi^\top\right) P_\pi X W_V = \operatorname{softmax}\!\left(\mathbf{Q} W_K^\top X^\top\right) P_\pi^\top P_\pi X W_V = A_\mathbf{Q}(X),$$

since the row-wise softmax commutes with a permutation of its columns. Hence $A_\mathbf{Q}$ is spatial permutation invariant.
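The invariance claim can be spot-checked the same way; here the learned query $\mathbf{Q}$ is a fixed random matrix standing in for learned parameters (its $m \times d$ shape is our assumption for illustration):

```python
import numpy as np

def softmax(M, axis=-1):
    M = M - M.max(axis=axis, keepdims=True)   # shift for numerical stability
    E = np.exp(M)
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
N, d, m = 6, 4, 3
X = rng.normal(size=(N, d))
W_K, W_V = (rng.normal(size=(d, d)) for _ in range(2))
Q = rng.normal(size=(m, d))                   # learned query, independent of X

def A_Q(X):
    # Attention with a learned query: row-wise softmax of Q (X W_K)^T, then X W_V.
    return softmax(Q @ (X @ W_K).T) @ X @ W_V

P = np.eye(N)[rng.permutation(N)]
assert np.allclose(A_Q(P @ X), A_Q(X))        # A_Q(T(X)) = A_Q(X)
```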
