Absolute Positional Encoding
Relative Positional Encoding
Ready for some math?
A self-attention operator is permutation equivariant, while an attention operator with a learned query is permutation invariant.
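Consider an image or feature map $X \in \mathbb{R}^{s \times d}$, where $s$ denotes the spatial dimension
and $d$ denotes the number of features. Let $\pi$ denote a permutation of $s$ elements. A transformation
$T_\pi : \mathbb{R}^{s \times d} \to \mathbb{R}^{s \times d}$ is called a spatial permutation if $T_\pi(X) = P_\pi X$,
where $P_\pi \in \mathbb{R}^{s \times s}$ denotes the permutation matrix associated with $\pi$, defined as
$P_\pi = [e_{\pi(1)}, e_{\pi(2)}, \ldots, e_{\pi(s)}]^T$, with $e_i$ being
a one-hot vector of length $s$ whose $i$-th element is 1.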
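As a small worked example (the specific permutation is chosen here purely for illustration, and the row convention is an assumption: row $i$ of $P_\pi$ is taken to be $e_{\pi(i)}^T$), let $s = 3$ with $\pi(1) = 2$, $\pi(2) = 3$, $\pi(3) = 1$, and write $x_i^T$ for the $i$-th row of $X$:

```latex
P_\pi =
\begin{bmatrix} e_2^T \\ e_3^T \\ e_1^T \end{bmatrix}
=
\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix},
\qquad
T_\pi(X) = P_\pi X =
\begin{bmatrix} x_2^T \\ x_3^T \\ x_1^T \end{bmatrix}.
```

Under this convention, a spatial permutation only reorders the rows (spatial positions) of the feature map and leaves each feature vector unchanged.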
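An operator $A : \mathbb{R}^{s \times d} \to \mathbb{R}^{s \times d}$ is spatial permutation equivariant if
$T_\pi(A(X)) = A(T_\pi(X))$ for any $X$ and any spatial permutation $T_\pi$.
In addition, an operator $A$ is spatial permutation invariant if
$A(T_\pi(X)) = A(X)$ for any $X$ and any spatial permutation $T_\pi$.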
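When applying a spatial permutation $T_\pi$ to the input $X$
of a self-attention operator $A_s(X) = \mathrm{Softmax}\big((XW^Q)(XW^K)^T\big)\,XW^V$, where the softmax is applied row-wise, we have
$$
\begin{aligned}
A_s(T_\pi(X)) &= \mathrm{Softmax}\big((P_\pi X W^Q)(P_\pi X W^K)^T\big)\,P_\pi X W^V \\
&= \mathrm{Softmax}\big(P_\pi (XW^Q)(XW^K)^T P_\pi^T\big)\,P_\pi X W^V \\
&= P_\pi\,\mathrm{Softmax}\big((XW^Q)(XW^K)^T\big)\,P_\pi^T P_\pi X W^V \\
&= P_\pi A_s(X) = T_\pi(A_s(X)).
\end{aligned}
$$
Note that $P_\pi^T P_\pi = I$ since $P_\pi$ is an orthogonal matrix. It is also easy to verify that
$\mathrm{Softmax}(P_\pi M P_\pi^T) = P_\pi\,\mathrm{Softmax}(M)\,P_\pi^T$ for any matrix $M \in \mathbb{R}^{s \times s}$. Hence $A_s$ is spatial permutation equivariant.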
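The equivariance argument can also be checked numerically. The following is a minimal sketch, not taken from the original text: it assumes unscaled dot-product self-attention with randomly drawn projection matrices, uses plain Python lists to stay dependency-free, and all names (`self_attention`, `permute_rows`, the sizes, the chosen permutation) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    """Self-attention Softmax((X W_q)(X W_k)^T) X W_v, without logit scaling."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    return matmul(softmax_rows(matmul(q, transpose(k))), v)


def permute_rows(x, perm):
    """Spatial permutation: output row i is input row perm[i] (0-based)."""
    return [x[j] for j in perm]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
s, d = 5, 4                      # illustrative sizes only
x = random_matrix(s, d, rng)
w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))
perm = [2, 0, 4, 1, 3]           # an arbitrary non-trivial permutation

lhs = self_attention(permute_rows(x, perm), w_q, w_k, w_v)  # permute, then attend
rhs = permute_rows(self_attention(x, w_q, w_k, w_v), perm)  # attend, then permute
equivariance_gap = max_abs_diff(lhs, rhs)
print(equivariance_gap)
```

The two results should agree up to floating-point rounding, since permuting the input only changes the order in which the same terms are summed.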
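Similarly, when applying $T_\pi$ to the input of an attention operator $A_Q(X) = \mathrm{Softmax}\big(Q (XW^K)^T\big)\,XW^V$ with a learned query $Q$,
which is independent of the input $X$, we have
$$
\begin{aligned}
A_Q(T_\pi(X)) &= \mathrm{Softmax}\big(Q (P_\pi X W^K)^T\big)\,P_\pi X W^V \\
&= \mathrm{Softmax}\big(Q (XW^K)^T P_\pi^T\big)\,P_\pi X W^V \\
&= \mathrm{Softmax}\big(Q (XW^K)^T\big)\,P_\pi^T P_\pi X W^V \\
&= A_Q(X).
\end{aligned}
$$
Hence $A_Q$ is spatial permutation invariant.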
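The invariance of attention with a learned query can be sanity-checked in the same spirit. This is only a sketch under stated assumptions: the "learned" query is stood in for by a fixed random matrix that does not depend on the input, the logits are unscaled, and the function and variable names (`query_attention`, `n_q`, the permutation) are hypothetical rather than taken from the original text.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def query_attention(query, x, w_k, w_v):
    """Attention Softmax(Q (X W_k)^T) X W_v with an input-independent query Q."""
    k, v = matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]
    return matmul(softmax_rows(matmul(query, k_t)), v)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(1)
s, d, n_q = 6, 4, 2              # illustrative sizes only
x = random_matrix(s, d, rng)
w_k, w_v = random_matrix(d, d, rng), random_matrix(d, d, rng)
query = random_matrix(n_q, d, rng)   # stand-in for a learned query
perm = [3, 5, 0, 2, 1, 4]            # an arbitrary non-trivial permutation
x_permuted = [x[j] for j in perm]    # reorder the spatial positions

out_original = query_attention(query, x, w_k, w_v)
out_permuted = query_attention(query, x_permuted, w_k, w_v)
invariance_gap = max(
    abs(a - b)
    for row_a, row_b in zip(out_original, out_permuted)
    for a, b in zip(row_a, row_b)
)
print(invariance_gap)
```

Here the output has one row per query rather than one per spatial position, so reordering the input positions leaves it unchanged up to floating-point rounding.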