Proof.
When applying a spatial permutation $T_\pi$ to the input $X$
of a self-attention operator $A_s$, we have
\begin{align*}
A_s(T_\pi(X)) &= W_v T_\pi(X) \cdot \mathrm{softmax}\big((W_k T_\pi(X))^T \cdot W_q T_\pi(X)\big)\\
&= W_v X P_\pi \cdot \mathrm{softmax}\big((W_k X P_\pi)^T \cdot W_q X P_\pi\big)\\
&= W_v X P_\pi \cdot \mathrm{softmax}\big(P_\pi^T (W_k X)^T \cdot W_q X P_\pi\big)\\
&= W_v X P_\pi P_\pi^T \cdot \mathrm{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi\\
&= W_v X \cdot \mathrm{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi\\
&= T_\pi(A_s(X)).
\end{align*}
Note that $P_\pi P_\pi^T = I$ since $P_\pi$ is an orthogonal matrix. It is also easy to verify that
\begin{equation*}
\mathrm{softmax}(P_\pi^T M P_\pi) = P_\pi^T\, \mathrm{softmax}(M)\, P_\pi
\end{equation*}
for any matrix $M$, since the column-wise softmax commutes with any permutation of rows and columns. Hence $A_s$ is spatial permutation equivariant.
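The equivariance above can be checked numerically. The sketch below is illustrative, not from the source: it assumes column-wise softmax (each column of the score matrix normalized over the keys), square projection matrices, and tokens stored as columns of $X$, so that $T_\pi(X) = X P_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(M):
    # Column-wise softmax: each column of M is normalized to sum to 1.
    E = np.exp(M - M.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

d, n = 4, 6                      # feature dimension d, number of spatial positions n
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((d, n))  # columns of X are the n tokens

# Random permutation matrix P_pi; X @ P permutes the columns of X.
P = np.eye(n)[:, rng.permutation(n)]

def A_s(X):
    # Self-attention: W_v X . softmax((W_k X)^T (W_q X))
    return (Wv @ X) @ softmax((Wk @ X).T @ (Wq @ X))

# Equivariance: A_s(X P_pi) == A_s(X) P_pi
lhs = A_s(X @ P)
rhs = A_s(X) @ P
print(np.allclose(lhs, rhs))  # True
```

The names (`A_s`, `Wq`, …) mirror the symbols in the derivation; the permutation matrix is built by permuting the columns of the identity.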
Similarly, when applying $T_\pi$ to the input of an attention operator $A_Q$ with a learned query $Q$,
which is independent of the input $X$, we have
\begin{align*}
A_Q(T_\pi(X)) &= W_v T_\pi(X) \cdot \mathrm{softmax}\big((W_k T_\pi(X))^T \cdot Q\big)\\
&= W_v X P_\pi \cdot \mathrm{softmax}\big(P_\pi^T (W_k X)^T \cdot Q\big)\\
&= W_v X (P_\pi P_\pi^T) \cdot \mathrm{softmax}\big((W_k X)^T \cdot Q\big)\\
&= W_v X \cdot \mathrm{softmax}\big((W_k X)^T \cdot Q\big)\\
&= A_Q(X).
\end{align*}
Hence $A_Q$ is spatial permutation invariant.
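The invariance can likewise be checked numerically. This is a minimal sketch under the same illustrative assumptions as before (column-wise softmax, tokens as columns of $X$); the learned query `Q` is drawn once and held fixed, independent of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(M):
    # Column-wise softmax: each column of M is normalized to sum to 1.
    E = np.exp(M - M.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

d, n, m = 4, 6, 3                # m learned query vectors
Wk, Wv = (rng.standard_normal((d, d)) for _ in range(2))
Q = rng.standard_normal((d, m))  # learned query, independent of the input X
X = rng.standard_normal((d, n))
P = np.eye(n)[:, rng.permutation(n)]

def A_Q(X):
    # Attention with a fixed learned query: W_v X . softmax((W_k X)^T Q)
    return (Wv @ X) @ softmax((Wk @ X).T @ Q)

# Invariance: permuting the input tokens leaves the output unchanged.
out_perm, out = A_Q(X @ P), A_Q(X)
print(np.allclose(out_perm, out))  # True
```

Because the softmax normalizes over the key (row) dimension, the row permutation $P_\pi^T$ passes through it and cancels against $P_\pi$, exactly as in the derivation.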