
Positional Encoding


Absolute Positional Encoding

Relative Positional Encoding

ALiBi

RoPE

NoPE


Definition

Consider an image or feature map $\mathbf{X}\in\mathbb{R}^{d\times n}$, where $n$ denotes the spatial dimension and $d$ the number of features. Let $\pi$ denote a permutation of $n$ elements. A transformation $T:\mathbb{R}^{d\times n}\to\mathbb{R}^{d\times n}$ is called a spatial permutation if $T(\mathbf{X})=\mathbf{X}P_{\pi}$, where $P_{\pi}\in\mathbb{R}^{n\times n}$ denotes the permutation matrix associated with $\pi$, defined as $P_{\pi}=[\mathbf{e}_{\pi(1)},\mathbf{e}_{\pi(2)},\cdots,\mathbf{e}_{\pi(n)}]$ with $\mathbf{e}_{i}$ being the one-hot vector of length $n$ whose $i$-th element is 1.
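As a sanity check on this definition, here is a minimal NumPy sketch (the shapes, seed, and variable names are illustrative assumptions, not from the original) that builds $P_{\pi}$ from a permutation $\pi$ and confirms that $T(\mathbf{X})=\mathbf{X}P_{\pi}$ simply reorders the columns of $\mathbf{X}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 5                      # number of features and spatial dimension (illustrative)
X = rng.standard_normal((d, n))  # X in R^{d x n}

pi = rng.permutation(n)          # a permutation of n elements (0-indexed)
P = np.eye(n)[:, pi]             # P_pi = [e_{pi(1)}, ..., e_{pi(n)}]: column j is e_{pi(j)}

# T(X) = X P_pi reorders the columns of X according to pi
assert np.allclose(X @ P, X[:, pi])

# P_pi is orthogonal, so P_pi^T P_pi = P_pi P_pi^T = I
assert np.allclose(P.T @ P, np.eye(n))
```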

Definition

An operator $A:\mathbb{R}^{d\times n}\to\mathbb{R}^{d\times n}$ is spatial permutation equivariant if $T_{\pi}(A(\mathbf{X}))=A(T_{\pi}(\mathbf{X}))$ for any $\mathbf{X}$ and any spatial permutation $T_{\pi}$. In addition, an operator $A:\mathbb{R}^{d\times n}\to\mathbb{R}^{d\times n}$ is spatially invariant if $A(T_{\pi}(\mathbf{X}))=A(\mathbf{X})$ for any $\mathbf{X}$ and any spatial permutation $T_{\pi}$.
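Both properties can be checked mechanically. The toy sketch below (under the same assumed shapes as before) verifies that an elementwise nonlinearity, which treats every spatial position identically, is permutation equivariant, while an operator that averages over positions is permutation invariant:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 5
X = rng.standard_normal((d, n))
P = np.eye(n)[:, rng.permutation(n)]  # a random spatial permutation matrix

# An elementwise operator acts on each position independently ...
A_eq = np.tanh
# ... so permuting the input permutes the output the same way (equivariance)
assert np.allclose(A_eq(X @ P), A_eq(X) @ P)

# An operator that pools over positions discards their order ...
def A_inv(Z):
    return np.repeat(Z.mean(axis=1, keepdims=True), Z.shape[1], axis=1)
# ... so its output is unchanged by the permutation (invariance)
assert np.allclose(A_inv(X @ P), A_inv(X))
```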

Theorem

A self-attention operator $A_{s}$ is permutation equivariant, while an attention operator with a learned query, $A_{\mathbf{Q}}$, is permutation invariant. In particular, denoting by $\mathbf{X}$ the input matrix and by $T_{\pi}$ any spatial permutation, we have

$$A_{s}(T_{\pi}(\mathbf{X}))=T_{\pi}(A_{s}(\mathbf{X})),$$

and

$$A_{\mathbf{Q}}(T_{\pi}(\mathbf{X}))=A_{\mathbf{Q}}(\mathbf{X}).$$
Proof.

When applying a spatial permutation $T_{\pi}$ to the input $\mathbf{X}$ of a self-attention operator $A_{s}$, we have

$$\begin{aligned}
A_{s}(T_{\pi}(\mathbf{X})) &= \mathbf{W}_{v}T_{\pi}(\mathbf{X})\cdot \text{softmax}\left((\mathbf{W}_{k}T_{\pi}(\mathbf{X}))^{T}\cdot\mathbf{W}_{q}T_{\pi}(\mathbf{X})\right) \\
&= \mathbf{W}_{v}\mathbf{X}P_{\pi}\cdot\text{softmax}\left((\mathbf{W}_{k}\mathbf{X}P_{\pi})^{T}\cdot\mathbf{W}_{q}\mathbf{X}P_{\pi}\right) \\
&= \mathbf{W}_{v}\mathbf{X}P_{\pi}\cdot\text{softmax}\left(P_{\pi}^{T}(\mathbf{W}_{k}\mathbf{X})^{T}\cdot\mathbf{W}_{q}\mathbf{X}P_{\pi}\right) \\
&= \mathbf{W}_{v}\mathbf{X}P_{\pi}P_{\pi}^{T}\cdot\text{softmax}\left((\mathbf{W}_{k}\mathbf{X})^{T}\cdot\mathbf{W}_{q}\mathbf{X}\right)P_{\pi} \\
&= \mathbf{W}_{v}\mathbf{X}\cdot\text{softmax}\left((\mathbf{W}_{k}\mathbf{X})^{T}\cdot\mathbf{W}_{q}\mathbf{X}\right)P_{\pi} \\
&= T_{\pi}(A_{s}(\mathbf{X})).
\end{aligned}$$

Note that $P_{\pi}P_{\pi}^{T}=P_{\pi}^{T}P_{\pi}=\mathbf{I}$ since $P_{\pi}$ is an orthogonal matrix. It is also easy to verify that

$$\text{softmax}(P_{\pi}^{T}\mathbf{M}P_{\pi})=P_{\pi}^{T}\,\text{softmax}(\mathbf{M})P_{\pi}$$

for any matrix $\mathbf{M}$: the softmax is applied column-wise, and conjugating by $P_{\pi}$ merely reorders the columns of $\mathbf{M}$ and the entries within each column, neither of which changes the softmax values. Hence $A_{s}$ is spatial permutation equivariant.
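The argument can also be verified numerically. The sketch below (weights and shapes are assumptions for illustration) implements $A_{s}$ with a column-wise softmax, checks the softmax identity above on a random matrix, and confirms $A_{s}(\mathbf{X}P_{\pi})=A_{s}(\mathbf{X})P_{\pi}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6
X = rng.standard_normal((d, n))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
P = np.eye(n)[:, rng.permutation(n)]  # random spatial permutation matrix

def softmax(M):
    """Column-wise softmax: each column of the output sums to 1."""
    E = np.exp(M - M.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def A_s(X):
    """Self-attention as defined above: Wv X . softmax((Wk X)^T Wq X)."""
    return (Wv @ X) @ softmax((Wk @ X).T @ (Wq @ X))

# The softmax identity used in the proof: conjugating by P_pi commutes with softmax
M = rng.standard_normal((n, n))
assert np.allclose(softmax(P.T @ M @ P), P.T @ softmax(M) @ P)

# Equivariance: permuting the input columns permutes the output columns
assert np.allclose(A_s(X @ P), A_s(X) @ P)
```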

Similarly, when applying $T_{\pi}$ to the input of an attention operator $A_{\mathbf{Q}}$ with a learned query $\mathbf{Q}$, which is independent of the input $\mathbf{X}$, we have

$$\begin{aligned}
A_{\mathbf{Q}}(T_{\pi}(\mathbf{X})) &= \mathbf{W}_{v}T_{\pi}(\mathbf{X})\cdot\text{softmax}\left((\mathbf{W}_{k}T_{\pi}(\mathbf{X}))^{T}\cdot \mathbf{Q}\right) \\
&= \mathbf{W}_{v}\mathbf{X}P_{\pi}\cdot\text{softmax}\left(P_{\pi}^{T}(\mathbf{W}_{k}\mathbf{X})^{T}\cdot \mathbf{Q}\right) \\
&= \mathbf{W}_{v}\mathbf{X}(P_{\pi}P_{\pi}^{T})\cdot\text{softmax}\left((\mathbf{W}_{k}\mathbf{X})^{T}\cdot \mathbf{Q}\right) \\
&= \mathbf{W}_{v}\mathbf{X}\cdot\text{softmax}\left((\mathbf{W}_{k}\mathbf{X})^{T}\cdot \mathbf{Q}\right) \\
&= A_{\mathbf{Q}}(\mathbf{X}).
\end{aligned}$$

Here, pulling $P_{\pi}^{T}$ out of the softmax again uses the column-wise softmax identity, this time with only the rows permuted: $\text{softmax}(P_{\pi}^{T}\mathbf{N})=P_{\pi}^{T}\,\text{softmax}(\mathbf{N})$ for any matrix $\mathbf{N}$. Hence $A_{\mathbf{Q}}$ is spatial permutation invariant.
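And the same kind of numerical check for the learned-query operator; here $\mathbf{Q}\in\mathbb{R}^{d\times n}$ is a fixed parameter (the shapes are again illustrative assumptions), so permuting the input leaves the output untouched:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 6
X = rng.standard_normal((d, n))
Wk, Wv = (rng.standard_normal((d, d)) for _ in range(2))
Q = rng.standard_normal((d, n))   # learned query: a parameter, independent of X
P = np.eye(n)[:, rng.permutation(n)]

def softmax(M):
    """Column-wise softmax, as in the equivariance check above."""
    E = np.exp(M - M.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def A_Q(X):
    """Attention with a learned query: Wv X . softmax((Wk X)^T Q)."""
    return (Wv @ X) @ softmax((Wk @ X).T @ Q)

# Invariance: the output does not change when the input columns are permuted
assert np.allclose(A_Q(X @ P), A_Q(X))
```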

