Proof.
When applying a spatial permutation $T_\pi$ to the input $X$
of a self-attention operator $A_s$, we have
\begin{align*}
A_s(T_\pi(X)) &= W_v T_\pi(X) \cdot \mathrm{softmax}\big((W_k T_\pi(X))^T \cdot W_q T_\pi(X)\big)\\
&= W_v X P_\pi \cdot \mathrm{softmax}\big((W_k X P_\pi)^T \cdot W_q X P_\pi\big)\\
&= W_v X P_\pi \cdot \mathrm{softmax}\big(P_\pi^T (W_k X)^T \cdot W_q X P_\pi\big)\\
&= W_v X P_\pi P_\pi^T \cdot \mathrm{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi\\
&= W_v X \cdot \mathrm{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi\\
&= T_\pi(A_s(X)).
\end{align*}
Note that $P_\pi P_\pi^T = I$ since $P_\pi$ is an orthogonal matrix. It is also easy to verify that
\begin{equation*}
\mathrm{softmax}(P_\pi^T M P_\pi) = P_\pi^T\, \mathrm{softmax}(M)\, P_\pi
\end{equation*}
for any matrix $M$, since the column-wise softmax commutes with any permutation of rows and columns. Hence $A_s$ is spatial permutation equivariant.
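The equivariance above can be checked numerically. The sketch below is illustrative, not from the source: it assumes column-wise softmax (each column of the score matrix normalized over the keys), square projection matrices, and tokens stored as columns of $X$, so that $T_\pi(X) = X P_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(M):
    # Column-wise softmax: each column of M is normalized to sum to 1.
    E = np.exp(M - M.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

d, n = 4, 6                      # feature dimension d, number of spatial positions n
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((d, n))  # columns of X are the n tokens

# Random permutation matrix P_pi; X @ P permutes the columns of X.
P = np.eye(n)[:, rng.permutation(n)]

def A_s(X):
    # Self-attention: W_v X . softmax((W_k X)^T (W_q X))
    return (Wv @ X) @ softmax((Wk @ X).T @ (Wq @ X))

# Equivariance: A_s(X P_pi) == A_s(X) P_pi
lhs = A_s(X @ P)
rhs = A_s(X) @ P
print(np.allclose(lhs, rhs))  # True
```

The names (`A_s`, `Wq`, …) mirror the symbols in the derivation; the permutation matrix is built by permuting the columns of the identity.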
Similarly, when applying $T_\pi$ to the input of an attention operator $A_Q$ with a learned query $Q$,
which is independent of the input $X$, we have
\begin{align*}
A_Q(T_\pi(X)) &= W_v T_\pi(X) \cdot \mathrm{softmax}\big((W_k T_\pi(X))^T \cdot Q\big)\\
&= W_v X P_\pi \cdot \mathrm{softmax}\big(P_\pi^T (W_k X)^T \cdot Q\big)\\
&= W_v X (P_\pi P_\pi^T) \cdot \mathrm{softmax}\big((W_k X)^T \cdot Q\big)\\
&= W_v X \cdot \mathrm{softmax}\big((W_k X)^T \cdot Q\big)\\
&= A_Q(X).
\end{align*}
Hence $A_Q$ is spatial permutation invariant.
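The invariance can likewise be checked numerically. This is a minimal sketch under the same illustrative assumptions as before (column-wise softmax, tokens as columns of $X$); the learned query `Q` is drawn once and held fixed, independent of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(M):
    # Column-wise softmax: each column of M is normalized to sum to 1.
    E = np.exp(M - M.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

d, n, m = 4, 6, 3                # m learned query vectors
Wk, Wv = (rng.standard_normal((d, d)) for _ in range(2))
Q = rng.standard_normal((d, m))  # learned query, independent of the input X
X = rng.standard_normal((d, n))
P = np.eye(n)[:, rng.permutation(n)]

def A_Q(X):
    # Attention with a fixed learned query: W_v X . softmax((W_k X)^T Q)
    return (Wv @ X) @ softmax((Wk @ X).T @ Q)

# Invariance: permuting the input tokens leaves the output unchanged.
out_perm, out = A_Q(X @ P), A_Q(X)
print(np.allclose(out_perm, out))  # True
```

Because the softmax normalizes over the key (row) dimension, the row permutation $P_\pi^T$ passes through it and cancels against $P_\pi$, exactly as in the derivation.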