Mistral

Mistral models adopt the Llama 2 architecture and most of its settings. Below are some notable add-ons.

sliding window attention

Sliding window attention (SWA) reduces the number of dot-product computations performed, which speeds up both training and inference.

On top of the causal mask, in which every token attends only to previous tokens (including itself), SWA restricts attention further: each token attends only to the W most recent tokens (including itself), where W is the window size.

A token can still be influenced by tokens outside its current window, just not directly: each stacked attention layer extends the reachable context by one window, so after k layers information can propagate across roughly k × W positions. This mirrors the receptive field in CNNs.
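To make the masking concrete, here is a minimal sketch in PyTorch (not Mistral's reference implementation); T and W are illustrative values.

```python
import torch

def sliding_window_causal_mask(T: int, W: int) -> torch.Tensor:
    """Boolean (T, T) mask: entry (i, j) is True if query i may attend to key j."""
    i = torch.arange(T).unsqueeze(1)   # query positions, shape (T, 1)
    j = torch.arange(T).unsqueeze(0)   # key positions,   shape (1, T)
    causal = j <= i                    # never attend to future tokens
    in_window = (i - j) < W            # only the W most recent tokens (incl. itself)
    return causal & in_window

mask = sliding_window_causal_mask(T=6, W=3)
print(mask.int())  # each row i has at most W ones, covering positions i-W+1 .. i
```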

KV-Cache with rolling buffer cache

The size of the KV-cache is fixed to the sliding-window width W, since no token ever attends further back than that. The cache works as a rolling buffer: the keys and values for position i are stored in slot i mod W, so entries older than W tokens are simply overwritten.
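Below is a minimal sketch of such a rolling buffer, assuming window size W and one token written at a time; shapes and names are illustrative, not Mistral's actual implementation.

```python
import torch

class RollingKVCache:
    """Fixed-size KV-cache: position i is stored in slot i % window."""
    def __init__(self, window: int, n_heads: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(window, n_heads, head_dim)
        self.v = torch.zeros(window, n_heads, head_dim)
        self.pos = 0  # number of tokens written so far

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        slot = self.pos % self.window      # overwrites the oldest entry once full
        self.k[slot] = k_new
        self.v[slot] = v_new
        self.pos += 1

    def get(self) -> tuple[torch.Tensor, torch.Tensor]:
        """Return cached keys/values in temporal order (oldest first)."""
        n = min(self.pos, self.window)
        start = self.pos % self.window if self.pos >= self.window else 0
        idx = [(start + t) % self.window for t in range(n)]
        return self.k[idx], self.v[idx]
```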

pre-filling and chunking

Since the prompt is known in advance, its tokens are not fed one at a time; instead, the cache is pre-filled with the prompt. For very long prompts, the prompt is split into chunks of the sliding-window width, and each chunk attends to the cache plus itself, as sketched below.

maybe add a figure for better and easier understanding
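Here is a minimal sketch of the chunking loop; process_chunk and cache are hypothetical stand-ins for one forward pass that attends to the cached keys/values plus the current chunk and then writes the chunk into the cache.

```python
def prefill(prompt_tokens: list[int], window: int, cache, process_chunk) -> None:
    """Pre-fill the cache with the prompt, one window-sized chunk at a time."""
    for start in range(0, len(prompt_tokens), window):
        chunk = prompt_tokens[start:start + window]  # at most `window` tokens
        # One forward pass: attend over cache + chunk, then update the cache.
        process_chunk(chunk, cache)
    # Generation then continues one token at a time from the filled cache.
```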

Mixtral

Mixtral-8x7B is regarded as a state-of-the-art open MoE model: it outperforms Llama 2 70B on many benchmarks, yet has only about 47B total parameters (the experts share the attention layers; only the feed-forward layers are replicated) and uses roughly 13B active parameters per token.

Model Structure

Sliding window attention is not used in Mixtral and its variants; the sliding-window setting is disabled in the official model config.
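As a quick check, one can load the published config with the transformers library; the sliding_window attribute name is assumed from the Transformers Mistral/Mixtral config classes.

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
print(cfg.sliding_window)  # expected: None, i.e. sliding-window attention disabled
```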

Sparse MoE