Multi-query attention (MQA), which uses only a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and it may be undesirable to train a separate model solely for faster inference. In the original paper, the authors introduce grouped-query attention (GQA), a generalization of MQA that uses an intermediate number of key-value heads (more than one, but fewer than the number of query heads).
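The idea can be sketched in a few lines: each group of query heads shares one key-value head, with MQA and standard multi-head attention as the two extremes. Below is a minimal NumPy sketch; the function name, argument shapes, and batching convention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def grouped_query_attention(q, k, v, num_kv_heads):
    """Illustrative GQA sketch (not the paper's code).

    q: (num_q_heads, seq_len, head_dim)
    k, v: (num_kv_heads, seq_len, head_dim)
    num_kv_heads == 1 recovers MQA; num_kv_heads == num_q_heads
    recovers standard multi-head attention.
    """
    num_q_heads, seq_len, head_dim = q.shape
    group_size = num_q_heads // num_kv_heads
    # Broadcast each K/V head to the query heads in its group.
    k_rep = np.repeat(k, group_size, axis=0)  # (num_q_heads, seq_len, head_dim)
    v_rep = np.repeat(v, group_size, axis=0)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_rep  # (num_q_heads, seq_len, head_dim)
```

The inference speedup comes from the smaller key-value cache: only `num_kv_heads` K/V tensors are stored per layer instead of one per query head.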

Reference