GeLU

For more detail, check GeLU.
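
As a quick reference, here is a minimal sketch of the widely used tanh approximation of GeLU; the exact definition is x * Phi(x), where Phi is the standard normal CDF.

```python
import math

def gelu(x: float) -> float:
    # Tanh approximation of GeLU(x) = x * Phi(x),
    # where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```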

RMSNorm

In recent architectures, RMSNorm is typically placed before each sub-layer, a.k.a. pre-norm. One possible explanation is that pre-norm improves the stability of the training process.
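
A schematic sketch of the two placements; sublayer stands in for attention or the feed-forward block, and norm for RMSNorm.

```python
# Pre-norm: normalize the sub-layer input; the residual path stays untouched.
def prenorm_block(x, sublayer, norm):
    return x + sublayer(norm(x))

# Post-norm (original Transformer): add the residual first, then normalize.
def postnorm_block(x, sublayer, norm):
    return norm(x + sublayer(x))
```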

In the original RMSNorm paper, the authors claimed that the re-centering step in LayerNorm yields an insignificant performance boost, while removing it speeds up the computation noticeably.
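
A minimal RMSNorm sketch in PyTorch, assuming a learned per-feature gain and an eps value chosen here only for illustration; note that, unlike LayerNorm, no mean is subtracted.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain, analogous to LayerNorm's gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the root mean square of the features; no re-centering step.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```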

GQA

GQA (Grouped Query Attention) is a balance between the multi-head attention (MHA) of the vanilla Transformer and MQA (Multi Query Attention), leveraging the advantages of both sides.

Grouped query attention is essentially an interpolation between MHA and MQA with the goal of balancing their pros and cons. Query heads are first divided into groups, and the heads within each group share a single key head and value head.

The following table is a direct comparison between the dimensions of the query, key, value and output tensors for MHA, MQA and GQA. b is the batch size, n is the sequence length, d is the embedding dimension, h is the number of attention heads, d_k is the query/key/value head dimension, h_kv is the number of key/value heads and g is the number of groups.

         MHA              MQA              GQA
Query    (b, n, h, d_k)   (b, n, h, d_k)   (b, n, h, d_k)
Key      (b, n, h, d_k)   (b, n, 1, d_k)   (b, n, g, d_k)
Value    (b, n, h, d_k)   (b, n, 1, d_k)   (b, n, g, d_k)
Output   (b, n, d)        (b, n, d)        (b, n, d)
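
A minimal sketch of the grouped attention step in PyTorch, using the symbols from the table; the (b, heads, n, d_k) tensor layout and the repeat-the-key/value-heads approach are assumptions of this sketch rather than the only possible implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q: (b, h, n, d_k); k, v: (b, g, n, d_k), with h divisible by g and
    # the h // g query heads of each group laid out consecutively.
    b, h, n, d_k = q.shape
    g = k.shape[1]
    # Broadcast each shared key/value head to the query heads in its group.
    k = k.repeat_interleave(h // g, dim=1)   # (b, h, n, d_k)
    v = v.repeat_interleave(h // g, dim=1)   # (b, h, n, d_k)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (b, h, n, n)
    return F.softmax(scores, dim=-1) @ v             # (b, h, n, d_k)
```

Setting g = h recovers MHA (every query head has its own key/value head), while g = 1 recovers MQA (all query heads share a single key/value head).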