GPT
Tokenizer
Some issues with tokenizers
The corpus used to train the tokenizer and the corpus used in the pre-training stage may very well be different. Hence some tokens may be rarely or never present in the pre-training data, which means the corresponding embedding vectors of those tokens were seldom or never hit by back-propagation, so they never got updated or were not trained enough. When such tokens appear in the input, the transformer essentially gets confused and does not know what the most probable next token is, hence all kinds of weird outputs.
Reproduce GPT
To reproduce GPT, one could start from Karpathy’s tutorial and his project nanoGPT. Besides reasonable training/inference code, a substantial amount of data and the corresponding compute are required.
Llama
Model Structures
SwiGLU
nn.SiLU
applies the Sigmoid Linear Unit (SiLU) function element-wise.
The SiLU function is also known as the swish function and is defined as SiLU(x) = x * sigmoid(x).
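As a concrete illustration, below is a minimal sketch of a Llama-style feed-forward block built on SwiGLU; the projection names and sizes are illustrative rather than tied to a specific checkpoint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Sketch of a Llama-style feed-forward block using SwiGLU.
    hidden_size / intermediate_size are illustrative values."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU-gated branch multiplied element-wise with a linear branch
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# usage sketch
mlp = SwiGLUMLP(hidden_size=64, intermediate_size=256)
out = mlp(torch.randn(2, 5, 64))  # (batch, seq, hidden) -> same shape
```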
RMSNorm
import torch
import torch.nn as nn


class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        # learnable per-channel scale, initialized to 1
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        # compute statistics in float32 for numerical stability
        hidden_states = hidden_states.to(torch.float32)
        # RMSNorm: divide by the root-mean-square (no mean subtraction / re-centering)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
In recent architectures, RMSNorm is typically placed before each sub-layer, a.k.a. pre-norm. One common explanation is that it improves the stability of the training process.
In the original RMSNorm paper, the authors claimed that the re-centering step in LayerNorm yields an insignificant performance boost, while removing it speeds up the computation noticeably.
GQA
Grouped query attention is essentially an interpolation between MHA and MQA, aiming to balance their pros and cons. Query heads are split into groups, and all query heads within a group share a single key/value head (see the sketch below).
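A minimal sketch of the grouping idea, with masking, RoPE, and KV-caching omitted; shapes and the helper name are illustrative.

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch (no masking, RoPE, or caching).
    q: (batch, num_q_heads, seq, head_dim)
    k, v: (batch, num_kv_heads, seq, head_dim), num_q_heads % num_kv_heads == 0
    """
    group_size = q.shape[1] // k.shape[1]
    # every group of `group_size` query heads shares one key/value head
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# usage sketch: 8 query heads sharing 2 KV heads (group size 4)
out = grouped_query_attention(torch.randn(1, 8, 10, 16),
                              torch.randn(1, 2, 10, 16),
                              torch.randn(1, 2, 10, 16))
```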
Llama-2
Llama-3
Notable changes from Llama-2
- Llama-3 uses Grouped Query Attention with 8 key-value heads for a reduced KV-cache memory footprint and better inference speed.
tokenizer
Tip
Llama-3 uses a vocabulary of 128K tokens, 100K of which come from the tiktoken tokenizer; the remaining 28K are added to better support non-English languages.
readme
Mistral
Mistral
Info
Mistral models adopt the Llama-2 architecture and many of its settings. Below are some notable add-ons.
sliding window attention
Sliding window attention (SWA) reduces the number of dot-product calculations performed, hence speeding up both training and inference.
On top of the causal mask, in which every token attends only to previous tokens (including itself), the attention span is pulled back even further: each token only attends to the most recent W tokens (including itself), where W is the window width.
Info
Stacking several SWA layers still allows information from tokens outside the current window to reach a given token indirectly, since each layer extends the effective attention span. For intuition, refer to the concept of a receptive field in CNNs.
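For concreteness, here is a small sketch of the mask SWA applies within a single layer, assuming a window width W; masking conventions vary between implementations.

```python
import torch

def sliding_window_causal_mask(seq_len, window):
    """Boolean mask sketch: position i may attend to positions j with
    i - window < j <= i (causal + sliding window). Illustrative only."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# example: 6 tokens, window of 3 -> each row has at most 3 True entries
print(sliding_window_causal_mask(6, 3))
```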
KV-Cache with rolling buffer cache
The size of the KV-cache is fixed; for convenience it is set to the sliding-window width, and entries for older positions are overwritten in a rolling fashion.
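A minimal sketch of such a rolling buffer, assuming one token is appended at a time; the shapes, names, and single-sequence layout are illustrative.

```python
import torch

class RollingKVCache:
    """Sketch of a rolling-buffer KV cache whose capacity equals the
    sliding-window width W."""
    def __init__(self, window, num_heads, head_dim, dtype=torch.float32):
        self.window = window
        self.k = torch.zeros(window, num_heads, head_dim, dtype=dtype)
        self.v = torch.zeros(window, num_heads, head_dim, dtype=dtype)
        self.pos = 0  # absolute position of the next token

    def update(self, k_t, v_t):
        # the entry for absolute position `pos` overwrites slot pos % W,
        # so only the most recent W tokens are ever kept
        slot = self.pos % self.window
        self.k[slot], self.v[slot] = k_t, v_t
        self.pos += 1
```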
pre-filling and chunking
Instead of feeding one token at a time, the prompt tokens are pre-filled into the cache all at once; if the prompt is longer than the window, it is split into chunks of the sliding-window width.
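A rough sketch of the chunking step under these assumptions (the helper name is hypothetical):

```python
def prefill_chunks(prompt_tokens, window):
    """Split the prompt into chunks of the sliding-window width.
    Each chunk would be fed to the model in one forward pass, updating the
    rolling KV cache, before decoding starts. Illustrative sketch."""
    return [prompt_tokens[i:i + window] for i in range(0, len(prompt_tokens), window)]

# example: a 10-token prompt with window width 4 -> chunks of size 4, 4, 2
print(prefill_chunks(list(range(10)), 4))
```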
Todo
maybe add a figure for better and easier understanding
Mixtral
Info
Mixtral-8x7B is regarded as a SOTA MoE model: it outperforms LLaMA-2-70B on various tasks while having a total parameter count comparable to a ~47B dense model, with only ~13B parameters active per token.
model structure
Caution
Sliding window attention is not used in Mixtral and its variants. See the official model config.
Mixture of Experts
Why not one expert but two?
Experiments indicate that at least two experts are needed for the model to learn the routing process, i.e., the model may not learn to route among experts if it is trained with only one active expert the whole time.
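For concreteness, here is a minimal sketch of top-2 gating as used in sparse MoE layers; the class and tensor shapes are illustrative, not a particular library's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2Router(nn.Module):
    """Sketch of top-2 gating: each token picks its 2 highest-scoring experts
    and their outputs are later combined with renormalized gate weights."""
    def __init__(self, hidden_size, num_experts):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x):  # x: (num_tokens, hidden_size)
        logits = self.gate(x)                         # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top2_probs, top2_idx = probs.topk(2, dim=-1)  # per-token expert choices
        top2_probs = top2_probs / top2_probs.sum(-1, keepdim=True)  # renormalize
        return top2_idx, top2_probs

# usage sketch
router = Top2Router(hidden_size=64, num_experts=8)
idx, w = router(torch.randn(10, 64))  # idx: (10, 2) expert ids, w: (10, 2) weights
```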
Inefficiency during training
For example, although large batch sizes are usually better for throughput, batch sizes in MoEs are effectively reduced as data flows through the active experts: if our batched input consists of 10 tokens, five tokens might end up in one expert while the other five end up spread across five different experts, leading to uneven batch sizes and underutilization. read more
Fine-tuning is also more difficult because it is easier to overfit: the frequently activated experts are exposed to more training data and hence are trained more thoroughly than the rest.
Load balancing tokens for MoEs
As discussed before, if all our tokens are sent to just a few popular experts, training becomes inefficient. In normal MoE training, the gating network tends to converge to activating mostly the same few experts. This is self-reinforcing, as the favored experts are trained quicker and hence get selected more often.
To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance.
This loss ensures that all experts receive a roughly equal number of training examples.
In transformers, the auxiliary loss is exposed via the aux_loss
parameter.
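A minimal sketch of such a load-balancing loss, following a Switch-Transformer-style formulation; the function name and shapes are illustrative and not the exact transformers implementation.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices, num_experts):
    """Sketch: the product of (fraction of tokens routed to each expert) and
    (mean router probability for that expert), summed over experts and scaled
    by num_experts. It is minimized when tokens are spread uniformly."""
    probs = F.softmax(router_logits, dim=-1)                   # (num_tokens, num_experts)
    # count, per expert, how often it appears among each token's top-k choices
    one_hot = F.one_hot(expert_indices, num_experts).float()   # (num_tokens, k, num_experts)
    tokens_per_expert = one_hot.sum(dim=1).mean(dim=0)         # (num_experts,)
    router_prob_per_expert = probs.mean(dim=0)                 # (num_experts,)
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)

# usage sketch: 10 tokens, 8 experts, top-2 routing
logits = torch.randn(10, 8)
_, top2 = logits.topk(2, dim=-1)
loss = load_balancing_loss(logits, top2, num_experts=8)
```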
readme
- Mixture of Experts Explained
- 全网最细致大模型MoE原理+代码手撕版 (in Chinese: a detailed walkthrough of MoE principles with hand-written code)
- makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch
DeepSeek
DeepSeek-V3
A potential issue with MoE: expert load can be imbalanced, so different experts receive different amounts of training (back-propagation). Solution: monitor each expert's load during training (e.g., the number of incoming tokens) and adjust per-expert bias terms periodically, so that the load ends up balanced.
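A minimal sketch of this bias-adjustment idea, with a hypothetical update rule and step size; it only illustrates the mechanism, not DeepSeek-V3's exact procedure.

```python
import torch

def adjust_expert_bias(expert_bias, tokens_per_expert, step_size=1e-3):
    """Sketch of bias-based (auxiliary-loss-free) load balancing: experts that
    received more tokens than average get their routing bias decreased, and
    under-loaded experts get it increased. The bias only influences expert
    *selection*, not the gate weights used to combine expert outputs."""
    avg_load = tokens_per_expert.float().mean()
    overloaded = tokens_per_expert.float() > avg_load
    direction = torch.where(overloaded,
                            -torch.ones_like(expert_bias),
                            torch.ones_like(expert_bias))
    return expert_bias + step_size * direction

# usage sketch: 8 experts, the first one is clearly overloaded
bias = torch.zeros(8)
counts = torch.tensor([500, 90, 80, 70, 60, 50, 40, 30])
bias = adjust_expert_bias(bias, counts)
```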