GPT

Tokenizer

Some issues with tokenizers

The corpus used to train the tokenizer and the one used in the pre-training stage may very well be different. Hence it could be the case that some tokens were rarely or never present in the pre-training data, which means the corresponding embedding vectors of these tokens were seldom/never hit by back-propagation, so they never got updated or were not trained enough. Then, when these tokens appear in the input, the transformer essentially gets confused and does not know what the most probable next token is, hence all kinds of weird outputs.
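As a toy illustration (not from the original notes), the sketch below shows that an embedding row whose token id never appears in the training batches receives a zero gradient, so it stays at its random initialization:

import torch
import torch.nn as nn

vocab_size, dim = 10, 4
emb = nn.Embedding(vocab_size, dim)

# Pretend token id 9 exists in the tokenizer vocabulary but never occurs in the corpus.
batch = torch.tensor([[1, 2, 3], [4, 5, 6]])

emb(batch).sum().backward()

print(emb.weight.grad[1])  # non-zero: token 1 was seen, its embedding gets updated
print(emb.weight.grad[9])  # all zeros: token 9 was never seen, its embedding is never trained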

Reproduce GPT

To reproduce GPT, one could start from Karpathy's tutorial and his project nanoGPT. Besides reasonable training/inference code, a substantial amount of data and the corresponding compute are required.

Llama

Model Structures

SwiGLU

nn.SiLU applies the Sigmoid Linear Unit (SiLU) function element-wise. The SiLU function is also known as the swish function and is defined as silu(x) = x * σ(x), where σ(x) is the logistic sigmoid.
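In Llama, SiLU is used inside a gated feed-forward block (SwiGLU). Below is a minimal sketch of such a block; the projection names (gate_proj, up_proj, down_proj) follow the Hugging Face LlamaMLP, but the class itself is illustrative rather than a verbatim copy:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Llama-style feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: the SiLU-activated gate multiplies the linear "up" branch element-wise
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))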

RMSNorm

transformers/models/llama/modeling_llama.py
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps
 
    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        # upcast to float32 so the statistics are computed in full precision
        hidden_states = hidden_states.to(torch.float32)
        # mean of squares over the hidden dimension; note there is no mean subtraction
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        # scale by the reciprocal root-mean-square, then apply the learned gain
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)

In recent architectures, RMSNorm is typically placed before a sub-layer, a.k.a. pre-norm. One common explanation is that it improves the stability of the training process.

In the original RMSNorm paper, the authors claimed that the re-centering step in layer norm yields an insignificant performance boost, while removing it speeds up the computation noticeably.
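For reference, the two normalizations side by side (my notation, not from the paper):

\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
\qquad
\mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\tfrac{1}{n}\sum_i x_i^2 + \epsilon}}

RMSNorm drops the re-centering term (the subtraction of \mu) and the bias \beta, keeping only the re-scaling, which matches the code above.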

GQA

Grouped query attention (GQA) is essentially an interpolation between MHA and MQA with the goal of balancing their pros and cons. Query heads are split into groups, and the heads within each group share a single key head and value head, as sketched below.
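A minimal sketch of the key/value sharing, loosely modeled on the repeat_kv helper in the Hugging Face implementation (shapes and names here are illustrative):

import torch

def repeat_kv(x, n_rep):
    """Expand KV heads so each group of query heads sees its shared K/V head.
    (batch, num_kv_heads, seq, head_dim) -> (batch, num_kv_heads * n_rep, seq, head_dim)"""
    b, kv_heads, s, d = x.shape
    return x[:, :, None].expand(b, kv_heads, n_rep, s, d).reshape(b, kv_heads * n_rep, s, d)

# Example: 32 query heads grouped over 8 KV heads -> groups of 4 query heads share one K/V head.
num_heads, num_kv_heads, head_dim = 32, 8, 128
q = torch.randn(1, num_heads, 16, head_dim)
k = torch.randn(1, num_kv_heads, 16, head_dim)
v = torch.randn(1, num_kv_heads, 16, head_dim)

k, v = repeat_kv(k, num_heads // num_kv_heads), repeat_kv(v, num_heads // num_kv_heads)
# shape check only, no causal mask applied here
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1) @ v  # (1, 32, 16, 128)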

Llama-2

Llama-3

Notable changes from Llama-2

  • Llama-3 uses Grouped Query Attention with 8 key-value heads for a reduced KV-cache memory footprint and better inference speed.

tokenizer

Tip

Llama-3 uses a vocabulary of 128K tokens, 100K of which come from the tiktoken tokenizer; the remaining 28K are added to better support non-English languages.


Mistral

Mistral

Info

Mistral models adopt the Llama-2 architecture and many of its settings. Below are some notable add-ons.

sliding window attention

Sliding window attention (SWA) reduces the number of dot-product calculations performed, hence speeding up both training and inference.

On top of the causal mask, in which every token attends only to previous tokens (including itself), SWA pulls attention back even further: each token only attends to the most recent W tokens (including itself), where W is the window size.

Info

With sliding window attention, information from tokens outside the current window can still reach a token indirectly, propagated across the stacked layers. For intuition, refer to the concept of the receptive field in CNNs.
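A minimal sketch of the mask (assuming the window counts the current token):

import torch

def sliding_window_causal_mask(seq_len, window):
    """Boolean mask: position i may attend to j iff j <= i and i - j < window."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

print(sliding_window_causal_mask(5, 3).int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [0, 1, 1, 1, 0],
#         [0, 0, 1, 1, 1]])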

KV-Cache with rolling buffer cache

The size of the KV-cache is fixed, set to the sliding-window width for convenience; entries for positions that have fallen out of the window are simply overwritten (a rolling buffer), as sketched below.
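A minimal sketch of a rolling buffer, assuming one token is decoded at a time (class and method names are mine):

import torch

class RollingKVCache:
    """Fixed-size KV cache: position t is written to slot t % window,
    overwriting the entry that has just fallen out of the attention window."""
    def __init__(self, window, num_kv_heads, head_dim):
        self.window = window
        self.k = torch.zeros(window, num_kv_heads, head_dim)
        self.v = torch.zeros(window, num_kv_heads, head_dim)

    def update(self, t, k_t, v_t):
        self.k[t % self.window] = k_t
        self.v[t % self.window] = v_t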

pre-filling and chunking

Instead of feeding one token at a time, the prompt tokens are pre-filled all at once; if the prompt is longer than the window, it is split into chunks of the sliding-window width (see the sketch after the note below).

Todo

maybe add a figure for better and easier understanding
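A rough sketch of chunked pre-fill, assuming a hypothetical forward_fn that runs the model on a chunk of token ids and updates the KV cache:

def prefill_in_chunks(prompt_ids, window, forward_fn, cache):
    # forward_fn(chunk, cache) is assumed to return the logits for the chunk
    # while writing the chunk's keys/values into the cache.
    logits = None
    for start in range(0, len(prompt_ids), window):
        chunk = prompt_ids[start:start + window]
        logits = forward_fn(chunk, cache)
    # the logits of the last prompt position are used to sample the first generated token
    return logits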

Mixtral

Info

Mixtral-8x7B is regarded as a SOTA MoE model which outperforms Llama-2-70B on various tasks, while being roughly equivalent to a 40B model in terms of total parameter size (only about 13B parameters are active per token).

model structure

Caution

Sliding window attention is not used for Mixtral and its variants. See the official model config.

Mixture of Experts

Why not one expert but two?

Experiments indicate that at least two experts are required for the model to learn the routing process, i.e., the model may not be able to learn how to route between experts if it is trained with only a single active expert the whole time.
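A minimal sketch of top-2 routing (illustrative only, not the Mixtral implementation; Mixtral's renormalization details differ slightly):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2Router(nn.Module):
    """Each token picks its 2 highest-scoring experts; their outputs are mixed
    with the renormalized gate weights."""
    def __init__(self, hidden_size, num_experts):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x, experts):
        logits = self.gate(x)                            # (tokens, num_experts)
        weights, idx = torch.topk(logits, k=2, dim=-1)   # top-2 experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen two
        out = torch.zeros_like(x)
        for rank in range(2):
            for e, expert in enumerate(experts):
                mask = idx[:, rank] == e                 # tokens whose rank-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, rank, None] * expert(x[mask])
        return out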

Inefficiency during training

Although large batch sizes are usually better for performance, batch sizes in MoEs are effectively reduced as data flows through the active experts. For example, if a batched input consists of 10 tokens, five tokens might end up in one expert while the other five end up spread across five different experts, leading to uneven per-expert batch sizes and underutilization.

Fine-tuning is more difficult because it is easier to hit overfitting: frequently activated experts are exposed to more training data and hence are trained more thoroughly than rarely used ones.

Load balancing tokens for MoEs

As discussed before, if all tokens are sent to just a few popular experts, training becomes inefficient. In normal MoE training, the gating network tends to converge to activating mostly the same few experts. This is self-reinforcing, as the favored experts are trained more quickly and hence selected even more.

To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples. In transformers, the auxiliary loss is exposed via the aux_loss parameter.
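A sketch roughly mirroring the Switch-Transformers-style auxiliary loss used for Mixtral-like models (function name and exact weighting are mine):

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, num_experts, top_k=2):
    """Encourage uniform routing: penalize the product of (fraction of token slots
    sent to each expert) and (mean gate probability per expert)."""
    probs = F.softmax(router_logits, dim=-1)                 # (tokens, num_experts)
    _, selected = torch.topk(probs, top_k, dim=-1)           # experts chosen per token
    expert_mask = F.one_hot(selected, num_experts).float()   # (tokens, top_k, num_experts)
    tokens_per_expert = expert_mask.mean(dim=(0, 1))         # fraction of slots per expert
    prob_per_expert = probs.mean(dim=0)                      # mean gate prob per expert
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)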


DeepSeek

DeepSeek-V3

A potential issue with MoE: expert load is unbalanced, so the experts are trained (via back-propagation) to different degrees. Solution: during training, monitor each expert's load (e.g., the number of incoming tokens) and adjust per-expert bias terms every so often, so that the load ends up balanced.
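A minimal sketch of this bias-adjustment idea (the update rate and function name are assumptions, not from the DeepSeek report). The bias is only added to the router scores when selecting the top-k experts, not when weighting their outputs:

import torch

def update_expert_bias(bias, tokens_per_expert, update_rate=1e-3):
    """Auxiliary-loss-free balancing sketch: experts that received more tokens than
    average get their routing bias lowered; under-loaded experts get it raised."""
    avg = tokens_per_expert.float().mean()
    bias += update_rate * torch.sign(avg - tokens_per_expert.float())
    return bias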