
Open-weight Models

GPT

Tokenizer

Some issues with tokenizers

The corpus used to train the tokenizer and the corpus used for pre-training may very well be different. Hence it can happen that some tokens in the vocabulary rarely or never appear in the pre-training data, which means their embedding vectors were seldom or never touched by backpropagation, so they never got updated or were not trained enough. When such tokens then appear in the input, the transformer essentially gets confused and doesn't know what the most probable next token is, hence all kinds of weird outputs.
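
As a rough illustration (a sketch assuming the Hugging Face transformers tokenizer API and a toy placeholder corpus), one could count how often each vocabulary id actually occurs in the training text; ids that never occur would keep their initial, essentially untrained embeddings:

    from collections import Counter
    from transformers import AutoTokenizer

    # Toy corpus standing in for the pre-training data.
    corpus = ["an example document", "another example document"]

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    counts = Counter()
    for doc in corpus:
        counts.update(tokenizer(doc)["input_ids"])

    # Vocabulary ids that never appear would keep their randomly initialized embeddings.
    unseen = [i for i in range(tokenizer.vocab_size) if counts[i] == 0]
    print(f"{len(unseen)} of {tokenizer.vocab_size} tokens never appear in this corpus")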

Reproduce GPT

To reproduce GPT, one could start from Karpathy’s tutorial and his project nanoGPT. Besides reasonable training/inference code, a substantial amount of data and the corresponding compute are required.

  • Let's build GPT: from scratch, in code, spelled out.

  • Llama

    Model Structures

    SwiGLU

    nn.SiLU applies the Sigmoid Linear Unit (SiLU) function element-wise. The SiLU function is also known as the swish function and is defined as:

    f(x) = x \cdot \text{sigmoid}(x)
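
    In Llama, SwiGLU appears in the feed-forward block: a SiLU-activated gate projection multiplies a plain up projection before the down projection. A minimal sketch, mirroring the projection names used in the Hugging Face LlamaMLP (gate_proj, up_proj, down_proj); the sizes here are toy values:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLUMLP(nn.Module):
        """Sketch of a Llama-style feed-forward block: down(silu(gate(x)) * up(x))."""

        def __init__(self, hidden_size, intermediate_size):
            super().__init__()
            self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
            self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
            self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

        def forward(self, x):
            # SwiGLU: the SiLU-activated gate modulates the linear up-projection.
            return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

    mlp = SwiGLUMLP(hidden_size=64, intermediate_size=172)
    out = mlp(torch.randn(2, 10, 64))   # (batch, seq, hidden)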

    RMSNorm

    transformers/models/llama/modeling_llama.py

    import torch
    from torch import nn

    class LlamaRMSNorm(nn.Module):
        def __init__(self, hidden_size, eps=1e-6):
            """
            LlamaRMSNorm is equivalent to T5LayerNorm
            """
            super().__init__()
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.variance_epsilon = eps

        def forward(self, hidden_states):
            input_dtype = hidden_states.dtype
            hidden_states = hidden_states.to(torch.float32)
            variance = hidden_states.pow(2).mean(-1, keepdim=True)
            hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
            return self.weight * hidden_states.to(input_dtype)

    In recent architectures, RMSNorm is typically applied before each sub-layer, a.k.a. pre-norm. One common explanation is that pre-norm makes training more stable.

    In the original RMSNorm paper, the authors argued that the re-centering step in LayerNorm yields an insignificant performance boost, while removing it speeds up the computation noticeably.
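
    Concretely, LayerNorm re-centers and then re-scales, while RMSNorm only re-scales by the root mean square (matching the code above, which divides by the square root of the mean of squares plus epsilon and multiplies by a learned weight):

    \text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \text{RMSNorm}(x) = \gamma \cdot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}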

    GQA

    Grouped-query attention (GQA) is essentially an interpolation between MHA and MQA, aiming to balance their pros and cons. Query heads are divided into groups, and the heads within each group share a single key head and value head; MHA is the special case of one query head per group, and MQA is the case of a single group.
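
    A minimal sketch of the idea (a toy implementation, not the Llama code; the shapes and the repeat_interleave trick are illustrative assumptions):

    import math
    import torch

    def grouped_query_attention(q, k, v):
        # q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d), with n_q_heads % n_kv_heads == 0
        group_size = q.shape[1] // k.shape[1]
        # every group of query heads reuses the same key/value head
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        causal = torch.triu(torch.ones(q.shape[2], k.shape[2], dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    # 8 query heads sharing 2 key/value heads (group size 4); MHA would use 8 KV heads, MQA just 1
    q = torch.randn(1, 8, 16, 64)
    k = torch.randn(1, 2, 16, 64)
    v = torch.randn(1, 2, 16, 64)
    out = grouped_query_attention(q, k, v)   # (1, 8, 16, 64)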

    Llama-2

    Llama-3

    tokenizer


    read more

    Mistral

    Mistral

    sliding window attention

    Sliding window attention (SWA) reduces the number of dot products computed, which speeds up both training and inference.

    On top of the causal mask, in which every token attends only to previous tokens (including itself), SWA restricts attention further: each token attends only to the k most recent tokens (including itself), where k is the window width.
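
    A small sketch of the resulting attention mask (a toy helper; True marks positions a query may attend to):

    import torch

    def sliding_window_mask(seq_len, window):
        i = torch.arange(seq_len).unsqueeze(1)   # query positions
        j = torch.arange(seq_len).unsqueeze(0)   # key positions
        # causal (j <= i) and within the last `window` positions (j > i - window)
        return (j <= i) & (j > i - window)

    print(sliding_window_mask(5, 3).int())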

    KV-Cache with rolling buffer cache

    The size of the KV-cache is fixed, set to the sliding-window width W for convenience: the keys and values for position i are written to slot i mod W, overwriting entries that have fallen out of the window.
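
    A minimal sketch of such a rolling buffer (a hypothetical helper; a real implementation would also track the rotated ordering and positions for the attention computation):

    import torch

    class RollingKVCache:
        def __init__(self, window, n_kv_heads, head_dim):
            self.window = window
            self.k = torch.zeros(window, n_kv_heads, head_dim)
            self.v = torch.zeros(window, n_kv_heads, head_dim)

        def update(self, pos, k_new, v_new):
            # position `pos` always lands in slot `pos % window`,
            # overwriting the entry that just left the sliding window
            slot = pos % self.window
            self.k[slot] = k_new
            self.v[slot] = v_new

        def get(self, pos):
            n = min(pos + 1, self.window)   # number of valid entries so far
            return self.k[:n], self.v[:n]

    cache = RollingKVCache(window=4, n_kv_heads=2, head_dim=8)
    for pos in range(6):   # positions 4 and 5 overwrite slots 0 and 1
        cache.update(pos, torch.randn(2, 8), torch.randn(2, 8))
    k, v = cache.get(5)    # exactly `window` = 4 cached entries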

    pre-filling and chunking

    Instead of feeding the prompt one token at a time, the prompt tokens are pre-filled into the cache all at once; if the prompt is too large, it is split into chunks of the sliding-window width.
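
    For the chunking step, a one-line sketch (token_ids and window are placeholders):

    def chunk_prompt(token_ids, window):
        # split the pre-fill prompt into chunks of the sliding-window width
        return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]

    print(chunk_prompt(list(range(10)), 4))   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]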

    Mixtral

    model structure

    Mixture of Experts

    Inefficiency during training

    For example, although large batch sizes are usually better for throughput, batch sizes in MoEs are effectively reduced as data flows through the active experts: if a batched input consists of 10 tokens, five tokens might end up in one expert, and the other five tokens might end up in five different experts, leading to uneven batch sizes and underutilization. read more

    Fine-tuning is also more difficult because MoEs overfit more easily: the frequently activated experts are exposed to more training data and hence trained more thoroughly than the rest.

    Load balancing tokens for MoEs

    As discussed before, if all our tokens are sent to just a few popular experts, training becomes inefficient. In normal MoE training, the gating network tends to converge to activating mostly the same few experts. This is self-reinforcing: the favored experts are trained faster and hence selected even more.

    To mitigate this, an auxiliary loss is added to encourage giving all experts roughly equal importance, so that all experts receive a roughly equal number of training examples. In transformers, the auxiliary loss is exposed via the aux_loss parameter.
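
    A sketch of a Switch-Transformer-style load-balancing term (assumed form num_experts · Σ_i f_i · P_i, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i; shapes and names are illustrative):

    import torch
    import torch.nn.functional as F

    def load_balancing_loss(router_logits, top_k=2):
        num_experts = router_logits.shape[-1]
        probs = F.softmax(router_logits, dim=-1)          # (tokens, experts)
        _, selected = probs.topk(top_k, dim=-1)           # chosen experts per token
        # f_i: fraction of tokens dispatched to each expert
        dispatch = F.one_hot(selected, num_experts).float().sum(dim=1)
        tokens_per_expert = dispatch.mean(dim=0)
        # P_i: mean routing probability assigned to each expert
        prob_per_expert = probs.mean(dim=0)
        return num_experts * torch.sum(tokens_per_expert * prob_per_expert)

    logits = torch.randn(32, 8)          # 32 tokens routed over 8 experts
    aux = load_balancing_loss(logits)    # added to the LM loss with a small coefficient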


    read more