MoE in language models

Why not one expert but two?

Experiments indicate that at least two experts are needed for the model to learn the routing process, i.e., the model may never learn how to route between experts if it is trained with only one expert the whole time.
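
For intuition, here is a minimal sketch of top-2 routing with a softmax gate (the function name, shapes, and PyTorch usage are illustrative, not taken from any particular library):

```python
import torch
import torch.nn.functional as F

def top2_route(hidden_states, gate_weight):
    """Pick the top-2 experts per token and their mixture weights.

    hidden_states: (num_tokens, d_model)
    gate_weight:   (d_model, num_experts), the learned router parameters
    """
    logits = hidden_states @ gate_weight          # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)             # router probabilities
    top2_probs, top2_idx = probs.topk(2, dim=-1)  # two experts per token
    # Renormalize so the two selected experts' weights sum to 1
    top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)
    return top2_idx, top2_probs
```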

Inefficiency during training

For example, although large batch sizes are usually better for performance, the effective batch size in MoEs shrinks as data flows through the active experts: if our batched input consists of 10 tokens, five of them might end up in one expert while the other five are spread across five different experts, leading to uneven batch sizes and underutilization.1
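
A quick sketch of how that imbalance shows up in per-expert token counts (the assignment below is hard-coded for illustration, not produced by a real router):

```python
import torch

num_experts = 6
# Hypothetical top-1 routing decisions for a batch of 10 tokens
expert_idx = torch.tensor([0, 0, 0, 0, 0, 1, 2, 3, 4, 5])

tokens_per_expert = torch.bincount(expert_idx, minlength=num_experts)
print(tokens_per_expert)  # tensor([5, 1, 1, 1, 1, 1]) -- one expert sees half the batch
```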

Fine-tuning is more difficult because it is easier to hit overfitting: the more active experts are exposed to more training data and therefore end up trained more thoroughly than the rest.

Load balancing tokens for MoEs

As discussed before, if all our tokens are sent to just a few popular experts, training becomes inefficient. In normal MoE training, the gating network tends to converge to activating mostly the same few experts. This is self-reinforcing: the favored experts are trained faster and hence selected more often.

To mitigate this, an auxiliary loss is added that encourages giving all experts roughly equal importance, ensuring that all experts receive a roughly equal number of training tokens. In the transformers library, this auxiliary loss is exposed via the aux_loss parameter.
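
As a rough sketch, one common formulation of this auxiliary loss (the load-balancing loss from the Switch Transformer paper) multiplies the fraction of tokens dispatched to each expert by the mean router probability assigned to that expert; the exact form used by a given model may differ:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, num_experts):
    """Switch-Transformer-style auxiliary loss (minimal sketch).

    router_logits: (num_tokens, num_experts), raw gate outputs
    expert_idx:    (num_tokens,), index of the expert each token was sent to
    """
    probs = F.softmax(router_logits, dim=-1)

    # f_i: fraction of tokens dispatched to expert i
    one_hot = F.one_hot(expert_idx, num_experts).float()
    tokens_per_expert = one_hot.mean(dim=0)

    # P_i: mean router probability assigned to expert i
    router_prob_per_expert = probs.mean(dim=0)

    # Minimized when both distributions are uniform (1 / num_experts each)
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)
```

With perfectly uniform routing the loss bottoms out at 1; it grows as routing concentrates on a few experts. In practice it is scaled by a small coefficient and added to the language-modeling loss.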

Footnotes

  1. https://huggingface.co/blog/moe?continueFlag=a09556ebd7121bce97f7bbb8eb2598c8