Mixture of Experts (MoE) is a neural network architecture in which only a fraction of the model's parameters are active for any given input: each token is routed to the subset of 'expert' networks most relevant to it. Instead of every token passing through all parameters, a small router network decides which experts to activate. Mixtral uses an MoE architecture, and GPT-4 is widely reported to as well.

The advantage is parameter efficiency: a model can have, say, 1 trillion total parameters but only 50 billion active for any given inference, giving it the capacity of a massive model at the computational cost of a smaller one. The trade-off is memory: all expert parameters must be loaded even though only some are used for any one token, so an MoE model requires more GPU memory than a dense model with the same number of active parameters.

Training MoE models is also harder. Load balancing between experts is a persistent challenge: the router tends to over-route to a few experts and underuse the others. Still, for inference at scale, MoE is increasingly dominant because it delivers high capability at manageable compute cost.
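The routing step can be illustrated with a minimal sketch. This is a toy, NumPy-only example of top-k gating, not any specific framework's implementation; the expert count, dimensions, and function names are all hypothetical:

```python
import numpy as np

# Toy top-k MoE routing sketch (illustrative only; all sizes are hypothetical).
rng = np.random.default_rng(0)

num_experts = 8   # total experts (all must live in memory)
top_k = 2         # experts actually run per token
d_model = 16      # token embedding size

# Router: a single linear layer producing one logit per expert.
W_router = rng.normal(size=(d_model, num_experts))

# Each "expert" is stand-in for a feed-forward block; here just a linear map.
expert_weights = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
experts = [lambda x, W=W: x @ W for W in expert_weights]

def moe_forward(token):
    """Route one token to its top-k experts and mix their outputs."""
    logits = token @ W_router                 # (num_experts,)
    top = np.argsort(logits)[-top_k:]         # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over only the selected experts
    # Only the chosen experts execute; the other num_experts - top_k stay idle.
    return sum(w * experts[i](token) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape)  # (16,)
```

Per token, only `top_k / num_experts` of the expert compute runs, which is the source of MoE's efficiency, while all `num_experts` weight matrices must still be resident in memory.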