Mixture of Experts (MoE) is a neural network architecture in which only a fraction of the model's parameters are active for any given input: each token is routed to the subset of 'expert' networks most relevant to it. Instead of every token passing through all parameters, a small router network decides which experts to activate. Mixtral uses an MoE architecture, and GPT-4 is widely reported to as well.

The advantage is parameter efficiency: a model can have, say, 1 trillion total parameters but only 50 billion active for any given inference, giving it the capacity of a massive model at the computational cost of a smaller one. The trade-off is memory: all expert parameters must be loaded even though only some are used for any one token, so an MoE model requires more GPU memory than a dense model with the same number of active parameters.

Training MoE models is also harder. Load balancing between experts is a persistent challenge: the router tends to over-route to a few experts and underuse the others. Still, for inference at scale, MoE is increasingly dominant because it delivers high capability at manageable compute cost.
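The routing step can be illustrated with a minimal sketch. This is a toy, NumPy-only example of top-k gating, not any specific framework's implementation; the expert count, dimensions, and function names are all hypothetical:

```python
import numpy as np

# Toy top-k MoE routing sketch (illustrative only; all sizes are hypothetical).
rng = np.random.default_rng(0)

num_experts = 8   # total experts (all must live in memory)
top_k = 2         # experts actually run per token
d_model = 16      # token embedding size

# Router: a single linear layer producing one logit per expert.
W_router = rng.normal(size=(d_model, num_experts))

# Each "expert" is stand-in for a feed-forward block; here just a linear map.
expert_weights = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
experts = [lambda x, W=W: x @ W for W in expert_weights]

def moe_forward(token):
    """Route one token to its top-k experts and mix their outputs."""
    logits = token @ W_router                 # (num_experts,)
    top = np.argsort(logits)[-top_k:]         # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over only the selected experts
    # Only the chosen experts execute; the other num_experts - top_k stay idle.
    return sum(w * experts[i](token) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape)  # (16,)
```

Per token, only `top_k / num_experts` of the expert compute runs, which is the source of MoE's efficiency, while all `num_experts` weight matrices must still be resident in memory.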