A sparse expert is an individual specialized neural-network module within a Mixture of Experts (MoE) architecture, in which only a subset of experts activates for any given input rather than every parameter participating in every computation. Each expert is typically a feedforward network with its own weights that learns to handle particular kinds of inputs or patterns.

A gating network (the router) examines each input and decides which experts should process it, typically selecting 1-4 experts out of dozens or hundreds. The 'sparse' designation means most experts remain dormant for any given token, which greatly reduces computational cost while preserving access to a massive total parameter count. This is what enables scaling to models with trillions of parameters: inference remains feasible because only a fraction of those parameters activates per forward pass.

Experts develop implicit specialization through training: some become better at code, others at scientific text, others at conversational patterns. This emergent division of labor lets the model apply relevant expertise selectively.

Training challenges include load balancing (ensuring work is distributed evenly across experts rather than collapsing onto a few overworked specialists) and the added complexity of routing decisions. Google's Switch Transformer and GLaM, along with Mistral AI's Mixtral, demonstrate sparse expert architectures that achieve strong performance with favorable compute efficiency. The approach occupies a middle ground between dense models (all parameters always active) and pure modularity (completely separate specialist models).
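The routing step described above can be sketched in a few lines. This is a minimal toy illustration, not any particular library's implementation: the `route`, `softmax`, and `experts` names are invented for this example, the experts are stand-in scalar functions rather than real feedforward networks, and the router scores are fixed instead of learned.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(x, router_scores, experts, k=2):
    """Run only the k highest-scoring experts and mix their outputs.

    Experts outside the top-k never execute, which is the source of
    the compute savings in sparse MoE layers.
    """
    probs = softmax(router_scores(x))
    topk = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the gate weights over the selected experts only.
    total = sum(probs[i] for i in topk)
    return sum((probs[i] / total) * experts[i](x) for i in topk)

# Toy setup: 4 "experts", each a different linear transform of a scalar input.
experts = [lambda x, w=w: w * x for w in (0.5, 1.0, 2.0, 3.0)]
scores = lambda x: [0.1, 2.0, 1.5, -1.0]  # fixed scores, for illustration only

y = route(4.0, scores, experts, k=2)  # only experts 1 and 2 actually compute
```

In a real MoE layer the router scores come from a small learned linear projection of the token representation, and the load-balancing concern mentioned above is typically addressed with an auxiliary loss that penalizes uneven expert usage.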