A feedforward network in the transformer architecture is a simple two-layer neural network applied independently and identically to each position in the sequence, after the attention mechanism has mixed information across positions. The typical structure is: a linear projection from the hidden dimension d up to a larger intermediate dimension (often 4d), a nonlinear activation such as GELU, then a linear projection back down to d. This expand-activate-contract pattern provides nonlinearity and increases the model's capacity to learn complex functions. Every position is processed with the same feedforward weights, preserving the transformer's position-independent design: the attention mechanism handles information mixing between positions, while the feedforward network handles nonlinear transformation within each position.

Research suggests that feedforward layers store much of a model's factual knowledge: specific neurons activate for specific concepts, and editing those neurons can update what the model "knows." Because each layer contains two weight matrices of size d×4d and 4d×d, the feedforward layers hold the majority of a transformer's parameters, often around two-thirds of the total. Modern architectures experiment with larger expansion factors, gated variants such as SwiGLU that improve performance, and mixture-of-experts feedforward layers in which different experts specialize in different inputs. Understanding the feedforward network's role clarifies the division of labor in transformers.
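The expand-activate-contract pattern and the gated SwiGLU variant can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation; the dimensions, weight initializations, and function names are hypothetical, and the GELU shown is the common tanh approximation.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, as used in many transformer implementations
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feedforward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand d -> 4d, apply GELU, contract 4d -> d.

    x has shape (seq_len, d). The same weights are applied at every
    position, so each row of x is transformed independently.
    """
    return gelu(x @ W1 + b1) @ W2 + b2

def swiglu_feedforward(x, W_gate, W_up, W_down):
    """Gated (SwiGLU-style) variant: SiLU(x W_gate) elementwise-scales
    a parallel up-projection (x W_up) before projecting back down."""
    silu = lambda z: z / (1 + np.exp(-z))  # SiLU / swish activation
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

# Hypothetical sizes for illustration: hidden dim d = 8, intermediate 4d = 32
d, seq_len = 8, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
b1 = np.zeros(4 * d)
W2 = rng.standard_normal((4 * d, d)) * 0.1
b2 = np.zeros(d)

out = feedforward(x, W1, b1, W2, b2)  # out has shape (seq_len, d)
```

Because the FFN is position-independent, running it on a single row of `x` gives the same result as the corresponding row of the full output, which is easy to verify directly. The parameter count also follows from the two matrices: d*4d + 4d*d = 8d² weights per layer, which is where the "two-thirds of total parameters" figure comes from.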