A feedforward network in the transformer architecture is a simple two-layer neural network applied independently and identically to each position in the sequence, after the attention mechanism has mixed information across positions. The typical structure is: a linear projection from the hidden dimension d up to a larger intermediate dimension (often 4d), a nonlinear activation such as GELU, then a linear projection back down to d. This expand-activate-contract pattern provides nonlinearity and increases the model's capacity to learn complex functions. Every position is processed with the same feedforward weights, preserving the transformer's position-independent design: the attention mechanism handles information mixing between positions, while the feedforward network handles nonlinear transformation within each position.

Research suggests that feedforward layers store much of a model's factual knowledge: specific neurons activate for specific concepts, and editing those neurons can update what the model "knows." Because each layer contains two weight matrices of size d×4d and 4d×d, the feedforward layers hold the majority of a transformer's parameters, often around two-thirds of the total. Modern architectures experiment with larger expansion factors, gated variants such as SwiGLU that improve performance, and mixture-of-experts feedforward layers in which different experts specialize in different inputs. Understanding the feedforward network's role clarifies the division of labor in transformers.
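The expand-activate-contract pattern and the gated SwiGLU variant can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation; the dimensions, weight initializations, and function names are hypothetical, and the GELU shown is the common tanh approximation.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, as used in many transformer implementations
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feedforward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand d -> 4d, apply GELU, contract 4d -> d.

    x has shape (seq_len, d). The same weights are applied at every
    position, so each row of x is transformed independently.
    """
    return gelu(x @ W1 + b1) @ W2 + b2

def swiglu_feedforward(x, W_gate, W_up, W_down):
    """Gated (SwiGLU-style) variant: SiLU(x W_gate) elementwise-scales
    a parallel up-projection (x W_up) before projecting back down."""
    silu = lambda z: z / (1 + np.exp(-z))  # SiLU / swish activation
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

# Hypothetical sizes for illustration: hidden dim d = 8, intermediate 4d = 32
d, seq_len = 8, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
b1 = np.zeros(4 * d)
W2 = rng.standard_normal((4 * d, d)) * 0.1
b2 = np.zeros(d)

out = feedforward(x, W1, b1, W2, b2)  # out has shape (seq_len, d)
```

Because the FFN is position-independent, running it on a single row of `x` gives the same result as the corresponding row of the full output, which is easy to verify directly. The parameter count also follows from the two matrices: d*4d + 4d*d = 8d² weights per layer, which is where the "two-thirds of total parameters" figure comes from.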