veda.ng

Batch Normalization

Batch normalization is a technique for stabilizing and accelerating neural network training by normalizing layer inputs to zero mean and unit variance across each mini-batch. As a network trains, the distribution of inputs to each layer shifts because the parameters of earlier layers keep changing, a phenomenon called internal covariate shift. This instability forces careful learning rate selection and weight initialization.

Batch normalization addresses this by explicitly normalizing inputs before each layer's activation function. During training, the mean and variance are computed from the current mini-batch, and learned scale and shift parameters let the network recover whatever distribution is optimal for that layer. At inference time, running averages of the mean and variance, accumulated during training, are used instead of batch statistics.

The benefits are substantial: networks train faster, tolerate higher learning rates, and are less sensitive to initialization. However, batch normalization creates dependencies between samples in a batch: small batches produce noisy statistics, and with a batch size of 1 there are no batch statistics to compute at all. For transformers and language models, Layer Normalization, which normalizes across features rather than across the batch, is preferred because it does not depend on batch composition. Understanding when to use batch vs. layer normalization matters across different architectures and training scenarios.
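The training/inference distinction above can be made concrete with a short sketch. This is a minimal NumPy implementation of a batch-norm forward pass, not any particular library's API; the function name, the `momentum` convention for the running averages, and the `eps` value are illustrative assumptions.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    """Batch-normalize x of shape (batch, features).

    gamma/beta are the learned scale and shift parameters;
    running_mean/running_var are updated in place during training
    and used directly at inference.
    """
    if training:
        mean = x.mean(axis=0)   # per-feature mean over the mini-batch
        var = x.var(axis=0)     # per-feature variance over the mini-batch
        # Exponential running averages for use at inference time.
        running_mean *= (1 - momentum); running_mean += momentum * mean
        running_var *= (1 - momentum); running_var += momentum * var
    else:
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # shifted, scaled inputs
gamma, beta = np.ones(8), np.zeros(8)
rm, rv = np.zeros(8), np.ones(8)
y = batchnorm_forward(x, gamma, beta, rm, rv, training=True)
# Each feature of y now has (approximately) zero mean and unit variance,
# and rm/rv have moved toward the batch statistics for later inference use.
```

With `gamma` at one and `beta` at zero the output is purely normalized; during real training those parameters are updated by gradient descent alongside the rest of the network.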
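The contrast with Layer Normalization comes down to which axis the statistics are taken over. A minimal sketch (learned scale and shift omitted for brevity):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Normalize across the feature axis of each sample independently,
    # so the result does not depend on which other samples share the batch
    # -- this is why it works for batch size 1 and for language models.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
y = layernorm(x)
# Each row is normalized on its own: both rows map to the same pattern
# even though their scales differ by 10x. Batch normalization, by
# contrast, would compute statistics down each column across samples.
```

Swapping the axis from the batch dimension (batch norm) to the feature dimension (layer norm) is the entire difference in the forward computation; the consequences for batch-size sensitivity follow from that.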