veda.ng

Layer Normalization

Layer normalization is a technique that stabilizes neural network training by normalizing activations across the feature dimension of each individual sample, so that activations have a consistent mean and variance regardless of the input. For each sample independently, layer norm computes the mean and variance across all features, normalizes the activations to zero mean and unit variance, and then applies learned scale and shift parameters.

Unlike batch normalization, which normalizes across the batch dimension, layer normalization computes its statistics per sample. This independence from batch statistics matters for transformers: batch norm breaks down with the variable-length sequences and small batch sizes common in language modeling. By keeping activations in a consistent range throughout training, layer normalization prevents values from exploding or vanishing as they pass through many layers. Without normalization, deep networks become unstable: activations drift unpredictably, gradients become unreliable, and training diverges.

Transformers apply layer normalization before or after the attention and feedforward blocks, with the 'Pre-LN' placement (before each block) proving more stable for very deep models. RMSNorm is a simplified variant that normalizes by the root mean square alone, omitting mean centering. Layer normalization is computationally cheap, adding minimal overhead while greatly improving training stability, and it is now standard in transformer architectures.
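The per-sample computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation; the names `gamma`, `beta`, and the epsilon value are illustrative choices:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample across its feature (last) dimension."""
    # Statistics are computed per sample, over features only --
    # nothing here depends on the batch dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # Normalize to zero mean, unit variance (eps avoids division by zero),
    # then apply the learned scale (gamma) and shift (beta).
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Because each row is normalized using only its own statistics, the result is identical whether the sample arrives in a batch of 1 or 1,000, which is exactly the property batch normalization lacks.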
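RMSNorm's simplification can be sketched the same way: it divides by the root mean square of the features and skips both mean subtraction and the shift parameter (again a hedged sketch; the names are illustrative):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Normalize each sample by its root mean square, without mean centering."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```

Dropping the mean computation saves a reduction pass per layer, one reason RMSNorm appears in several recent large language models.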