veda.ng

Batch Normalization

Batch normalization is a technique for stabilizing and accelerating neural network training by normalizing layer inputs to zero mean and unit variance across each mini-batch. As a network trains, the distribution of inputs to each layer shifts because the parameters of earlier layers keep changing, a phenomenon called internal covariate shift. This instability forces careful learning rate selection and weight initialization.

Batch normalization addresses this by explicitly normalizing inputs before each layer's activation function. During training, the mean and variance are computed from the current mini-batch, and learned scale and shift parameters let the network recover whatever distribution is optimal for that layer. At inference time, running averages of the mean and variance, accumulated during training, are used instead of batch statistics.

The benefits are substantial: networks train faster, tolerate higher learning rates, and are less sensitive to initialization. However, batch normalization creates dependencies between samples in a batch: small batches produce noisy statistics, and with a batch size of 1 there are no batch statistics to compute at all. For transformers and language models, Layer Normalization, which normalizes across features rather than across the batch, is preferred because it does not depend on batch composition. Understanding when to use batch vs. layer normalization matters across different architectures and training scenarios.
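The training/inference distinction above can be made concrete with a short sketch. This is a minimal NumPy implementation of a batch-norm forward pass, not any particular library's API; the function name, the `momentum` convention for the running averages, and the `eps` value are illustrative assumptions.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    """Batch-normalize x of shape (batch, features).

    gamma/beta are the learned scale and shift parameters;
    running_mean/running_var are updated in place during training
    and used directly at inference.
    """
    if training:
        mean = x.mean(axis=0)   # per-feature mean over the mini-batch
        var = x.var(axis=0)     # per-feature variance over the mini-batch
        # Exponential running averages for use at inference time.
        running_mean *= (1 - momentum); running_mean += momentum * mean
        running_var *= (1 - momentum); running_var += momentum * var
    else:
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # shifted, scaled inputs
gamma, beta = np.ones(8), np.zeros(8)
rm, rv = np.zeros(8), np.ones(8)
y = batchnorm_forward(x, gamma, beta, rm, rv, training=True)
# Each feature of y now has (approximately) zero mean and unit variance,
# and rm/rv have moved toward the batch statistics for later inference use.
```

With `gamma` at one and `beta` at zero the output is purely normalized; during real training those parameters are updated by gradient descent alongside the rest of the network.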
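The contrast with Layer Normalization comes down to which axis the statistics are taken over. A minimal sketch (learned scale and shift omitted for brevity):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Normalize across the feature axis of each sample independently,
    # so the result does not depend on which other samples share the batch
    # -- this is why it works for batch size 1 and for language models.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
y = layernorm(x)
# Each row is normalized on its own: both rows map to the same pattern
# even though their scales differ by 10x. Batch normalization, by
# contrast, would compute statistics down each column across samples.
```

Swapping the axis from the batch dimension (batch norm) to the feature dimension (layer norm) is the entire difference in the forward computation; the consequences for batch-size sensitivity follow from that.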