Dropout is a regularization technique that randomly deactivates a fraction of neurons during each training step, forcing the network to learn redundant representations that generalize better. During training, each neuron is temporarily removed, along with all its connections, with probability p (typically 0.1 to 0.5). This prevents co-adaptation, where neurons come to rely too heavily on specific other neurons that might not be present. Because each training iteration sees a different random subset of neurons, dropout effectively trains an ensemble of thinner networks that share weights.

At inference time, all neurons are active but their outputs are scaled by (1-p) to keep activation magnitudes consistent with training. (Most modern frameworks instead use "inverted" dropout, scaling surviving activations by 1/(1-p) during training so that inference needs no adjustment.) This scaling approximates averaging predictions across the ensemble of possible dropout configurations.

Dropout was introduced by Hinton et al. in 2012 and became a standard component of deep learning. It is particularly effective for fully connected layers, where co-adaptation is most problematic. Transformers apply dropout after attention layers and within feedforward blocks, and modern architectures sometimes use DropPath (dropping entire residual connections) or attention dropout.

The dropout rate is a hyperparameter: too low provides insufficient regularization; too high destroys too much signal for learning. Dropout also slows convergence, typically requiring noticeably more training iterations, because each step updates only a thinned subnetwork; the regularization benefit usually outweighs this cost.
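The masking and scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation; the function names (`dropout_train`, `dropout_eval`) are invented for this example. It uses the "inverted" variant, scaling kept activations by 1/(1-p) during training so the inference pass is the identity:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dropout_train(x, p):
    """Training-time dropout: zero each unit with probability p.

    Inverted-dropout scaling: survivors are divided by (1-p) so the
    expected activation matches the no-dropout network, and inference
    needs no rescaling.
    """
    mask = rng.random(x.shape) >= p     # True (keep) with probability 1-p
    return x * mask / (1.0 - p)

def dropout_eval(x):
    """Inference-time dropout under inverted scaling: identity."""
    return x

# Expected activation is preserved: the mean stays near 1.0 even
# though half the units are zeroed on any given forward pass.
x = np.ones(100_000)
y = dropout_train(x, p=0.5)
print(y.mean())          # close to 1.0
print((y == 0).mean())   # close to 0.5: about half the units dropped
```

Each call draws a fresh mask, which is what makes every training step see a different thinned subnetwork.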