
Activation Function

An activation function is a mathematical transformation applied to each neuron's output in a neural network, introducing the non-linearity essential for learning complex patterns. Without activation functions, any stack of linear layers collapses into a single linear transformation: no matter how many layers you add, the network remains a linear model. Non-linear activations enable networks to approximate arbitrary functions, forming the theoretical basis for deep learning's power.

ReLU (Rectified Linear Unit) returns max(0, x), zeroing negative values while passing positive values through unchanged. Its simplicity and effectiveness made it the default choice for most deep learning. However, ReLU suffers from the "dying ReLU" problem: neurons whose inputs stay negative output zero, receive zero gradient, and stop learning. Variants such as Leaky ReLU (a small fixed negative slope instead of zero) and Parametric ReLU (a learned negative slope) address this.

Modern transformers typically use GELU (Gaussian Error Linear Unit) or SiLU/Swish, which have smooth gradients and often improve performance. Many recent architectures replace the standard transformer feedforward layer with SwiGLU (Swish-Gated Linear Unit), which gates one linear projection of the input with a SiLU-activated second projection.

Activation choice affects training dynamics, gradient flow, and final model quality. The right activation depends on architecture, task, and scale: what works for small models may not be optimal for large ones.
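The activations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the GELU shown here is the common tanh approximation, and the `swiglu` function assumes hypothetical projection matrices `W` and `V` standing in for the two linear layers of a gated feedforward block.

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x) -- zeroes negatives, passes positives unchanged
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: small negative slope alpha instead of a hard zero,
    # so negative inputs still carry gradient
    return np.where(x > 0, x, alpha * x)

def gelu(x):
    # GELU, tanh approximation used in many transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU / Swish: x * sigmoid(x); smooth, non-monotonic near zero
    return x / (1.0 + np.exp(-x))

def swiglu(x, W, V):
    # SwiGLU gate: SiLU(x @ W) elementwise-multiplied by (x @ V).
    # W and V are illustrative projection matrices of a gated feedforward layer.
    return silu(x @ W) * (x @ V)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negatives zeroed, positives unchanged
print(leaky_relu(x))  # negatives scaled by 0.01 instead of zeroed
print(gelu(x))        # smooth curve that approaches ReLU for large |x|
```

Note how ReLU's hard zero is exactly what causes dying neurons: once a neuron's pre-activation is always negative, both its output and its gradient are zero, which the leaky and smooth variants avoid.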