
Activation Function

An activation function is a mathematical transformation applied to each neuron's output in a neural network, introducing the non-linearity essential for learning complex patterns. Without activation functions, any stack of linear layers collapses into a single linear transformation: no matter how many layers you add, the network remains a linear model. Non-linear activations enable networks to approximate arbitrary functions, forming the theoretical basis for deep learning's power.

ReLU (Rectified Linear Unit) returns max(0, x), zeroing negative values while passing positive values through unchanged. Its simplicity and effectiveness made it the default choice for most deep learning. However, ReLU suffers from the "dying ReLU" problem: neurons whose inputs stay negative output zero, receive zero gradient, and stop learning. Variants such as Leaky ReLU (a small fixed negative slope instead of zero) and Parametric ReLU (a learned negative slope) address this.

Modern transformers typically use GELU (Gaussian Error Linear Unit) or SiLU/Swish, which have smooth gradients and often improve performance. Many recent architectures replace the standard transformer feedforward layer with SwiGLU (Swish-Gated Linear Unit), which gates one linear projection of the input with a SiLU-activated second projection.

Activation choice affects training dynamics, gradient flow, and final model quality. The right activation depends on architecture, task, and scale: what works for small models may not be optimal for large ones.
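The activations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the GELU shown here is the common tanh approximation, and the `swiglu` function assumes hypothetical projection matrices `W` and `V` standing in for the two linear layers of a gated feedforward block.

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x) -- zeroes negatives, passes positives unchanged
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: small negative slope alpha instead of a hard zero,
    # so negative inputs still carry gradient
    return np.where(x > 0, x, alpha * x)

def gelu(x):
    # GELU, tanh approximation used in many transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU / Swish: x * sigmoid(x); smooth, non-monotonic near zero
    return x / (1.0 + np.exp(-x))

def swiglu(x, W, V):
    # SwiGLU gate: SiLU(x @ W) elementwise-multiplied by (x @ V).
    # W and V are illustrative projection matrices of a gated feedforward layer.
    return silu(x @ W) * (x @ V)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negatives zeroed, positives unchanged
print(leaky_relu(x))  # negatives scaled by 0.01 instead of zeroed
print(gelu(x))        # smooth curve that approaches ReLU for large |x|
```

Note how ReLU's hard zero is exactly what causes dying neurons: once a neuron's pre-activation is always negative, both its output and its gradient are zero, which the leaky and smooth variants avoid.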