
Residual Connection

A residual connection, also called a skip connection, adds the input of a layer directly to its output, allowing information and gradients to flow through the network without passing through every transformation. If a layer computes f(x), the output with a residual connection is x + f(x). The layer learns only the residual, how much to modify the input, not the full output. This seemingly simple change makes it possible to train networks hundreds of layers deep.

Without residual connections, gradients must flow through every layer during backpropagation. Each layer's gradients multiply together, often causing exponential decay (vanishing gradients) or exponential growth (exploding gradients). Residual connections provide a direct gradient highway: gradients can flow straight from output to input, bypassing intermediate layers. Deep networks become trainable because gradients no longer have to survive the multiplicative chain.

Introduced in ResNet (2015), residual connections enabled the jump from tens of layers to hundreds, dramatically improving image classification accuracy.

Every transformer block uses residual connections: the attention output is added to its input, and the feedforward output is added to its input. This structure allows very deep transformers (GPT-3 has 96 layers) to train stably. The identity mapping perspective suggests residual networks learn increasingly refined representations, with each layer making small adjustments rather than complete rewrites.
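The definition above can be sketched in a few lines of numpy. This is a minimal illustration, not any particular library's API: the layer transformation f (a hypothetical ReLU layer) and the weight matrix W are stand-ins chosen for the example. Note that with untrained (zero) weights, the block reduces to the identity mapping.

```python
import numpy as np

def f(x, W):
    # Hypothetical single-layer transformation: ReLU(W @ x).
    return np.maximum(0.0, W @ x)

def residual_block(x, W):
    # The layer's output is its input plus the learned residual f(x).
    return x + f(x, W)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# An "untrained" layer whose transformation outputs all zeros:
W = np.zeros((4, 4))

# With f(x) = 0, the residual block passes the input through unchanged.
y = residual_block(x, W)
assert np.allclose(y, x)
```

Starting near the identity is part of why these blocks train well: each layer only has to learn a small correction on top of what it receives.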
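The gradient highway can be made concrete with a scalar toy model, a sketch under simplified assumptions (one number per layer, f(x) = tanh(w·x), hand-written derivatives). The derivative of x + f(x) is 1 + f'(x): the constant 1 is the skip connection's direct path, so each factor in the backpropagated product stays near 1 instead of shrinking toward 0.

```python
import numpy as np

def layer(x, w):
    # Toy residual layer: y = x + f(x) with f(x) = tanh(w * x).
    return x + np.tanh(w * x)

def grad_wrt_input(x, w):
    # dy/dx = 1 + w * (1 - tanh(w*x)**2).
    # The leading 1 comes from the skip connection; without it the
    # factor would be roughly w (here 0.01) and the product over
    # 100 layers would vanish.
    return 1.0 + w * (1.0 - np.tanh(w * x) ** 2)

x, w = 0.5, 0.01
grad = 1.0
for _ in range(100):
    grad *= grad_wrt_input(x, w)
    x = layer(x, w)

# Each factor is near 1, so the 100-layer product stays order one.
assert 1.0 < grad < 10.0
```

Dropping the skip connection makes each factor about 0.01, and the same 100-layer product underflows to effectively zero — the vanishing-gradient failure the text describes.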
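The transformer usage described above — each sublayer's output added to its input — can be sketched as follows. This is a simplified, hypothetical block (single attention head, no learned projections, pre-norm arrangement) meant only to show where the two residual additions sit, not a faithful reproduction of any production architecture.

```python
import numpy as np

def layer_norm(x):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + 1e-5)

def attention(x):
    # Stand-in for self-attention: one head, no Q/K/V projections.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def feedforward(x, W1, W2):
    # Two-layer MLP applied to each token independently.
    return np.maximum(0.0, x @ W1) @ W2

def transformer_block(x, W1, W2):
    # Both sublayer outputs are ADDED to their inputs — the two
    # residual connections in every transformer block.
    x = x + attention(layer_norm(x))             # residual around attention
    x = x + feedforward(layer_norm(x), W1, W2)   # residual around MLP
    return x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))            # 8 tokens, dimension 16
W1 = rng.standard_normal((16, 32)) * 0.1
W2 = rng.standard_normal((32, 16)) * 0.1
out = transformer_block(tokens, W1, W2)
assert out.shape == tokens.shape
```

Because each block only adds to a running stream, stacking many of them (96 in GPT-3) leaves the identity path from input to output intact all the way through the network.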