
Residual Connection

A residual connection, also called a skip connection, adds the input of a layer directly to its output, allowing information and gradients to flow through the network without passing through every transformation. If a layer computes f(x), the output with a residual connection is x + f(x). The layer learns only the residual, how much to modify the input, not the full output. This seemingly simple change makes it possible to train networks hundreds of layers deep.

Without residual connections, gradients must flow through every layer during backpropagation. Each layer's gradients multiply together, often causing exponential decay (vanishing gradients) or exponential growth (exploding gradients). Residual connections provide a direct gradient highway: gradients can flow straight from output to input, bypassing intermediate layers. Deep networks become trainable because gradients no longer have to survive the multiplicative chain.

Introduced in ResNet (2015), residual connections enabled the jump from tens of layers to hundreds, dramatically improving image classification accuracy.

Every transformer block uses residual connections: the attention output is added to its input, and the feedforward output is added to its input. This structure allows very deep transformers (GPT-3 has 96 layers) to train stably. The identity mapping perspective suggests residual networks learn increasingly refined representations, with each layer making small adjustments rather than complete rewrites.
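The definition above can be sketched in a few lines of numpy. This is a minimal illustration, not any particular library's API: the layer transformation f (a hypothetical ReLU layer) and the weight matrix W are stand-ins chosen for the example. Note that with untrained (zero) weights, the block reduces to the identity mapping.

```python
import numpy as np

def f(x, W):
    # Hypothetical single-layer transformation: ReLU(W @ x).
    return np.maximum(0.0, W @ x)

def residual_block(x, W):
    # The layer's output is its input plus the learned residual f(x).
    return x + f(x, W)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# An "untrained" layer whose transformation outputs all zeros:
W = np.zeros((4, 4))

# With f(x) = 0, the residual block passes the input through unchanged.
y = residual_block(x, W)
assert np.allclose(y, x)
```

Starting near the identity is part of why these blocks train well: each layer only has to learn a small correction on top of what it receives.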
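The gradient highway can be made concrete with a scalar toy model, a sketch under simplified assumptions (one number per layer, f(x) = tanh(w·x), hand-written derivatives). The derivative of x + f(x) is 1 + f'(x): the constant 1 is the skip connection's direct path, so each factor in the backpropagated product stays near 1 instead of shrinking toward 0.

```python
import numpy as np

def layer(x, w):
    # Toy residual layer: y = x + f(x) with f(x) = tanh(w * x).
    return x + np.tanh(w * x)

def grad_wrt_input(x, w):
    # dy/dx = 1 + w * (1 - tanh(w*x)**2).
    # The leading 1 comes from the skip connection; without it the
    # factor would be roughly w (here 0.01) and the product over
    # 100 layers would vanish.
    return 1.0 + w * (1.0 - np.tanh(w * x) ** 2)

x, w = 0.5, 0.01
grad = 1.0
for _ in range(100):
    grad *= grad_wrt_input(x, w)
    x = layer(x, w)

# Each factor is near 1, so the 100-layer product stays order one.
assert 1.0 < grad < 10.0
```

Dropping the skip connection makes each factor about 0.01, and the same 100-layer product underflows to effectively zero — the vanishing-gradient failure the text describes.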
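The transformer usage described above — each sublayer's output added to its input — can be sketched as follows. This is a simplified, hypothetical block (single attention head, no learned projections, pre-norm arrangement) meant only to show where the two residual additions sit, not a faithful reproduction of any production architecture.

```python
import numpy as np

def layer_norm(x):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + 1e-5)

def attention(x):
    # Stand-in for self-attention: one head, no Q/K/V projections.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def feedforward(x, W1, W2):
    # Two-layer MLP applied to each token independently.
    return np.maximum(0.0, x @ W1) @ W2

def transformer_block(x, W1, W2):
    # Both sublayer outputs are ADDED to their inputs — the two
    # residual connections in every transformer block.
    x = x + attention(layer_norm(x))             # residual around attention
    x = x + feedforward(layer_norm(x), W1, W2)   # residual around MLP
    return x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))            # 8 tokens, dimension 16
W1 = rng.standard_normal((16, 32)) * 0.1
W2 = rng.standard_normal((32, 16)) * 0.1
out = transformer_block(tokens, W1, W2)
assert out.shape == tokens.shape
```

Because each block only adds to a running stream, stacking many of them (96 in GPT-3) leaves the identity path from input to output intact all the way through the network.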