Gradient descent is the optimization algorithm that trains neural networks by iteratively adjusting parameters in the direction that reduces the loss function. The gradient of the loss points in the direction of steepest increase in error, so moving the parameters in the opposite direction, by a small step scaled by the learning rate, descends toward lower loss. The process repeats until the loss stops decreasing significantly.

The main variants differ in how much data each step uses. Batch gradient descent computes the gradient over the entire training set, which is accurate but slow for large datasets. Stochastic gradient descent uses a single random example per step, which is noisy but fast. Mini-batch gradient descent, the practical default, uses small batches of examples, balancing accuracy and speed while enabling GPU parallelization.

The learning rate is critical: too large and the algorithm overshoots minima, oscillating or diverging; too small and training is slow and can stall on plateaus or in poor local minima. Learning rate schedules decrease the rate during training, taking large steps early and finer steps later.

Adam, AdaGrad, and RMSprop are adaptive optimizers that maintain per-parameter learning rates, adjusting them automatically based on gradient history. Momentum adds inertia, helping the optimizer roll past shallow local minima and smoothing noisy gradients. Modern deep learning relies on these gradient descent variants to navigate loss landscapes with billions of parameters.
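The update rule and the mini-batch variant described above can be sketched in a few lines of NumPy. This is an illustrative example on linear regression, not code from any particular framework; the variable names (`w`, `lr`, `batch_size`) and the synthetic data are assumptions for the sketch.

```python
import numpy as np

# Synthetic linear-regression problem: recover true_w from noisy targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)      # parameters, initialized at zero
lr = 0.1             # learning rate: size of each step
batch_size = 32      # mini-batch size (the practical default regime)

for epoch in range(100):
    idx = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        pred = X[b] @ w
        # Gradient of mean squared error on this mini-batch:
        # points toward steepest increase in loss.
        grad = 2 * X[b].T @ (pred - y[b]) / len(b)
        w -= lr * grad                         # step in the opposite direction
```

Setting `batch_size = len(X)` turns this into batch gradient descent, and `batch_size = 1` into stochastic gradient descent; only the inner loop's slice changes.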
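Learning rate schedules of the kind mentioned above are simple functions of the epoch number. The two below (step decay and exponential decay) are common conventions; the helper names and default constants are illustrative, not from any specific library.

```python
import math

def step_decay(base_lr, epoch, drop=0.5, every=10):
    """Multiply the rate by `drop` every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def exponential_decay(base_lr, epoch, k=0.05):
    """Shrink the rate continuously at decay constant k."""
    return base_lr * math.exp(-k * epoch)

# Large steps early, finer steps later:
print(step_decay(0.1, 0))    # → 0.1
print(step_decay(0.1, 25))   # → 0.025 (halved twice)
```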