Gradient descent is the optimization algorithm that trains neural networks by iteratively adjusting parameters in the direction that reduces the loss function. The gradient of the loss points in the direction of steepest increase in error, so moving the parameters in the opposite direction, by a small step scaled by the learning rate, descends toward lower loss. The process repeats until the loss stops decreasing significantly.

The main variants differ in how much data each step uses. Batch gradient descent computes the gradient over the entire training set, which is accurate but slow for large datasets. Stochastic gradient descent uses a single random example per step, which is noisy but fast. Mini-batch gradient descent, the practical default, uses small batches of examples, balancing accuracy and speed while enabling GPU parallelization.

The learning rate is critical: too large and the algorithm overshoots minima, oscillating or diverging; too small and training is slow and can stall on plateaus or in poor local minima. Learning rate schedules decrease the rate during training, taking large steps early and finer steps later.

Adam, AdaGrad, and RMSprop are adaptive optimizers that maintain per-parameter learning rates, adjusting them automatically based on gradient history. Momentum adds inertia, helping the optimizer roll past shallow local minima and smoothing noisy gradients. Modern deep learning relies on these gradient descent variants to navigate loss landscapes with billions of parameters.
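The update rule and the mini-batch variant described above can be sketched in a few lines of NumPy. This is an illustrative example on linear regression, not code from any particular framework; the variable names (`w`, `lr`, `batch_size`) and the synthetic data are assumptions for the sketch.

```python
import numpy as np

# Synthetic linear-regression problem: recover true_w from noisy targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)      # parameters, initialized at zero
lr = 0.1             # learning rate: size of each step
batch_size = 32      # mini-batch size (the practical default regime)

for epoch in range(100):
    idx = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        pred = X[b] @ w
        # Gradient of mean squared error on this mini-batch:
        # points toward steepest increase in loss.
        grad = 2 * X[b].T @ (pred - y[b]) / len(b)
        w -= lr * grad                         # step in the opposite direction
```

Setting `batch_size = len(X)` turns this into batch gradient descent, and `batch_size = 1` into stochastic gradient descent; only the inner loop's slice changes.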
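Learning rate schedules of the kind mentioned above are simple functions of the epoch number. The two below (step decay and exponential decay) are common conventions; the helper names and default constants are illustrative, not from any specific library.

```python
import math

def step_decay(base_lr, epoch, drop=0.5, every=10):
    """Multiply the rate by `drop` every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def exponential_decay(base_lr, epoch, k=0.05):
    """Shrink the rate continuously at decay constant k."""
    return base_lr * math.exp(-k * epoch)

# Large steps early, finer steps later:
print(step_decay(0.1, 0))    # → 0.1
print(step_decay(0.1, 25))   # → 0.025 (halved twice)
```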