Gradient Descent and Learning Rate
- Gradient Descent is an iterative optimization algorithm used to minimize a model's cost function by adjusting parameters in the direction of the steepest descent.
- The learning rate is a critical hyperparameter that dictates the size of the steps taken during each iteration of the optimization process.
- A learning rate that is too high causes divergence; one that is too low leads to agonizingly slow convergence or getting stuck in local minima.
- Modern variants like Adam or RMSprop adapt the learning rate dynamically, reducing the burden of manual tuning.
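The core update rule behind all of these methods fits in a few lines. The sketch below is a minimal, library-free illustration; the quadratic example function, step size, and iteration count are arbitrary choices for demonstration, not defaults from any framework:

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step opposite the gradient: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # converges toward 3.0
```

The loop is the whole algorithm; everything that follows in this lesson (learning rates, momentum, Adam) is about choosing or adapting the step inside it.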
Why It Matters
Gradient Descent is the backbone of training LLMs like those from OpenAI or Google. With billions of parameters, optimizers such as Adam navigate a high-dimensional space to minimize cross-entropy loss. Without efficient gradient-based methods, training these models would be computationally infeasible.
Companies like JPMorgan Chase use gradient-based optimization for algorithmic trading and risk assessment. Gradient descent refines predictive models against historical market data, with careful learning rate tuning ensuring models adapt to changing conditions without overfitting to noise.
Healthcare models from Siemens Healthineers use gradient descent to train CNNs that identify anomalies in X-rays and MRIs. Careful learning rate tuning ensures the model generalizes well to new patients rather than memorizing training images.
How it Works
The Mountain Descent Analogy
Imagine you are standing on top of a foggy mountain range, and your goal is to reach the lowest point of the valley. Because of the thick fog, you cannot see the path ahead. However, you can feel the slope of the ground beneath your feet. The most logical strategy is to feel the direction of the steepest downward slope and take a step in that direction. After each step, you pause, re-evaluate the slope, and take another step. This is the essence of Gradient Descent. In machine learning, the "mountain" is our cost function, the "location" represents our model's current parameters (weights), and the "step" we take is determined by the learning rate.
The Mechanics of Optimization
Gradient Descent works by iteratively updating model parameters to minimize the error. We start with random weights and calculate the gradient — the direction that increases the error the fastest. By taking a step in the opposite direction (the negative gradient), we move toward lower error. The size of this step is crucial. If the step is too large, we might overshoot the valley floor and end up on the other side, potentially climbing higher. If the step is too small, we will take an eternity to reach the bottom, and we might get stuck in a shallow "puddle" (a local minimum) that isn't the true bottom of the valley.
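The overshooting and slow-crawl behaviors described above can be observed directly on the simplest possible landscape, f(x) = x², whose gradient is 2x. The specific rates below are illustrative picks, not recommendations:

```python
def descend(learning_rate, x0=1.0, steps=20):
    """Run gradient descent on f(x) = x^2 and return the final distance from 0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2 * x  # update: x <- x - lr * f'(x)
    return abs(x)

print(descend(0.1))    # moderate rate: shrinks steadily toward the minimum
print(descend(0.001))  # tiny rate: barely moves in 20 steps
print(descend(1.1))    # too large: each step overshoots and the error grows
```

With a rate of 0.1 each step multiplies the error by 0.8; with 1.1 it multiplies the error by -1.2, so the iterate bounces across the valley with increasing amplitude, which is exactly the divergence described above.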
Challenges in High-Dimensional Landscapes
In real-world machine learning, we aren't just dealing with a two-dimensional mountain; we are dealing with thousands or millions of dimensions. The "landscape" of a deep neural network is incredibly complex, filled with plateaus, narrow canyons, and saddle points. A saddle point is a point where the gradient is zero but which is not a minimum: the surface curves downward in some directions and upward in others. Standard Gradient Descent often struggles here because the gradient becomes very small, causing the training process to stall. This is why we use advanced techniques like momentum, which helps the optimizer "roll" through flat regions by accumulating velocity from previous steps, effectively pushing the model through areas where the gradient is weak.
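Momentum can be sketched as a two-line change to the basic update: instead of stepping along the raw gradient, keep an exponentially decaying running sum of past gradients and step along that. The quadratic test function and the hyperparameter values below are illustrative assumptions:

```python
def momentum_descent(grad, x0, learning_rate=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: velocity accumulates past gradients."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)     # decay old velocity, add the current gradient
        x = x - learning_rate * v  # step along the accumulated velocity
    return x

# On f(x) = (x - 3)^2, the velocity term keeps the iterate moving
# even through stretches where the instantaneous gradient is weak.
result = momentum_descent(lambda x: 2 * (x - 3), x0=0.0)
print(result)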
Common Pitfalls
- "Gradient descent always finds the global minimum." False — it is a local optimizer. In non-convex landscapes it can get stuck in a local minimum or saddle point, which is why techniques like random restarts or momentum are used.
- "A smaller learning rate is always better." Not true. Too small and the model trains agonizingly slowly and may get stuck in a local minimum that a larger step could have jumped over.
- "The gradient is the same as the cost." The gradient is the derivative of the cost function. The cost tells you how bad the model is; the gradient tells you how to change parameters to improve it.
- "You only need to set the learning rate once." Modern practice uses learning rate schedulers that reduce the rate over time — large steps early, fine-tuning near convergence.
Sample Code
Gradient Descent on a simple linear regression problem using NumPy:
import numpy as np
# Data: y = 4 + 3x (intercept=4, slope=3)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Parameters
theta = np.random.randn(2, 1) # Random initialization
learning_rate = 0.1
iterations = 1000
m = len(X)
# Add bias term (x0 = 1)
X_b = np.c_[np.ones((100, 1)), X]
for iteration in range(iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - learning_rate * gradients
print(f"Final parameters: {theta.flatten()}")
# Output: approximately [4.0, 3.0]; exact values vary with the random data