Gradient Descent and Learning Rate
- Gradient Descent is an iterative optimization algorithm used to minimize a model's cost function by adjusting parameters in the direction of the steepest descent.
- The learning rate is a critical hyperparameter that dictates the size of the steps taken during each iteration of the optimization process.
- A learning rate that is too high causes divergence; one that is too low leads to agonizingly slow convergence or getting stuck in local minima.
- Modern variants like Adam or RMSprop adapt the learning rate dynamically, reducing the burden of manual tuning.
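The core update rule behind all of these methods fits in a few lines. The sketch below is a minimal, library-free illustration; the quadratic example function, step size, and iteration count are arbitrary choices for demonstration, not defaults from any framework:

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step opposite the gradient: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # converges toward 3.0
```

The loop is the whole algorithm; everything that follows in this lesson (learning rates, momentum, Adam) is about choosing or adapting the step inside it.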
Why It Matters
Gradient Descent is the backbone of training LLMs like those from OpenAI or Google. With billions of parameters, optimizers such as Adam navigate a high-dimensional space to minimize cross-entropy loss. Without efficient gradient-based methods, training these models would be computationally infeasible.
Companies like JPMorgan Chase use gradient-based optimization for algorithmic trading and risk assessment. Gradient descent refines predictive models against historical market data, with careful learning rate tuning ensuring models adapt to changing conditions without overfitting to noise.
Healthcare models from Siemens Healthineers use gradient descent to train CNNs that identify anomalies in X-rays and MRIs. Careful learning rate tuning ensures the model generalizes well to new patients rather than memorizing training images.
How it Works
The Mountain Descent Analogy
Imagine you are standing on top of a foggy mountain range, and your goal is to reach the lowest point of the valley. Because of the thick fog, you cannot see the path ahead. However, you can feel the slope of the ground beneath your feet. The most logical strategy is to feel the direction of the steepest downward slope and take a step in that direction. After each step, you pause, re-evaluate the slope, and take another step. This is the essence of Gradient Descent. In machine learning, the "mountain" is our cost function, the "location" represents our model's current parameters (weights), and the "step" we take is determined by the learning rate.
The Mechanics of Optimization
Gradient Descent works by iteratively updating model parameters to minimize the error. We start with random weights and calculate the gradient — the direction that increases the error the fastest. By taking a step in the opposite direction (the negative gradient), we move toward lower error. The size of this step is crucial. If the step is too large, we might overshoot the valley floor and end up on the other side, potentially climbing higher. If the step is too small, we will take an eternity to reach the bottom, and we might get stuck in a shallow "puddle" (a local minimum) that isn't the true bottom of the valley.
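The overshooting and slow-crawl behaviors described above can be observed directly on the simplest possible landscape, f(x) = x², whose gradient is 2x. The specific rates below are illustrative picks, not recommendations:

```python
def descend(learning_rate, x0=1.0, steps=20):
    """Run gradient descent on f(x) = x^2 and return the final distance from 0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2 * x  # update: x <- x - lr * f'(x)
    return abs(x)

print(descend(0.1))    # moderate rate: shrinks steadily toward the minimum
print(descend(0.001))  # tiny rate: barely moves in 20 steps
print(descend(1.1))    # too large: each step overshoots and the error grows
```

With a rate of 0.1 each step multiplies the error by 0.8; with 1.1 it multiplies the error by -1.2, so the iterate bounces across the valley with increasing amplitude, which is exactly the divergence described above.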
Challenges in High-Dimensional Landscapes
In real-world machine learning, we aren't just dealing with a two-dimensional mountain; we are dealing with thousands or millions of dimensions. The "landscape" of a deep neural network is incredibly complex, filled with plateaus, narrow canyons, and saddle points. A saddle point is a point where the gradient is zero but which is not a minimum: the surface curves downward in some directions and upward in others. Standard Gradient Descent often struggles here because the gradient becomes very small, causing the training process to stall. This is why we use advanced techniques like momentum, which helps the optimizer "roll" through flat regions by accumulating velocity from previous steps, effectively pushing the model through areas where the gradient is weak.
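Momentum can be sketched as a two-line change to the basic update: instead of stepping along the raw gradient, keep an exponentially decaying running sum of past gradients and step along that. The quadratic test function and the hyperparameter values below are illustrative assumptions:

```python
def momentum_descent(grad, x0, learning_rate=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: velocity accumulates past gradients."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)     # decay old velocity, add the current gradient
        x = x - learning_rate * v  # step along the accumulated velocity
    return x

# On f(x) = (x - 3)^2, the velocity term keeps the iterate moving
# even through stretches where the instantaneous gradient is weak.
result = momentum_descent(lambda x: 2 * (x - 3), x0=0.0)
print(result)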
Common Pitfalls
- "Gradient descent always finds the global minimum." False — it is a local optimizer. In non-convex landscapes it can get stuck in a local minimum or saddle point, which is why techniques like random restarts or momentum are used.
- "A smaller learning rate is always better." Not true. Too small and the model trains agonizingly slowly and may get stuck in a local minimum that a larger step could have jumped over.
- "The gradient is the same as the cost." The gradient is the derivative of the cost function. The cost tells you how bad the model is; the gradient tells you how to change parameters to improve it.
- "You only need to set the learning rate once." Modern practice uses learning rate schedulers that reduce the rate over time — large steps early, fine-tuning near convergence.
Sample Code
Gradient Descent on a simple linear regression problem using NumPy:
import numpy as np
# Data: y = 4 + 3x (intercept=4, slope=3)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Parameters
theta = np.random.randn(2, 1) # Random initialization
learning_rate = 0.1
iterations = 1000
m = len(X)
# Add bias term (x0 = 1)
X_b = np.c_[np.ones((100, 1)), X]
for iteration in range(iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - learning_rate * gradients
print(f"Final parameters: {theta.flatten()}")
# Output: approximately [4.0, 3.0]; exact values vary with the random data