L1 Lasso vs L2 Ridge
- L2 Ridge regularization adds a penalty proportional to the square of the magnitude of coefficients, effectively shrinking them toward zero without eliminating them.
- L1 Lasso regularization adds a penalty proportional to the absolute value of coefficients, which can drive some feature weights to exactly zero, performing automatic feature selection.
- Ridge is preferred when you expect most features to contribute to the output, while Lasso is superior for high-dimensional datasets where sparsity is desired.
- Elastic Net combines both L1 and L2 penalties to leverage the benefits of both, particularly when features are highly correlated.
Why It Matters
In gene expression studies, researchers often have thousands of potential genetic markers (features) but only a small number of samples. Lasso is frequently used here to identify a small subset of genes that are truly predictive of a disease, since it sets the coefficients of the thousands of irrelevant markers to exactly zero rather than letting them add noise to the model.
Financial institutions use Ridge regression when building credit risk models where hundreds of economic indicators are available. Since many of these indicators (like GDP, interest rates, and consumer confidence) are correlated, Ridge helps stabilize the model by distributing the weight across all features rather than picking one and ignoring the others, which leads to more consistent risk assessment.
Companies use Elastic Net (a combination of L1 and L2) to determine which marketing channels—such as social media ads, email campaigns, or TV spots—contribute most to sales. Because marketing data is often highly collinear, the L1 component helps eliminate ineffective channels, while the L2 component ensures that the remaining effective channels are weighted appropriately, providing a balanced view of campaign performance.
How It Works
The Intuition of Regularization
At the heart of machine learning is the desire to build models that generalize well. When we train a linear regression model, we minimize the sum of squared residuals. However, if our dataset has many features—or if those features are noisy—the model might try to "memorize" the training data by assigning large, erratic weights to specific features. This is overfitting. Regularization acts as a "budget" for our weights. It tells the model: "You can reduce your error, but every unit of weight you use costs money." By penalizing large weights, we force the model to find a simpler, more robust solution.
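Concretely, for linear regression with weight vector $w$, the two penalties modify the least-squares objective as follows (these are the standard formulations, with $\lambda \ge 0$ controlling the strength of the penalty):

$$\text{Ridge:} \quad \min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2$$

$$\text{Lasso:} \quad \min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_1$$

At $\lambda = 0$ both reduce to ordinary least squares; as $\lambda$ grows, the weights are pushed harder toward zero.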
Ridge Regression (L2): The Gentle Shrinker
Ridge regression, or Tikhonov regularization, adds a penalty term equal to the square of the magnitude of the coefficients. Because the penalty is squared, large weights are penalized much more heavily than small ones. However, the penalty for a weight near zero is very small. Consequently, Ridge regression tends to shrink all coefficients toward zero, but it rarely makes them exactly zero. This is ideal when you have many features that are all potentially relevant, but you want to prevent any single feature from dominating the prediction due to noise.
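To see this shrinking behavior in action, here is a minimal sketch using scikit-learn's Ridge on synthetic data (the alpha values are illustrative, not tuned). The average coefficient magnitude falls as the penalty grows, yet no coefficient lands exactly on zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data: 100 samples, 5 features, all carrying signal
X, y = make_regression(n_samples=100, n_features=5, noise=5, random_state=0)

for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>8}: mean |coef| = {np.abs(model.coef_).mean():8.3f}, "
          f"exact zeros = {np.sum(model.coef_ == 0)}")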
Lasso Regression (L1): The Feature Selector
Lasso (Least Absolute Shrinkage and Selection Operator) uses the absolute value of the coefficients as the penalty. Unlike the squared penalty of Ridge, the absolute value penalty has a constant slope. As the optimization algorithm moves toward the minimum, the L1 penalty exerts a constant pressure to push weights toward zero. When a weight reaches zero, the L1 penalty effectively "locks" it there. This makes Lasso a powerful tool for feature selection; it automatically identifies the most important variables and discards the rest by setting their coefficients to zero.
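A matching sketch with scikit-learn's Lasso (again with illustrative, untuned alpha values) shows the opposite behavior: as the penalty grows, more and more coefficients are set to exactly zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, but only 3 carry signal
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    print(f"alpha={alpha:>5}: non-zero coefficients = {np.sum(model.coef_ != 0)}")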
Comparing the Two
The fundamental difference lies in the geometry of the constraint. If you visualize the "budget" of weights as a shape, L2 regularization creates a circular constraint (a hypersphere), while L1 regularization creates a diamond-shaped constraint (a cross-polytope). Because the diamond has sharp corners on the coordinate axes, the loss contours are likely to first touch the constraint region at a corner or edge, where one or more coefficients are exactly zero. This geometric property is why Lasso produces sparse models while Ridge produces dense models with small weights.
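The same contrast appears in the one-dimensional closed forms, which are standard textbook results: the L2 penalty shrinks a coefficient multiplicatively, while the L1 penalty applies soft-thresholding, snapping small values to exactly zero. A minimal sketch ($\lambda = 1$ is chosen purely for illustration):

import numpy as np

def ridge_shrink(z, lam):
    # Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: multiplicative shrinkage
    return z / (1.0 + lam)

def lasso_shrink(z, lam):
    # Minimizer of 0.5*(w - z)**2 + lam*|w|: soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

for z in [3.0, 0.5, -0.2]:
    print(f"z = {z:+.1f} -> ridge: {ridge_shrink(z, 1.0):+.3f}, "
          f"lasso: {lasso_shrink(z, 1.0):+.3f}")

Note how the small inputs (0.5 and -0.2) land exactly on zero under the L1 rule but are merely scaled down under L2.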
Common Pitfalls
- "Lasso is always better because it performs feature selection." Lasso is only better if the underlying truth is indeed sparse. If the true model depends on many small effects, Lasso will discard important information, leading to higher bias than Ridge.
- "Regularization is only for linear models." While L1 and L2 are most famous in linear regression, they are applied to almost all machine learning models, including neural networks (weight decay) and logistic regression. The principle of penalizing complexity remains universal.
- "Increasing $\lambda$ always improves model performance." Increasing reduces variance but increases bias. If you set too high, you will underfit the data, resulting in a model that is too simple to capture the underlying patterns.
- "L1 and L2 are the only types of regularization." While L1 and L2 are the most common, there are many others, such as Elastic Net (a mix of both), Dropout in neural networks, and early stopping. Each serves a different purpose in controlling model complexity.
Sample Code
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
# Generate synthetic data: 100 samples, 20 features, only 5 are informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Ridge Regression (L2)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Lasso Regression (L1)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# Output comparison: Lasso zeroes out uninformative features, Ridge keeps them all
print(f"Ridge coefficients (non-zero): {np.sum(ridge.coef_ != 0)}")
print(f"Lasso coefficients (non-zero): {np.sum(lasso.coef_ != 0)}")
# Evaluate generalization on the held-out split
print(f"Ridge test R^2: {ridge.score(X_test, y_test):.3f}")
print(f"Lasso test R^2: {lasso.score(X_test, y_test):.3f}")
# Expected Output (coefficient counts; the R^2 values depend on the split):
# Ridge coefficients (non-zero): 20
# Lasso coefficients (non-zero): 6