What is Regularization in Machine Learning?

Regularization helps to solve the problem of overfitting in machine learning.

How well a model fits its training data influences how well it performs on unseen data. Poor performance can result from either overfitting or underfitting. Overfitting is a phenomenon where the model tries to account for every point in the training dataset, which makes it sensitive to small fluctuations in the training set. We say such a model has high variance, and it is likely to have fit random noise in the data. Noise refers to data points that do not represent the true properties of the data but occur due to random chance. Regularization simplifies an overfit model by penalizing the coefficient estimates in the linear model’s loss function. This article goes through the main types of regularization: LASSO (L1 regularization), Ridge (L2 regularization), and Elastic Net.

How does Overfitting Occur?

Overfitting means the model matches the data “too closely” and learns from noise in the data rather than the actual signal. Overfitting can occur when we expose a model to the same data too many times or build a model that is too complex. The figure below shows an example of a fit to a dataset with two classes.

Examples of fitting a sigmoid function to a 2-dimensional dataset. Source: Me

The first figure shows an example of a model that is too simplistic or has not seen the data a sufficient number of times during training. The model cannot find structure in the data, so it does not fit properly and will not perform well on unseen data. We refer to the scenario where a machine learning model cannot learn the relationships between variables in the data as underfitting.

The second figure shows an example of a model that has an appropriate level of complexity or has seen the data a sufficient number of times during training. A model is a good fit when it can find all the necessary patterns in the data and is not susceptible to spurious patterns or noise.

The third figure shows an example of a model that is too complex or has seen the data too many times during training and has found patterns in the data that are unnecessary for fitting. We refer to a scenario where a model attempts to fit each data point, including noise, as overfitting. An overfit model will not predict very well on unseen data.
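The three regimes described above can be reproduced numerically. The following is a minimal sketch that assumes a regression setting rather than the two-class example in the figure: it fits polynomials of increasing degree to a small, noisy toy dataset and reports the error on the training data and on fresh data. The dataset, the degrees and the helper names (`make_data`, `mse`) are illustrative choices, not part of the original article.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth curve (hypothetical toy data)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.3, n)
    return x, y

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((y - model(x)) ** 2))

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

errors = {}
for degree in (1, 3, 9):  # too simple, moderate, very flexible
    model = Polynomial.fit(x_train, y_train, degree)  # least squares polynomial fit
    errors[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE {errors[degree][0]:.3f}, "
          f"test MSE {errors[degree][1]:.3f}")
```

The training error can only go down as the degree grows, because each polynomial contains the lower-degree ones as special cases. The error on fresh data is the number to watch when judging whether a model underfits or overfits.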

What is the Bias-Variance Tradeoff?

Bias occurs when a model has limited flexibility to learn from data. Think about bias as an opinion that is not based on any data or patterns in the data. For example, in the underfit model above, the model predicts a linear relationship, even though the data implies otherwise. That model is therefore biased towards a linear relationship. A model with high bias pays little attention to the training data and will have a high error on both training and test data. High bias causes underfitting in a model.

Variance measures how sensitive a model is to the variability of the training data. A model with high variance is sensitive to all the variations and quirks of the data that are useful for learning the training data but not for performing on unseen data. Models with high variance perform very well on training data but will have high error rates on test data. High variance causes overfitting in a model.

During model optimization, we aim to balance bias and variance. We want the model to learn patterns, but we also want it to generalize to new data. In other words, we want the model to learn enough that it does not carry an uninformed bias, but not so much about the training data that it cannot adapt to anything new. We refer to this balancing act as the bias-variance trade-off. It is typically impossible to minimize both simultaneously, which is why it is also called the bias-variance dilemma. Handling bias and variance is equivalent to handling underfitting and overfitting. Let’s look at a visual representation of how bias and variance change with model complexity.

Figure showing bias and variance contributing to the total error. Source

Here, the total error is the sum of the bias and variance error terms. As we add more parameters to the model, its complexity rises, and so does the variance. Variance becomes the primary concern while bias falls, and beyond the optimal complexity the total error rises as well. The model would have a low training error (measured on the data used to fit it) but a high prediction error (measured on unseen data). If we remove parameters, model complexity reduces, and so does variance. If we simplify beyond the optimal model complexity, the bias rises, and so does the total error. In this case, the model would have both a high training error and a high prediction error.

The sweet spot for any model is the level of complexity at which the increase in bias is equivalent to the reduction in variance. We can describe this mathematically as:

\frac{d\,\mathrm{Bias}}{d\,\mathrm{Complexity}} = -\frac{d\,\mathrm{Variance}}{d\,\mathrm{Complexity}}

If model complexity exceeds this sweet spot, we are overfitting. If it falls short of the sweet spot, we are underfitting. To determine the best level of complexity, we estimate the prediction error at differing levels of model complexity, then choose the level that minimizes the overall error.
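To make the two error terms concrete, here is a hedged sketch that estimates them by simulation. It assumes a known “true” signal, which is only available in a toy setting, refits a polynomial on many freshly drawn noisy training sets, and measures how far the average prediction is from the truth (squared bias) and how much the predictions scatter around their own average (variance). The signal, noise level and degrees are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)   # fixed inputs, reused for every training set
f_true = np.sin(3.0 * x)         # hypothetical true signal
noise_sd = 0.3
n_repeats = 300

def bias_variance(degree):
    """Estimate squared bias and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((n_repeats, x.size))
    for r in range(n_repeats):
        y = f_true + rng.normal(0.0, noise_sd, x.size)  # fresh noisy training set
        preds[r] = Polynomial.fit(x, y, degree)(x)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - f_true) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

summary = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in summary.items():
    print(f"degree {d}: bias^2 {b:.4f}, variance {v:.4f}, sum {b + v:.4f}")
```

In this setup, the straight line carries most of its error as bias, while the degree-9 polynomial carries most of it as variance. The degree with the smallest sum is the closest of the three to the sweet spot.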

What is Regularization?

Regularization refers to the collection of techniques used to tune machine learning models by minimizing an adjusted loss function to prevent overfitting. Using regularization, we are simplifying our model to an appropriate level such that it can generalize to unseen test data.

Below you can see a visual example of overfitting and the effect of regularization to produce a good fit.

Effect of regularization on an overfit model to give a good fit. Source: Me

How does Regularization Work?

Consider a linear model with p coefficients that we fit to the data:

Y \approx \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{p}X_{p}

Where Y represents the dependent variable, X_{1}, X_{2},..., X_{p} are the independent features or predictors for Y, \beta_{1},..., \beta_{p} represent the coefficient estimates for the predictors, and \beta_{0} represents the intercept.

Linear regression aims to optimize the coefficients and intercept to find a good fit for the data. The loss function is the sum of the squared differences between the known true values y_{i} and our predictions \hat{y}_{i}. We refer to this function as the residual sum of squares or RSS.

RSS = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}
Residual sum of squares equation

When we substitute the \hat{y}_{i} values with the coefficients into the loss function, we get the equation that describes fitting a linear model:

RSS = \sum_{i=1}^{n} (y_{i} - \beta_{0} - \beta_{1}x_{i1} - \beta_{2}x_{i2} - \dots - \beta_{p}x_{ip})^{2}
Residual sum of squares equation with coefficient substitution

We can use another summation term for the coefficient estimates to simplify the equation.

RSS = \sum_{i=1}^{n} \left( y_{i} - \beta_{0} - \sum_{j=1}^{p} \beta_{j}x_{ij} \right)^{2}
Loss Function for Linear Regression

The fitting procedure involves choosing coefficients that minimize the loss function. We adjust the coefficient estimates using the training data. If the model is overfitting, the coefficient estimates tend to be extreme (very large in magnitude), and the model will not generalize well to future data. Regularization shrinks, or regularizes, the coefficient estimates to avoid this.
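As a small illustration of this fitting procedure, the sketch below computes the RSS for a given intercept and coefficient vector and finds the least squares estimates with NumPy. The simulated data and the names `true_beta` and `rss` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])  # hypothetical coefficients
y = 4.0 + X @ true_beta + rng.normal(0.0, 0.5, n)

def rss(beta0, beta, X, y):
    """Residual sum of squares for intercept beta0 and coefficients beta."""
    residuals = y - beta0 - X @ beta
    return float(residuals @ residuals)

# Least squares: add a column of ones so the intercept is estimated too.
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta0_hat, beta_hat = coef[0], coef[1:]

print("intercept:", round(float(beta0_hat), 3))
print("coefficients:", np.round(beta_hat, 3))
print("RSS at the estimates:", round(rss(beta0_hat, beta_hat, X, y), 3))
```

Any other choice of intercept and coefficients, including the values used to simulate the data, gives an RSS at least as large on this training set. That is exactly what lets an unregularized fit chase noise.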

Types of Regularization

LASSO Regression

LASSO (Least Absolute Shrinkage and Selection Operator) regression is a regularization technique for linear models. LASSO regression is also called L1 regularization. Using LASSO regression, we add a penalty term to the RSS function that contains the absolute values of the coefficients.

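\sum_{i=1}^{n} \left( y_{i} - \beta_{0} - \sum_{j=1}^{p} \beta_{j}x_{ij} \right)^{2} + \lambda \sum_{j=1}^{p} |\beta_{j}| = RSS + \lambda \sum_{j=1}^{p} |\beta_{j}|
Loss Function for Lasso Regression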

Because we are using absolute values of the coefficients, we can set the tuning parameter \lambda to a sufficiently large value forcing some of the coefficient estimates to exactly zero. Let’s look at how different values of \lambda impact the coefficient estimates in LASSO regression.

  • \lambda = 0: The penalty term has no effect, and the coefficient estimates are equal to the ordinary least squares estimates, which minimize the RSS alone.
  • \lambda \to \infty: As \lambda grows, more and more coefficient estimates are set to zero. In the limit, no feature is used by the model.
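The two limiting cases above can be checked with a short coordinate descent sketch. This is not a production solver; it assumes the loss takes the form RSS + \lambda \sum |\beta_{j}| with an unpenalized intercept, and it updates one coefficient at a time by soft-thresholding. The function names, the simulated data and `lam_max` are illustrative.

```python
import numpy as np

def soft_threshold(rho, threshold):
    """Shrink rho towards zero by `threshold`, returning 0 if it is smaller."""
    return np.sign(rho) * max(abs(rho) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=500):
    """Minimise RSS + lam * sum(|beta_j|); the intercept is not penalised."""
    Xc = X - X.mean(axis=0)  # centring removes the intercept from the problem
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back in.
            partial_residual = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    beta0 = y.mean() - X.mean(axis=0) @ beta
    return beta0, beta

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0])  # hypothetical: two relevant predictors
y = 1.0 + X @ true_beta + rng.normal(0.0, 1.0, n)

# Above this value of lambda, every coefficient is thresholded to zero.
Xc, yc = X - X.mean(axis=0), y - y.mean()
lam_max = 2.0 * np.max(np.abs(Xc.T @ yc))

for lam in (0.0, 0.1 * lam_max, 0.5 * lam_max, 1.01 * lam_max):
    _, beta = lasso_coordinate_descent(X, y, lam)
    print(f"lambda = {lam:8.2f}: {int(np.sum(beta == 0))} zero coefficients, "
          f"beta = {np.round(beta, 3)}")
```

With \lambda = 0 the routine reproduces the least squares fit, and once \lambda exceeds `lam_max` every coefficient is exactly zero. The values in between show coefficients dropping out one by one.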

Use cases for Lasso Regression

LASSO regularization favours simple, sparse models or models with fewer parameters. LASSO regression is useful for automating certain parts of model selection, including feature selection or elimination.

Limitations of Lasso Regression

  • LASSO selects features automatically, so it may eliminate predictors that are highly relevant and that a human analyst would have kept in the model.
  • If the number of predictors, p, is greater than the number of observations, n, LASSO will pick at most n predictors as non-zero, even if all of the predictors are relevant or may be used in the test set.
  • If there are two or more highly collinear variables, then LASSO regression tends to select one of them more or less arbitrarily, which limits the interpretability of the model.

Ridge Regression

Ridge regression, also known as L2 regularization, introduces a small amount of bias to the model through a penalty term called the Ridge regression penalty. We calculate the penalty term by multiplying \lambda by the sum of the squared coefficients of the predictors. The equation for the cost function in Ridge regression is, therefore:

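\sum_{i=1}^{n} \left( y_{i} - \beta_{0} - \sum_{j=1}^{p} \beta_{j}x_{ij} \right)^{2} + \lambda \sum_{j=1}^{p} \beta_{j}^{2} = RSS + \lambda \sum_{j=1}^{p} \beta_{j}^{2}
Loss Function for Ridge Regression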

\lambda is a tuning parameter that decides how much we want to penalize our model. Tuning the parameter balances the amount of emphasis given to minimizing the RSS part of the Ridge regression loss function versus minimizing the sum of the square of coefficients. Let’s look at how different values of \lambda impact the coefficient estimates in Ridge regression.

  • \lambda = 0: The penalty term has no effect, and the coefficient estimates are equal to the ordinary least squares estimates, which minimize the RSS alone.
  • \lambda \to \infty: The coefficient estimates tend to zero in order to minimize the squared-coefficient term. In other words, we ignore the core RSS function and focus on minimizing the penalty term.
  • 0 < \lambda < \infty: For simple linear regression, the coefficient estimate lies between 0 and the least squares estimate.

Because the Ridge penalty is quadratic in the coefficients, the shrinkage it applies to a coefficient weakens as that coefficient approaches zero. Ridge regression can therefore shrink the coefficients towards zero but, in practice, does not set them exactly to zero.
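The effect of \lambda on the Ridge estimates can be sketched with the closed-form solution. The example assumes the loss RSS + \lambda \sum \beta_{j}^{2} with an unpenalized intercept, handled by centring the data. The helper name `ridge_fit` and the simulated data are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise RSS + lam * sum(beta_j^2); the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Closed form: (Xc'Xc + lam * I) beta = Xc'yc
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 + X @ np.array([1.5, -1.0, 0.5, 0.0]) + rng.normal(0.0, 1.0, 60)  # hypothetical data

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lambdas:
    _, beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda = {lam:7.1f}: beta = {np.round(beta, 4)}, norm = {norms[-1]:.4f}")
```

The overall size of the coefficient vector shrinks steadily as \lambda grows, yet the individual coefficients stay non-zero even for large penalties.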

Multicollinearity

Multicollinearity, or collinearity, is the existence of near-linear relationships among the independent variables in a dataset. Ridge regression is useful for multiple regression data that suffer from multicollinearity. With multicollinear data, the least squares estimates are unbiased but have large variances. Ridge regression introduces some bias to reduce the variance of the estimates and make them more reliable.
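A small simulation can illustrate this trade of bias for variance. The sketch below assumes two predictors that are nearly copies of each other, keeps the inputs fixed, and refits least squares and Ridge on many noisy versions of the response. The value of `lam`, the noise levels and the true coefficients are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, n_repeats = 50, 1.0, 500

x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)               # centred, so no intercept is needed
true_beta = np.array([1.0, 1.0])     # hypothetical coefficients

ols_fits, ridge_fits = [], []
for _ in range(n_repeats):
    y = X @ true_beta + rng.normal(0.0, 1.0, n)
    y = y - y.mean()
    ols_fits.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_fits.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))

ols_sd = np.std(ols_fits, axis=0)
ridge_sd = np.std(ridge_fits, axis=0)

print("correlation of x1 and x2:", round(float(np.corrcoef(x1, x2)[0, 1]), 4))
print("OLS   mean:", np.round(np.mean(ols_fits, axis=0), 3), "sd:", np.round(ols_sd, 3))
print("Ridge mean:", np.round(np.mean(ridge_fits, axis=0), 3), "sd:", np.round(ridge_sd, 3))
```

The least squares estimates are centred on the true values but swing widely from one dataset to the next. The Ridge estimates are pulled away from the truth on average, which is the bias, but are far more stable.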

Use cases for Ridge Regression

  • Ridge regression is beneficial when the number of predictors, p, is larger than the number of observations, n.
  • Ridge regression is very effective for coefficient estimation where there is a high degree of multicollinearity.

Limitations of Ridge Regression

  • Ridge regression decreases model complexity but does not reduce the number of features, as no coefficient is set to zero as \lambda increases. Therefore this type of regression is not suitable for feature selection.
  • It does not produce sparse models, as all variables are included in the final model.

Key Differences Between Lasso and Ridge Regression

We can use Lasso regression to eliminate coefficients, whereas Ridge regression only shrinks them. Lasso is useful for feature selection, as the variables whose coefficients go to zero are the ones we can drop. Ridge regression is useful when we have collinear or codependent features. Codependence tends to increase coefficient variance, which reduces the model’s generalizability. Ridge regression introduces bias, reducing the variance of the coefficient estimates to counteract the effect of codependencies.

We can also view LASSO and Ridge regression as constrained optimization problems, in which we minimize the RSS subject to a bound on the size of the coefficients. For LASSO, the sum of the absolute values of the coefficients must be less than or equal to a constant s. For Ridge regression, the sum of the squared coefficients must be less than or equal to s:

\sum_{j=1}^{p} |\beta_{j}| \leq s \quad \text{(LASSO)}

\sum_{j=1}^{p} \beta_{j}^{2} \leq s \quad \text{(Ridge)}

We refer to these equations as constraint functions. Let’s simplify the equations and only consider two parameters, \beta_{1} and \beta_{2}. With the above definitions, we can express the LASSO constraint as:

|\beta_{1}| + |\beta_{2}| \leq s
Constraint function for LASSO regression

This equation implies the LASSO coefficient estimates minimize the loss function for all points that lie within the bounds of the diamond given by |\beta_{1}| + |\beta_{2}| \leq s.

We can express Ridge regression as:

\beta_{1}^{2} + \beta_{2}^{2} \leq s
Constraint function for Ridge regression

The Ridge coefficient estimates minimize the loss function for all points that lie within the bounds of the circle given by \beta_{1}^{2} + \beta_{2}^{2} \leq s. To help solidify the idea, we can look at the visual representation of these equations:

LASSO and Ridge constraint function with RSS ellipse intersection. Modified from source

The green areas in the figure describe the constraint regions for LASSO on the left and Ridge on the right. The red contours describe the RSS: all points on a given ellipse share the same RSS value. If s is very large, the green region expands until it contains the center of the ellipses. In this case, the coefficient estimates for both regression techniques are equal to the least squares estimates. In the case shown in the figure, the LASSO and Ridge coefficient estimates are given by the first point where an ellipse meets the constraint region. Because the Ridge constraint is a circle with no corners, the intersection will not generally occur on an axis, and therefore the coefficient estimates will be non-zero. The LASSO constraint has corners on each of the axes, so the ellipse will often meet it at a corner, where one of the coefficients equals zero. If we expand to more than two dimensions, multiple coefficient estimates can equal zero simultaneously.
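The difference between the corners of the diamond and the smooth circle also shows up in a special case that can be solved by hand. Assuming the predictors are orthonormal and there is no intercept (a simplification that is not part of the general setting above), the penalized losses RSS + \lambda \sum |\beta_{j}| and RSS + \lambda \sum \beta_{j}^{2} separate into one small problem per coefficient. LASSO then subtracts a fixed amount from each least squares estimate and clips at zero, while Ridge rescales each estimate by a constant factor. The input values below are hypothetical.

```python
def lasso_orthonormal(b_ols, lam):
    """LASSO estimate for one coefficient under orthonormal predictors."""
    shrink = lam / 2.0
    if abs(b_ols) <= shrink:
        return 0.0  # small estimates are set exactly to zero
    return b_ols - shrink if b_ols > 0 else b_ols + shrink

def ridge_orthonormal(b_ols, lam):
    """Ridge estimate for one coefficient under orthonormal predictors."""
    return b_ols / (1.0 + lam)  # proportional shrinkage, never exactly zero

ols_estimates = [3.0, -0.4, 1.2, 0.1]  # hypothetical least squares estimates
lam = 1.0

lasso_estimates = [lasso_orthonormal(b, lam) for b in ols_estimates]
ridge_estimates = [ridge_orthonormal(b, lam) for b in ols_estimates]

for b, l, r in zip(ols_estimates, lasso_estimates, ridge_estimates):
    print(f"OLS {b:5.2f} -> LASSO {l:5.2f}, Ridge {r:5.3f}")
```

The two small estimates are removed entirely by LASSO, whereas Ridge halves every estimate and keeps all four in the model.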

Elastic Net Regularization

Elastic net regularization linearly combines the L1 and L2 penalty terms from the LASSO and Ridge regression methods, aiming for the best of both worlds.

Let’s consider the p >> n, or high-dimensional, case. The LASSO regression method selects at most n variables before it saturates. Also, if there is high multicollinearity, LASSO tends to choose one variable and ignore the others regardless of relevance. Elastic net adds a quadratic part to the penalty, which, when used alone, is the Ridge regression penalty. The Elastic net loss function is as follows:

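\sum_{i=1}^{n} \left( y_{i} - \beta_{0} - \sum_{j=1}^{p} \beta_{j}x_{ij} \right)^{2} + \lambda \left[ (1 - \alpha) \sum_{j=1}^{p} \beta_{j}^{2} + \alpha \sum_{j=1}^{p} |\beta_{j}| \right]
The loss function for Elastic Net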

Where \alpha is the mixing parameter between Ridge (\alpha = 0) and LASSO (\alpha = 1). Elastic net is a weighted combination of LASSO and Ridge regression, with two tuning parameters, \lambda and \alpha. The quadratic penalty term makes the loss function strongly convex. We can see this visually in the simplified constraint function figure below.

Constraint functions for Ridge, LASSO and Elastic Net
Constraint functions for Ridge, LASSO and Elastic Net for two parameters. Source: Me

When we look at the constraint functions for all three regression methods, we can see that the Elastic net falls in between Ridge and Lasso. The Elastic net constraint function also exhibits singularities at the vertices, which is important for creating sparse final models. The convexity of the Elastic net constraint function depends on the tuning parameter \alpha. Convexity is also dependent on the correlation of the variables selected. The higher the correlation of variables, the higher the number of variables included in the model. Elastic net is appropriate for group variable selection in data with highly correlated independent variables. If one of the variables is selected, we also include the group of correlated variables. LASSO would otherwise ignore these correlated variables.
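The role of the two tuning parameters can be explored with a coordinate descent sketch. It assumes the Elastic net loss is written as RSS + \lambda [(1 - \alpha) \sum \beta_{j}^{2} + \alpha \sum |\beta_{j}|] with an unpenalized intercept, so that \alpha = 0 gives Ridge and \alpha = 1 gives LASSO. The function name `elastic_net_cd` and the simulated data, which contain a pair of strongly correlated predictors, are illustrative.

```python
import numpy as np

def elastic_net_cd(X, y, lam, alpha, n_iter=1000):
    """Coordinate descent for RSS + lam * ((1 - alpha) * sum(b^2) + alpha * sum(|b|))."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            partial_residual = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_residual
            # L1 part: soft-threshold rho. L2 part: extra term in the denominator.
            numerator = np.sign(rho) * max(abs(rho) - lam * alpha / 2.0, 0.0)
            beta[j] = numerator / (col_sq[j] + lam * (1.0 - alpha))
    return beta

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # strongly correlated with x1
x3 = rng.normal(size=n)             # unrelated to the response
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + rng.normal(0.0, 1.0, n)  # hypothetical response

lam = 10.0
for alpha in (1.0, 0.5, 0.0):
    print(f"alpha = {alpha}: beta = {np.round(elastic_net_cd(X, y, lam, alpha), 3)}")
```

Comparing the coefficients of the two correlated predictors across the values of \alpha gives a feel for how the quadratic part of the penalty encourages correlated variables to share the weight instead of competing for it.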

Other Ways to Prevent Overfitting

Several ways to prevent overfitting do not involve the loss function.

Cross-validation

  • Cross-validation involves splitting the initial training data to generate multiple mini train-test splits.
  • We use each of these splits to tune the model.
  • In k-fold cross-validation, we partition the data into k subsets, called folds, then iteratively train the algorithm on k-1 folds. We use the remaining fold, which is called the holdout fold, as the test set.
  • Cross-validation ensures we tune the model using only the training dataset and do not expose the model to the unseen test data.
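The k-fold procedure described in the list above can be sketched as follows, here used to pick the penalty \lambda for a Ridge model. The helpers `ridge_fit`, `kfold_indices` and `cv_error`, the candidate values of \lambda and the simulated data are assumptions made for the example. A separate test set, not shown, would still be held back for the final evaluation.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise RSS + lam * sum(beta_j^2); the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

def kfold_indices(n, k, rng):
    """Shuffle the row indices 0..n-1 and split them into k folds."""
    return np.array_split(rng.permutation(n), k)

def cv_error(X, y, lam, folds):
    """Average holdout mean squared error over all folds."""
    errors = []
    for i, holdout in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta0, beta = ridge_fit(X[train], y[train], lam)
        pred = beta0 + X[holdout] @ beta
        errors.append(np.mean((y[holdout] - pred) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
n, p = 80, 10
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 1.0, n)  # hypothetical training data

folds = kfold_indices(n, 5, rng)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = [cv_error(X, y, lam, folds) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_errors))]

for lam, err in zip(lambdas, cv_errors):
    print(f"lambda = {lam:6.2f}: CV error = {err:.4f}")
print("selected lambda:", best_lambda)
```

Every observation serves as holdout data exactly once, so the averaged error uses all of the training data without ever scoring a model on points it was fitted to.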

Use More Data

  • Using more data does not work every time, but the more data we expose a model to, the more we force it to generalize to obtain results.
  • This method is expensive, as it involves more data selection and cleaning.
  • You can use data augmentation, which involves making data points slightly different during training. This process makes the data appear unique during training and prevents the model from learning all the characteristics of the original dataset.

Ensemble Learning

  • Ensemble learning involves combining the predictions from multiple individually trained models.
  • The two most common ensembling methods are boosting and bagging.
  • Boosting combines simple base models sequentially to increase their complexity. We can describe this as a group of weak learners combined to make a strong learner.
  • Bagging involves training strong learners in parallel and combining them to optimize their predictions.
  • Bagging attempts to reduce the chance of overfitting complex models: combining the strong learners helps smooth out, or simplify, their predictions.
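As a rough illustration of bagging, the sketch below trains many flexible polynomial models, each on a bootstrap resample of the same training set, and averages their predictions. The choice of polynomials as the base learner, the degree, the number of models and the toy data are assumptions made for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, 40)
y_train = np.sin(3.0 * x_train) + rng.normal(0.0, 0.3, 40)  # hypothetical noisy data
x_test = np.linspace(-0.9, 0.9, 100)
y_test = np.sin(3.0 * x_test)                               # noise-free target for scoring

n_models, degree = 50, 8
all_preds = np.empty((n_models, x_test.size))
for m in range(n_models):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, x_train.size, x_train.size)
    model = Polynomial.fit(x_train[idx], y_train[idx], degree)
    all_preds[m] = model(x_test)

bagged_pred = all_preds.mean(axis=0)  # combine the models by averaging
individual_mse = np.mean((all_preds - y_test) ** 2, axis=1)
bagged_mse = float(np.mean((bagged_pred - y_test) ** 2))

print(f"average MSE of the individual models: {individual_mse.mean():.4f}")
print(f"MSE of the bagged prediction:         {bagged_mse:.4f}")
```

For squared error, the averaged prediction can never do worse than the average of the individual models, and the gap is larger when the individual fits disagree more. This is the smoothing effect described above.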

Summary

Congratulations on reading through this article! To summarise, regularization of linear models can significantly reduce the variance of the model without heavily biasing it. Reducing the variance of a model makes it more generalizable to unseen data. Both LASSO and Ridge introduce a penalty term to the loss function to shrink the coefficient estimates. LASSO regression, or L1 regularization, allows us to eliminate parameters and produce simpler, sparser final models. Ridge regression performs coefficient shrinkage but not elimination and is suited for data with a high degree of multicollinearity. We can use the tuning parameter \lambda to reduce the values of the coefficient estimates in the modified loss function to prevent overfitting. Up to a point, an increase in \lambda is beneficial, but beyond that point the model will have too much bias and start to underfit. Therefore, selecting the value of \lambda is crucial for effective regularization. We can use cross-validation to find the penalty term size that gives the model’s best fit.

Both LASSO and Ridge have their limitations. Elastic net is a weighted combination of both regression methods that gives us the best of both worlds: the ability to produce sparse models and to perform group variable selection in multicollinear data.

For linear regression in Python, including Ridge, LASSO and Elastic Net, you can use the scikit-learn library. The R package for implementing regularized linear models is glmnet. To tune the Elastic Net in R, you can use caret.
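A minimal scikit-learn sketch of the three models might look like the following. The data is simulated and the penalty values are arbitrary assumptions. Note that scikit-learn names the penalty strength `alpha`, which plays the role of \lambda in this article, and names the L1/L2 mixing weight `l1_ratio`, which plays the role of \alpha. Its `Lasso` and `ElasticNet` also scale the squared-error term by 1/(2n), so numerical penalty values are not directly interchangeable with the formulas above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Hypothetical data: only the first two predictors influence the response.
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 1.0, 100)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.5),
    "elastic net": ElasticNet(alpha=0.5, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name:12s} intercept = {model.intercept_:.3f}, "
          f"coefficients = {np.round(model.coef_, 3)}")
```

In practice, the penalty strength would be chosen by cross-validation rather than fixed by hand, for example with the `RidgeCV`, `LassoCV` and `ElasticNetCV` estimators from the same module.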

To learn more about regularization for linear and non-linear models, go to the online courses page for Machine Learning.

For more discussion on the bias-variance trade-off and linear regression, you can select one or more of the books I discuss in my blog post titled “The Best Books For Machine Learning for Both Beginners and Experts”.

Have fun and happy researching!