Previously, we discussed the overfitting problem, where a model overlearns the training data and fails to generalize. This leaves us in a tricky spot. First, overfitting is obviously bad because it makes our model… bad, and should be avoided. However, as we saw in our small experiment, a linear model could not handle the nonlinearity and underfitted, while a quadratic model overfitted. There is no 1.5-degree polynomial model for us to use here (well, there are options, but they are manual and hard to automate). So, do we have to choose between underfitting and overfitting? No, we do not! There are ways to let models reach the complexity needed to deal with nonlinearity while not overfitting the data. In this post, we will discuss one such method: model regularization.
So what causes overfitting, really?
As we learned in the last post, what happens during model training is that the optimization process tries to determine the specific values of the model coefficients that yield the lowest training MSE. For example, with a linear model $y = ax + b$, SKLearn tries to find $a$ and $b$ so that the training MSE is as low as possible. As the polynomial degree grows, we have more coefficients: a quadratic model has three, $y = a_0 + a_1x + a_2x^2$, and a cubic model four, $y = a_0 + a_1x + a_2x^2 + a_3x^3$. The number of coefficients grows even more rapidly the more features we have.
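To get a feel for how quickly this grows, here is a minimal sketch that counts the terms PolynomialFeatures would generate at a few degrees; the choice of 6 input features is just an assumption to mirror the numeric columns we use later.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# dummy data with 6 columns; only the number of columns matters here
X_demo = np.zeros((1, 6))
for degree in [1, 2, 3, 4]:
    n_terms = PolynomialFeatures(degree=degree).fit_transform(X_demo).shape[1]
    print(degree, n_terms)

With 6 features, this should print 7, 28, 84, and 210 terms for degrees 1 through 4, and every one of those terms gets its own coefficient for the optimizer to play with.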
Now, we also discussed how having more complexity means that our models become more flexible. At some point, they become too flexible and start fitting fake patterns just so the training MSE is the lowest. Below is one example. On the same data, a linear model fits a straight line, a polynomial model at degree 10 fits a very odd curve, and at degree 20, the model straight up creates imaginary patterns by just connecting some instances.
Figures: fitted curves on the same data from a linear model, a degree-10 polynomial, and a degree-20 polynomial.
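If you would like to recreate this kind of picture yourself, here is a minimal sketch on made-up one-dimensional data; the synthetic sine-plus-noise sample is my own assumption and not the data behind the figures above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# a small, noisy, made-up sample: a sine curve plus noise
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 25)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=25)

x_grid = np.linspace(0, 5, 300).reshape(-1, 1)
plt.scatter(x, y, color='black')
for degree in [1, 10, 20]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x, y)
    plt.plot(x_grid, model.predict(x_grid), label=f'degree {degree}')
plt.ylim(-2, 2)   # keep the wild high-degree curves from stretching the plot
plt.legend()
plt.show()

With so few points, the degree-20 curve tends to swing wildly between the instances, which is exactly the kind of imaginary pattern we want to avoid.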
So, the root of the problem here is that the model is given too much freedom to fit anything it sees in the training data. And what is the solution for too much freedom? Of course, take some of it away! Sorry, bad joke, I know. But for real, a strategy to deal with overfitting is to restrain what the model can be, or, in a linear model, to restrain the coefficients from growing too extreme. This is the main idea behind model regularization.
Model regularization
In general, model regularization means controlling the model complexity while balancing it against the training error. The idea is that the model should be just complex enough. Different models have different ways of regularizing. So far, we have only learned linear regression, so let us talk about regularizing this model.
Let us have a general data set of k features (which could be independent or created from polynomials or interactions) $x_1, x_2, \dots, x_k$. A linear model fitted on this data has the equation

$y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_k x_k$
And, training this model means finding a set of values for $a_0, a_1, a_2, \dots, a_k$ so that the training MSE is the lowest. Minimizing the training MSE is called the training objective of this model, which, so far, has had no control over the values of $a_0$ to $a_k$. Incorporating some restrictions on the coefficients is actually not difficult: we can add a penalty term to the training objective
$\text{minimize} \quad \text{training MSE} + \text{penalty}$

where the penalty increases as the model becomes more complex. One way to define the penalty is $\text{penalty} = a_0^2 + a_1^2 + \dots + a_k^2$. With this penalty, when any coefficient gets very large in scale (either positive or negative), the training objective increases significantly as well. This increase prevents that set of coefficient values from yielding the minimum training objective, even if their training MSE is the lowest, so they will not be selected. The optimal solution here must balance having a small enough MSE with reasonable coefficient values. Overall, we have added constraints on the model and kept its complexity in check! The method I have just described here is a simple version of Ridge regression.
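To make the objective concrete, below is a minimal sketch that computes it by hand for one made-up candidate set of coefficients on toy data; the numbers are purely illustrative and not from our data set.

import numpy as np

# made-up toy data: one feature x1 and one candidate set of coefficients
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.2, 1.9, 3.2, 3.9])

a0, a1 = 0.2, 0.95                    # candidate intercept and slope
pred = a0 + a1 * X[:, 0]

mse = np.mean((y - pred) ** 2)
penalty = a0 ** 2 + a1 ** 2           # the simple penalty defined above
objective = mse + penalty             # what the training process would try to minimize

print(mse, penalty, objective)
# note: scikit-learn's Ridge leaves the intercept out of the penalty and scales
# the penalty by a parameter called alpha, but the idea is the same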
Ridge regression demonstration
I hope my explanation of regularization and Ridge regression is understandable. Now, let us move on to some demonstration. We will still use the auto-mpg data and see if we can really solve the previous issue now. As usual, I will hide the parts that load the data, split train-test, and build the categorical pipeline, since they are just like before. Also, you can get the complete notebook here.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = pd.read_csv('auto-mpg.csv')
train, test = train_test_split(data, test_size=0.2)
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
cat_cols = ['origin']
target = 'mpg'
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#pipeline for class features
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder())
])
Linear model
We will start by fitting a regular linear regression model for reference. It gets a CV R2 of 0.806, which we will use as our baseline for the other models.
#pipeline for numeric features
num_pipeline_linear = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('standardize', StandardScaler()),
])

#full processing pipeline
data_pipeline_linear = ColumnTransformer([
    ('numeric', num_pipeline_linear, num_cols),
    ('class', cat_pipeline, cat_cols)
])

#model pipeline
linear_reg_pipeline = Pipeline([
    ('processing', data_pipeline_linear),
    ('modeling', LinearRegression())
])
from sklearn.model_selection import cross_val_score

r2_10cv = cross_val_score(linear_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8060313624814313
Cubic model
Remember how the cubic model was really struggling with overfitting? I will redo it here to get the CV R2 on the current split, which is 0.728, significantly lower than the baseline of 0.806 and obviously overfitting.
#pipeline for numeric features
num_pipeline_poly3 = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('polynomial', PolynomialFeatures(degree=3)),
    ('standardize', StandardScaler()),
])

#full processing pipeline
data_pipeline_poly3 = ColumnTransformer([
    ('numeric', num_pipeline_poly3, num_cols),
    ('class', cat_pipeline, cat_cols)
])

#model pipeline
poly3_reg_pipeline = Pipeline([
    ('processing', data_pipeline_poly3),
    ('modeling', LinearRegression())
])
poly3_reg_pipeline.fit(train, train[[target]])
Pipeline(steps=[('processing', ColumnTransformer(transformers=[('numeric', Pipeline(steps=[('impute', SimpleImputer(strategy='median')), ('polynomial', PolynomialFeatures(degree=3)), ('standardize', StandardScaler())]), ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']), ('class', Pipeline(steps=[('encoder', OneHotEncoder())]), ['origin'])])), ('modeling', LinearRegression())])
from sklearn.model_selection import cross_val_score
r2_10cv = cross_val_score(poly3_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.7284627674958539
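As an extra check that is not in the original run, we can compare the training R2 of the fitted cubic pipeline against the CV R2 above; a training score far above the CV score is the classic symptom of overfitting.

# training R2 of the already-fitted cubic pipeline, to compare with the CV R2 above
train_r2 = poly3_reg_pipeline.score(train, train[[target]])
print(train_r2)   # typically much higher than the ~0.73 CV R2, the classic sign of overfitting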
Now, we will try using Ridge regression on the same cubic-transformed data. This time, we use the Ridge model class instead of LinearRegression and add it on top of the processing pipeline. Are you amazed by the result? We now get a CV R2 of 0.853, way higher than the baseline. The model no longer overfits and can even handle the nonlinearity now!
from sklearn.linear_model import Ridge
ridge_reg_pipeline = Pipeline([
    ('processing', data_pipeline_poly3),
    ('modeling', Ridge())
])
r2_10cv = cross_val_score(ridge_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8531922643781689
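To connect this back to the idea of keeping coefficients from growing too extreme, the sketch below (an extra check, not part of the original notebook) fits both the plain and the Ridge cubic pipelines on the training set and compares the size of their learned coefficients. The exact numbers will depend on your train-test split, but the plain linear regression coefficients should come out dramatically larger.

# fit both pipelines on the training set and compare the learned coefficients
poly3_reg_pipeline.fit(train, train[[target]])
ridge_reg_pipeline.fit(train, train[[target]])

lin_coefs = poly3_reg_pipeline.named_steps['modeling'].coef_
ridge_coefs = ridge_reg_pipeline.named_steps['modeling'].coef_

# largest coefficient (in absolute value) and the penalty term for each model
print(np.abs(lin_coefs).max(), np.abs(ridge_coefs).max())
print(np.sum(lin_coefs ** 2), np.sum(ridge_coefs ** 2))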
Quadratic and bi-quadratic models
Just for a sanity check, let us try Ridge regression on the quadratic and bi-quadratic data. The code is identical except for the polynomial degree and variable names, so it is hidden. Anyway, the quadratic model gets a CV R2 of 0.852, and the bi-quadratic one 0.856.
num_pipeline_poly2 = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('polynomial', PolynomialFeatures(degree=2)),
    ('standardize', StandardScaler()),
])

data_pipeline_poly2 = ColumnTransformer([
    ('numeric', num_pipeline_poly2, num_cols),
    ('class', cat_pipeline, cat_cols)
])

ridge_reg_pipeline_2 = Pipeline([
    ('processing', data_pipeline_poly2),
    ('modeling', Ridge())
])
r2_10cv = cross_val_score(ridge_reg_pipeline_2, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8518732915328598
num_pipeline_poly4 = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('polynomial', PolynomialFeatures(degree=4)),
    ('standardize', StandardScaler()),
])

data_pipeline_poly4 = ColumnTransformer([
    ('numeric', num_pipeline_poly4, num_cols),
    ('class', cat_pipeline, cat_cols)
])

ridge_reg_pipeline_4 = Pipeline([
    ('processing', data_pipeline_poly4),
    ('modeling', Ridge())
])
r2_10cv = cross_val_score(ridge_reg_pipeline_4, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8556905493636462
Which model should we use?
While the CV R2 seems to increase with the polynomial degree, the differences are fairly negligible. Plus, they could be caused purely by the different CV splits that SKLearn performed. Given the tradeoff in model and data complexity, I would totally stay with the quadratic data and model. For your reference, the quadratic data has 31 features, the cubic data 87 features, and the bi-quadratic… 213 features! Below is the summary table for your convenience, followed by a quick way to check those feature counts yourself.
|              | Linear | Quadratic | Cubic | bi-Quad |
|--------------|--------|-----------|-------|---------|
| CV R2        | 0.806  | 0.852     | 0.853 | 0.856   |
| No. features | 9      | 31        | 87    | 213     |
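If you want to double-check those feature counts, one way (assuming the pipeline variables defined above are still in memory) is to fit each processing pipeline and look at the width of the transformed training data.

# number of columns produced by each processing pipeline; should match the table above
for name, pipe in [('linear', data_pipeline_linear),
                   ('quadratic', data_pipeline_poly2),
                   ('cubic', data_pipeline_poly3),
                   ('bi-quadratic', data_pipeline_poly4)]:
    print(name, pipe.fit_transform(train).shape[1])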
Conclusion
In this post, we have learned more about the cause of overfitting and got to know one solution for it: model regularization. Regularization is a very important technique in data analytics and machine learning, so please take your time to understand the concept. Furthermore, we have talked about and showcased one simplified version of Ridge regression. However, there is more to this model than that, so we will continue in the next post. Until next time!