So, here I am, ready to fulfill the promise from the last post. Do you remember? We performed quadratic regression on the auto-mpg data and got a much better R² than a regular linear model. So, we excitedly tried cubic and bi-quadratic models, and their R²'s… let us just say they were nowhere near as good. What happened back then? The issue we ran into is called the overfitting problem: a model learns patterns that are too specific to its training data and fails to generalize to new data. So, let us wait no longer and jump in!
Model complexity and overfitting
Let us start with the simple linear regression case, where the model equation is y = ax + b. Training this model means estimating the coefficients a and b. So far, we have just used SKLearn for that, so let us talk a bit more about how the training actually happens. Very roughly speaking, the optimization software (SKLearn in this case) tries to determine the specific values of a and b that yield the lowest MSE on the training data.
Illustrative example
For example, with the data in the table below, a and b could be anything. Let us try a few values:
– a=1 and b=4 leads to an MSE of 2
– a=2 and b=3 gets an MSE of about 8.33
– lastly, a=0.21 and b=7.29 gives about 0.6
So, the last pair of values for a and b gets the lowest training MSE and is selected for the model equation. It is actually the best a linear model can do on this data; the quick check after the table confirms this.
| x | y |
|---|---|
| 5 | 8 |
| 3 | 9 |
| 2 | 7 |
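(A quick check of my own, not part of the original post: the short numpy sketch below computes the training MSE for each candidate pair and confirms that the least-squares fit lands at roughly a=0.21 and b=7.29.)

import numpy as np
#the three training points from the table above
x = np.array([5, 3, 2])
y = np.array([8, 9, 7])
#mean squared error of the line y = a*x + b on the training data
def mse(a, b):
    return np.mean((y - (a * x + b)) ** 2)
print(mse(1, 4))        #2.0
print(mse(2, 3))        #about 8.33
print(mse(0.21, 7.29))  #about 0.6
#least-squares fit: polyfit returns the slope first, then the intercept
a_best, b_best = np.polyfit(x, y, deg=1)
print(a_best, b_best)   #about 0.214 and 7.286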
So now, the interesting issue occurs. Let us use not a linear model but a quadratic model, y = a₀ + a₁x + a₂x². This is actually equivalent to fitting a linear model on data with two features, x and x², as shown below. With this data, we can find the model y = -2 + 6.17x - 0.83x², which yields an MSE of 0, meaning it predicts the training data perfectly! But why is that? The reason is that, with the squared term, our model can now represent a curve that passes exactly through all three data points. Below are the expanded data and the scatter plots, with a quick numeric check after the table:
| x | x² | y |
|---|----|---|
| 5 | 25 | 8 |
| 3 | 9  | 9 |
| 2 | 4  | 7 |
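(Again, a quick check of my own rather than part of the original post: fitting the quadratic on the three points recovers the coefficients above and a training MSE of zero.)

import numpy as np
x = np.array([5, 3, 2])
y = np.array([8, 9, 7])
#three points and three coefficients: the parabola can pass through every point exactly
a2, a1, a0 = np.polyfit(x, y, deg=2)   #polyfit returns the highest degree first
print(a0, a1, a2)                      #about -2, 6.17, -0.83
y_pred = a0 + a1 * x + a2 * x ** 2
print(np.mean((y - y_pred) ** 2))      #about 0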
The overfitting problem
As it turns out, the more complicated a model is, the better it can fit its training data. However, this is not always a good thing. A model with too much representational capacity can start to “imagine” patterns that are not really there in the data. It then fits these spurious patterns very closely and gets a very small MSE on its training data. However, by doing that, the model loses its generalization capability and fails on new data it has not seen. This issue is called the overfitting problem.
Let us take a look at the two scatter plots below. On the left, we have a linear model. Only able to fit a straight line to the training data, it does its best and generalizes reasonably well to the new data. The model on the right, however, is a polynomial of degree 5 (hence the four “bumps”). It fits the training data much better than the linear one but totally fails to adapt to the new data. In this case, the polynomial model tries to fit spurious patterns coming from noise, not from the true relationship between the target and the feature, which leads it to overfit the training data.
(Left) linear model. (Right) degree-5 polynomial model.
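(The two plots above are illustrative. If you would like to reproduce something similar, here is a minimal sketch on made-up synthetic data; the true relationship, the noise level, and the sample sizes below are my own assumptions, not taken from the original figures.)

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.default_rng(0)
#a noisy linear relationship: the true pattern is a straight line plus noise
x_train = rng.uniform(0, 10, size=(10, 1))
y_train = 2 * x_train.ravel() + 1 + rng.normal(scale=3, size=10)
x_test = rng.uniform(0, 10, size=(10, 1))
y_test = 2 * x_test.ravel() + 1 + rng.normal(scale=3, size=10)
for degree in (1, 5):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_train, y_train)
    #the degree-5 model fits the training data better but usually does much worse on the test data
    print(degree,
          r2_score(y_train, model.predict(x_train)),
          r2_score(y_test, model.predict(x_test)))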
Demonstration
Let us train a few models to demonstrate the overfitting problem. We are still investigating the auto-mpg data, so the loading step and the first part of our pipeline are the same. Please refer to the previous post if you want to review the data and exploratory analysis. You can also download the complete notebook here. For convenience, I keep the categorical pipeline here since we will only change the numeric pipeline.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = pd.read_csv('auto-mpg.csv')
train, test = train_test_split(data, test_size=0.2)
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
cat_cols = ['origin']
target = 'mpg'
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#pipeline for class features
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder())
])
Linear model
First, let us fit a linear model. The numeric pipeline only consists of imputation and standardization in this case. Instead of using cross-validation, we will get the R² from the training and the testing data to observe how they change with the complexity of the models. The linear model gets a training R² of 0.836 and a testing R² of 0.776.
#pipeline for numeric features
num_pipeline_linear = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('standardize', StandardScaler())
])
#full data pipeline
data_pipeline_linear = ColumnTransformer([
    ('numeric', num_pipeline_linear, num_cols),
    ('class', cat_pipeline, cat_cols)
])
#model pipeline
linear_reg_pipeline = Pipeline([
    ('processing', data_pipeline_linear),
    ('modeling', LinearRegression())
])
linear_reg_pipeline.fit(train, train[[target]])
print('training R2:', linear_reg_pipeline.score(train, train[[target]]))
print('testing R2:', linear_reg_pipeline.score(test, test[[target]]))
training R2: 0.835723005971064 testing R2: 0.7760959508464778
Quadratic model
We repeat most of the code for the linear model, only adding a PolynomialFeatures(degree=2) step after imputation in the numeric pipeline (and, of course, updating all the necessary variable names!). This time, our model gets a training R² of 0.892 and a testing R² of 0.812.
#pipeline for numeric features
num_pipeline_poly2 = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('quadratic features', PolynomialFeatures(degree=2)),
    ('standardize', StandardScaler())
])
#full data pipeline
data_pipeline_poly2 = ColumnTransformer([
    ('numeric', num_pipeline_poly2, num_cols),
    ('class', cat_pipeline, cat_cols)
])
#model pipeline
poly2_reg_pipeline = Pipeline([
    ('processing', data_pipeline_poly2),
    ('modeling', LinearRegression())
])
poly2_reg_pipeline.fit(train, train[[target]])
print('training R2:', poly2_reg_pipeline.score(train, train[[target]]))
print('testing R2:', poly2_reg_pipeline.score(test, test[[target]]))
training R2: 0.8920039412387385 testing R2: 0.8115434439880399
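(A side note of my own on why higher degrees overfit so easily: the number of features that PolynomialFeatures generates, and therefore the number of coefficients the model can tune, grows very quickly with the degree. A quick check on a dummy array with the same six numeric columns:)

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
dummy = np.zeros((1, 6))  #six numeric columns, like our num_cols
for degree in (1, 2, 3, 4):
    n_features = PolynomialFeatures(degree=degree).fit_transform(dummy).shape[1]
    print(degree, n_features)  #7, 28, 84, 210 (including the bias column)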
Cubic and bi-quadratic models
These two are exactly like the quadratic model, so I am hiding the code; just expand it if you want to take a look. In terms of results, the cubic model gets a training R² of 0.935 and a testing R² of 0.553, and the bi-quadratic model gets 0.959 and… -411.12…
#pipeline for numeric features
num_pipeline_poly3 = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('cubic features', PolynomialFeatures(degree=3)),
    ('standardize', StandardScaler())
])
#full data pipeline
data_pipeline_poly3 = ColumnTransformer([
    ('numeric', num_pipeline_poly3, num_cols),
    ('class', cat_pipeline, cat_cols)
])
#model pipeline
poly3_reg_pipeline = Pipeline([
    ('processing', data_pipeline_poly3),
    ('modeling', LinearRegression())
])
R²'s of the cubic model:
poly3_reg_pipeline.fit(train, train[[target]])
print('training R2:', poly3_reg_pipeline.score(train, train[[target]]))
print('testing R2:', poly3_reg_pipeline.score(test, test[[target]]))
training R2: 0.9353098210465611 testing R2: 0.5525286388346851
#pipeline for numeric features
num_pipeline_poly4 = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('bi-quadratic features', PolynomialFeatures(degree=4)),
    ('standardize', StandardScaler())
])
#full data pipeline
data_pipeline_poly4 = ColumnTransformer([
    ('numeric', num_pipeline_poly4, num_cols),
    ('class', cat_pipeline, cat_cols)
])
#model pipeline
poly4_reg_pipeline = Pipeline([
    ('processing', data_pipeline_poly4),
    ('modeling', LinearRegression())
])
R²'s of the bi-quadratic model:
poly4_reg_pipeline.fit(train, train[[target]])
print('training R2:', poly4_reg_pipeline.score(train, train[[target]]))
print('testing R2:', poly4_reg_pipeline.score(test, test[[target]]))
training R2: 0.9586116405583816 testing R2: -411.12073625815236
Result discussion
Let us put the results in a table to see the pattern clearly. The training R² keeps increasing with model complexity, from 0.836 for the linear model to 0.959 for the degree-4 polynomial. On the other hand, the testing R² peaks quickly, at the quadratic model, and drops drastically after that. The bi-quadratic model gets a testing R² of -411.12, which means its predictions on the testing data are badly wrong (a negative R² is worse than simply predicting the mean of the target).
|             | Linear | Quadratic | Cubic | bi-Quadratic |
|-------------|--------|-----------|-------|--------------|
| training R² | 0.836  | 0.892     | 0.935 | 0.959        |
| testing R²  | 0.776  | 0.812     | 0.553 | -411.12      |
Clearly, the cubic and bi-quadratic models are overfitting. However, we can also see signs of overfitting in the quadratic model: the difference between its training R² and testing R² is quite large. By the way, this is a good way to check whether your models are overfitting. Just examine their performance on the training and the testing data; if the training performance is substantially better, your model has overfitted.
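(If you would rather not copy the pipeline four times to run this check, here is a sketch of my own that reuses the objects defined above, namely cat_pipeline, num_cols, cat_cols, train, test, and target, and loops over the degrees, printing the training and testing R² side by side. Degree 1 reproduces the linear model, apart from a redundant bias column.)

for degree in (1, 2, 3, 4):
    num_pipeline = Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('poly features', PolynomialFeatures(degree=degree)),
        ('standardize', StandardScaler())
    ])
    data_pipeline = ColumnTransformer([
        ('numeric', num_pipeline, num_cols),
        ('class', cat_pipeline, cat_cols)
    ])
    model_pipeline = Pipeline([
        ('processing', data_pipeline),
        ('modeling', LinearRegression())
    ])
    model_pipeline.fit(train, train[[target]])
    #a training R2 far above the testing R2 is the sign of overfitting
    print(degree,
          model_pipeline.score(train, train[[target]]),
          model_pipeline.score(test, test[[target]]))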
Wrapping up
The overfitting problem is actually very common in predictive analysis. Any model can overfit, not just high-degree polynomial ones. So, it is always good to check training/testing performance in addition to cross-validation results. Now, this post turned out to be much longer than I anticipated, and I even planned to discuss how to fix overfitting here. But let us take a rain check on that. See you in the next post!