The Overfitting Problem

[Figure: an illustration of the overfitting problem, where models learn patterns too specific to the training data and cannot generalize to new data]

So, let me fulfill the promise I made in the last post. Do you remember? We performed quadratic regression on the auto-mpg data and got a much better R² than with a regular linear model. So, we excitedly tried a cubic and a bi-quadratic model, and their R²'s… let us just say they were nowhere near as good. What happened back then? Well, the issue we ran into is called the overfitting problem: a model learns patterns too specific to the training data and fails to generalize to new data. So, let us wait no longer and jump in!

Model complexity and overfitting

Let us start with the simple linear regression case where we have the model equation y = ax + b. Training this model means estimating the coefficients a and b. So far, we have just used SKLearn for that, so let us talk a bit more about how the training actually happens. Very roughly speaking, the optimization software (SKLearn in this case) tries to determine the specific values of a and b so that the model yields the lowest MSE on the training data.

Illustrative example

For example, consider the small dataset below:

x   y
5   8
3   9
2   7

With only three points, a and b could be anything, so let us try a few values
– A model with a=1 and b=4 leads to an MSE of 2
– Another with a=2 and b=3 gets an MSE of about 8.33
– Lastly, a=0.21 and b=7.29 gives about 0.6
So, the last pair of values for a and b gets the lowest training MSE and is selected for the model equation. It is actually the best a linear model can do on this data, as the quick check below confirms.
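Here is that quick check (my own code, not from the original notebook):

```python
# Compute the training MSE of a few candidate lines y = a*x + b
import numpy as np

x = np.array([5, 3, 2])
y = np.array([8, 9, 7])

for a, b in [(1, 4), (2, 3), (0.21, 7.29)]:
    mse = np.mean((y - (a * x + b)) ** 2)
    print(f"a={a}, b={b}: MSE = {mse:.2f}")  # prints 2.00, 8.33, 0.60
```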

So now, the interesting issue occurs. Let us use not a linear model but a quadratic model y = a₀ + a₁x + a₂x². This is actually equivalent to fitting a linear model on data with two features, x and x², as below:

x   x²   y
5   25   8
3    9   9
2    4   7

With this data, we can find the model y = -2 + 6.17x - 0.83x², which yields an MSE of 0, meaning it makes perfect predictions! But why is that? The reason is that, with the squared term, our model can now represent a curve that passes exactly through all three data points. Below are the scatter plots; a quick sketch reproducing the fit follows them.

[Figure: an illustration of a good model]
[Figure: an illustration of an overfitting model]
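Here is that sketch, fitting the quadratic on the toy data with scikit-learn (a quick illustration, not part of the auto-mpg notebook):

```python
# Fit y = a0 + a1*x + a2*x^2 on the three toy points
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[5.0], [3.0], [2.0]])
y = np.array([8.0, 9.0, 7.0])

X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)  # columns: x, x^2
model = LinearRegression().fit(X2, y)
print(model.intercept_, model.coef_)          # about -2.0 and [6.17, -0.83]
print(np.mean((y - model.predict(X2)) ** 2))  # essentially 0
```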

The overfitting problem

As it turns out, the more complex a model is, the better it can fit its training data. However, this is not always a good thing. A model with too much representational capacity can start to "imagine" patterns that are not really there in the data. It then trains itself to fit these fake patterns very closely and gets a very small MSE on its training data. However, by doing so, the model loses its generalization capability and fails to handle any new data that it has not seen. This issue is called the overfitting problem.

Let us take a look at the two scatter plots below. On the left, we have a linear model. Only able to fit a straight line to the training data, it does its best and generalizes okay to the new data. The model on the right, however, is a polynomial of degree 5 (hence the four "bumps"). It fits the training data much better than the linear one but totally fails to adapt to the new data. In this case, the polynomial model tries to fit fake patterns that come from noise, not from the true relationship between the target and the feature, which leads to it overfitting the training data. After the two figures, you will find a small sketch that reproduces this effect.

[Figure: the linear model, an illustration of a good model]

[Figure: the degree-5 polynomial model, an illustration of an overfitting model]
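The sketch below uses synthetic data; the dataset, noise level, and seed are my own assumptions rather than the actual data behind the figures, but the pattern it shows is typically the same: the degree-5 model scores higher on the training data and much lower on the new data.

```python
# Compare a linear fit and a degree-5 polynomial fit on noisy linear data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=(15, 1))
x_new = rng.uniform(0, 10, size=(15, 1))
y_train = 2 * x_train.ravel() + 1 + rng.normal(0, 2, size=15)  # true relationship is linear, plus noise
y_new = 2 * x_new.ravel() + 1 + rng.normal(0, 2, size=15)

for degree in (1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree,
          r2_score(y_train, model.predict(x_train)),  # training R2
          r2_score(y_new, model.predict(x_new)))      # R2 on unseen data
```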

Demonstration

Let us train a few models to demonstrate the overfitting problem. We are still investigating the auto-mpg data, so the loading and the first part of our pipeline are the same. Please refer to the previous post if you want to review the data and the exploratory analysis. You can also download the complete notebook here. For convenience, I keep the categorical pipeline here since we only change the numeric pipeline.

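A sketch of what that categorical pipeline might look like (the column name and the exact steps are my assumptions; check them against the notebook):

```python
# Assumed categorical pipeline: impute missing values, then one-hot encode
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_features = ['origin']  # assumed categorical column in auto-mpg
cat_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
```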

Linear model

First, let us fit a linear model. The numeric pipeline only consists of imputation and standardization in this case. Instead of using cross-validation, we will get the R² from the training and the testing data to observe how they change with the models' complexity. The linear model gets a training R² of 0.836 and a testing R² of 0.776.
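If you are building along, here is a sketch of the linear model's pipeline. The feature list and the X_train/X_test split are assumptions standing in for the setup from the previous post:

```python
# Numeric pipeline (impute, then standardize) combined with the
# categorical pipeline above, feeding a linear regression
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_features = ['cylinders', 'displacement', 'horsepower',
                'weight', 'acceleration', 'model year']  # assumed numeric columns

num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
preprocess = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),  # from the sketch above
])
lin_model = Pipeline([('prep', preprocess), ('reg', LinearRegression())])

# X_train, y_train, X_test, y_test: the train/test split from the previous post
lin_model.fit(X_train, y_train)
print(lin_model.score(X_train, y_train))  # training R2
print(lin_model.score(X_test, y_test))    # testing R2
```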

Quadratic model

We repeat most of the code from the linear model, only adding a PolynomialFeatures(degree=2) step after imputation in the numeric pipeline (and, of course, updating all the necessary variable names!). This time, our model gets a training R² of 0.892 and a testing R² of 0.812.
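The only change is in the numeric pipeline, which might look like this (again following the naming in my sketch above):

```python
# Numeric pipeline with a degree-2 polynomial expansion after imputation
from sklearn.preprocessing import PolynomialFeatures

num_pipeline_quad = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('poly', PolynomialFeatures(degree=2)),
    ('scale', StandardScaler()),
])
# Then rebuild the ColumnTransformer and the full pipeline exactly as before.
```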

Cubic and bi-quadratic models

These two are exactly like the quadratic model, so I am not repeating the code here; you can find it in the complete notebook. In terms of results, the cubic model gets a training R² of 0.935 and a testing R² of 0.553, and the bi-quadratic model 0.956 and… -411.12…
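If you would rather not copy the code four times, a small loop over the degrees produces all the numbers at once (a sketch reusing the names from the snippets above):

```python
# Train degree-1 through degree-4 models and print train/test R2
for degree in (1, 2, 3, 4):
    num_pipe = Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('poly', PolynomialFeatures(degree=degree)),
        ('scale', StandardScaler()),
    ])
    prep = ColumnTransformer([
        ('num', num_pipe, num_features),
        ('cat', cat_pipeline, cat_features),
    ])
    model = Pipeline([('prep', prep), ('reg', LinearRegression())])
    model.fit(X_train, y_train)
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))
```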


Results discussion

Let us put the numbers in a table to clearly see the pattern. The training R² is definitely increasing with the models' complexity, from 0.836 for the linear model to 0.956 for the degree-4 polynomial. On the other hand, the testing R² peaks very quickly at the quadratic model and drops drastically after that. The bi-quadratic model gets a testing R² of -411.12, which means its predictions on the testing data are wildly off.

              Linear   Quadratic   Cubic   Bi-quadratic
training R²    0.836     0.892     0.935      0.956
testing R²     0.776     0.812     0.553    -411.12

Clearly, the cubic and bi-quadratic models are overfitting. However, we can also see signs of overfitting in the quadratic model: the difference between its training R² and testing R² is quite large. By the way, this is a good way to check whether your models are overfitting. Just examine their performance on the training and the testing data; if the training performance is largely better, your models have overfitted.

Wrapping up

The overfitting problem is actually very common in predictive analysis. Any model can overfit, not just high-degree polynomial ones. So, it is always good to check training/testing performance besides cross-validation results. Now, this post turned out to be much longer than I anticipated, and I even planned to discuss how to fix model overfitting here. But let us take a rain check on that. See you in the next post!
