Previously, we discussed the overfitting problem, where a model overlearns the training data and fails to generalize. This leaves us in a tricky spot. First, overfitting is obviously bad because it makes our model… bad, and should be avoided. However, as we saw in our small experiment, a linear model could not handle the nonlinearity and underfitted, while a quadratic model overfitted. There is no 1.5-degree polynomial model for us to use here (well, there are options, but they are manual and hard to automate). So, do we have to choose between underfitting and overfitting? No, we do not! There are ways to let models reach the complexity needed to deal with nonlinearity while not overfitting the data. In this post, we will discuss one such method: model regularization.
So what causes overfitting, really?
As we learned in the last post, what happens during model training is that the optimization process tries to determine the specific values of the model coefficients that yield the lowest training MSE. For example, with a linear model $y = ax + b$, SKLearn tries to find $a$ and $b$ so that the training MSE is as low as possible. As the polynomial degree grows, we have more coefficients: a quadratic model has three, $y = a_0 + a_1x + a_2x^2$, and a cubic model four, $y = a_0 + a_1x + a_2x^2 + a_3x^3$. The number of coefficients grows even more rapidly the more features we have.
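To get a feel for how quickly this grows, here is a minimal sketch that counts the terms PolynomialFeatures would generate at a few degrees; the choice of 6 input features is just an assumption to mirror the numeric columns we use later.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# dummy data with 6 columns; only the number of columns matters here
X_demo = np.zeros((1, 6))
for degree in [1, 2, 3, 4]:
    n_terms = PolynomialFeatures(degree=degree).fit_transform(X_demo).shape[1]
    print(degree, n_terms)

With 6 features, this should print 7, 28, 84, and 210 terms for degrees 1 through 4, and every one of those terms gets its own coefficient for the optimizer to play with.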
Now, we also discussed how having more complexity means that our models become more flexible. At some point, they become too flexible and start fitting fake patterns just so the training MSE is the lowest. Below is one example. On the same data, a linear model fits a straight line, a polynomial model at degree 10 fits a very odd curve, and at degree 20, the model straight up creates imaginary patterns by just connecting some instances.
Figures: fitted curves on the same data from a linear model, a degree-10 polynomial, and a degree-20 polynomial.
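If you would like to recreate this kind of picture yourself, here is a minimal sketch on made-up one-dimensional data; the synthetic sine-plus-noise sample is my own assumption and not the data behind the figures above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# a small, noisy, made-up sample: a sine curve plus noise
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 25)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=25)

x_grid = np.linspace(0, 5, 300).reshape(-1, 1)
plt.scatter(x, y, color='black')
for degree in [1, 10, 20]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x, y)
    plt.plot(x_grid, model.predict(x_grid), label=f'degree {degree}')
plt.ylim(-2, 2)   # keep the wild high-degree curves from stretching the plot
plt.legend()
plt.show()

With so few points, the degree-20 curve tends to swing wildly between the instances, which is exactly the kind of imaginary pattern we want to avoid.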
So, the root of the problem here is that the model is given too much freedom to fit anything it sees in the training data. And what is the solution for too much freedom? Of course, take some of it away! Sorry, bad joke, I know. But for real, a strategy to deal with overfitting is to restrain what the model can be, or, in a linear model, to restrain the coefficients from growing too extreme. This is the main idea behind model regularization.
Model regularization
In general, model regularization means controlling the model complexity while balancing it against the training error. The idea is that the model should be just complex enough. Different models have different ways of regularizing. So far, we have only learned linear regression, so let us talk about regularizing this model.
Let us have a general data set of k features (which could be independent or created from polynomials or interactions) $x_1, x_2, \dots, x_k$. A linear model fitted on this data has the equation

$y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_k x_k$
And, training this model means finding a set of values for $a_0, a_1, a_2, \dots, a_k$ so that the training MSE is the lowest. Minimizing the training MSE is called the training objective of this model, which, so far, has had no control over the values of $a_0$ to $a_k$. Incorporating some restrictions on the coefficients is actually not difficult: we can add a penalty term to the training objective
$\text{minimize} \quad \text{training MSE} + \text{penalty}$

where the penalty increases as the model becomes more complex. One way to define the penalty is $\text{penalty} = a_0^2 + a_1^2 + \dots + a_k^2$. With this penalty, when any coefficient gets very large in scale (either positive or negative), the training objective increases significantly as well. This increase prevents that set of coefficient values from yielding the minimum training objective, even if their training MSE is the lowest, so they will not be selected. The optimal solution here must balance having a small enough MSE with reasonable coefficient values. Overall, we have added constraints on the model and kept its complexity in check! The method I have just described here is a simple version of Ridge regression.
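To make the objective concrete, below is a minimal sketch that computes it by hand for one made-up candidate set of coefficients on toy data; the numbers are purely illustrative and not from our data set.

import numpy as np

# made-up toy data: one feature x1 and one candidate set of coefficients
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.2, 1.9, 3.2, 3.9])

a0, a1 = 0.2, 0.95                    # candidate intercept and slope
pred = a0 + a1 * X[:, 0]

mse = np.mean((y - pred) ** 2)
penalty = a0 ** 2 + a1 ** 2           # the simple penalty defined above
objective = mse + penalty             # what the training process would try to minimize

print(mse, penalty, objective)
# note: scikit-learn's Ridge leaves the intercept out of the penalty and scales
# the penalty by a parameter called alpha, but the idea is the same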
Ridge regression demonstration
I hope my explanation of regularization and Ridge regression is understandable. Now, let us move on to some demonstration. We will still use the auto-mpg data and see if we can really solve the previous issue now. As usual, I will hide the parts that load the data, split train-test, and build the categorical pipeline, since they are just like before. Also, you can get the complete notebook here.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = pd.read_csv('auto-mpg.csv')
train, test = train_test_split(data, test_size=0.2)
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
cat_cols = ['origin']
target = 'mpg'
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#pipeline for class features
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder())
])
Linear model
We will start by fitting a regular linear regression model for reference. It gets a CV R2 of 0.806, which we will use as our baseline for the other models.
#pipeline for numeric features
num_pipeline_linear = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('standardize', StandardScaler()),
])

#full processing pipeline
data_pipeline_linear = ColumnTransformer([
    ('numeric', num_pipeline_linear, num_cols),
    ('class', cat_pipeline, cat_cols)
])

#model pipeline
linear_reg_pipeline = Pipeline([
    ('processing', data_pipeline_linear),
    ('modeling', LinearRegression())
])
from sklearn.model_selection import cross_val_score

r2_10cv = cross_val_score(linear_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8060313624814313
Cubic model
Remember how the cubic model was really struggling with overfitting? I will redo it here to get the CV R2 on the current split, which is 0.728, significantly lower than the baseline of 0.806 and obviously overfitting.
#pipeline for numeric features
num_pipeline_poly3 = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('polynomial', PolynomialFeatures(degree=3)),
    ('standardize', StandardScaler()),
])

#full processing pipeline
data_pipeline_poly3 = ColumnTransformer([
    ('numeric', num_pipeline_poly3, num_cols),
    ('class', cat_pipeline, cat_cols)
])

#model pipeline
poly3_reg_pipeline = Pipeline([
    ('processing', data_pipeline_poly3),
    ('modeling', LinearRegression())
])
poly3_reg_pipeline.fit(train, train[[target]])
Pipeline(steps=[('processing', ColumnTransformer(transformers=[('numeric', Pipeline(steps=[('impute', SimpleImputer(strategy='median')), ('polynomial', PolynomialFeatures(degree=3)), ('standardize', StandardScaler())]), ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']), ('class', Pipeline(steps=[('encoder', OneHotEncoder())]), ['origin'])])), ('modeling', LinearRegression())])
from sklearn.model_selection import cross_val_score
r2_10cv = cross_val_score(poly3_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.7284627674958539
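As an extra check that is not in the original run, we can compare the training R2 of the fitted cubic pipeline against the CV R2 above; a training score far above the CV score is the classic symptom of overfitting.

# training R2 of the already-fitted cubic pipeline, to compare with the CV R2 above
train_r2 = poly3_reg_pipeline.score(train, train[[target]])
print(train_r2)   # typically much higher than the ~0.73 CV R2, the classic sign of overfitting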
Now, we will try using Ridge regression on the same cubic-transformed data. This time, we use the Ridge model class instead of LinearRegression and add it on top of the processing pipeline. Are you amazed by the result? We now get a CV R2 of 0.853, way higher than the baseline. The model no longer overfits and can even handle the nonlinearity now!
from sklearn.linear_model import Ridge
ridge_reg_pipeline = Pipeline([
    ('processing', data_pipeline_poly3),
    ('modeling', Ridge())
])
r2_10cv = cross_val_score(ridge_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8531922643781689
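To connect this back to the idea of keeping coefficients from growing too extreme, the sketch below (an extra check, not part of the original notebook) fits both the plain and the Ridge cubic pipelines on the training set and compares the size of their learned coefficients. The exact numbers will depend on your train-test split, but the plain linear regression coefficients should come out dramatically larger.

# fit both pipelines on the training set and compare the learned coefficients
poly3_reg_pipeline.fit(train, train[[target]])
ridge_reg_pipeline.fit(train, train[[target]])

lin_coefs = poly3_reg_pipeline.named_steps['modeling'].coef_
ridge_coefs = ridge_reg_pipeline.named_steps['modeling'].coef_

# largest coefficient (in absolute value) and the penalty term for each model
print(np.abs(lin_coefs).max(), np.abs(ridge_coefs).max())
print(np.sum(lin_coefs ** 2), np.sum(ridge_coefs ** 2))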
Quadratic and bi-quadratic models
Just for a sanity check, let us try Ridge regression on the quadratic and bi-quadratic data. The code is identical except for the polynomial degree and variable names, so it is hidden. Anyway, the quadratic model gets a CV R2 of 0.852, and the bi-quadratic one 0.856.
num_pipeline_poly2 = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('polynomial', PolynomialFeatures(degree=2)),
    ('standardize', StandardScaler()),
])

data_pipeline_poly2 = ColumnTransformer([
    ('numeric', num_pipeline_poly2, num_cols),
    ('class', cat_pipeline, cat_cols)
])

ridge_reg_pipeline_2 = Pipeline([
    ('processing', data_pipeline_poly2),
    ('modeling', Ridge())
])
r2_10cv = cross_val_score(ridge_reg_pipeline_2, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8518732915328598
num_pipeline_poly4 = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('polynomial', PolynomialFeatures(degree=4)),
    ('standardize', StandardScaler()),
])

data_pipeline_poly4 = ColumnTransformer([
    ('numeric', num_pipeline_poly4, num_cols),
    ('class', cat_pipeline, cat_cols)
])

ridge_reg_pipeline_4 = Pipeline([
    ('processing', data_pipeline_poly4),
    ('modeling', Ridge())
])
r2_10cv = cross_val_score(ridge_reg_pipeline_4, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8556905493636462
Which model should we use?
While the CV R2 seems to increase with the polynomial degree, the differences are fairly negligible. Plus, they could be caused purely by the different CV splits that SKLearn performed. Given the tradeoff in model and data complexity, I would totally stay with the quadratic data and model. For your reference, the quadratic data has 31 features, the cubic data 87 features, and the bi-quadratic… 213 features! Below is the summary table for your convenience, followed by a quick way to check those feature counts yourself.
|              | Linear | Quadratic | Cubic | bi-Quad |
|--------------|--------|-----------|-------|---------|
| CV R2        | 0.806  | 0.852     | 0.853 | 0.856   |
| No. features | 9      | 31        | 87    | 213     |
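If you want to double-check those feature counts, one way (assuming the pipeline variables defined above are still in memory) is to fit each processing pipeline and look at the width of the transformed training data.

# number of columns produced by each processing pipeline; should match the table above
for name, pipe in [('linear', data_pipeline_linear),
                   ('quadratic', data_pipeline_poly2),
                   ('cubic', data_pipeline_poly3),
                   ('bi-quadratic', data_pipeline_poly4)]:
    print(name, pipe.fit_transform(train).shape[1])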
Conclusion
In this post, we have learned more about the cause of overfitting and got to know one solution for it: model regularization. Regularization is a very important technique in data analytics and machine learning, so please take your time to understand the concept. Furthermore, we have talked about and showcased one simplified version of Ridge regression. However, there is more to this model than that, so we will continue in the next post. Until next time!