The basic linear regression model is fairly limited in that it requires the features and the target to have a linear correlation to perform well. In practice, though, we commonly observe nonlinear correlations between features and targets, which need more advanced models to be fully exploited. However, let us not jump to the complicated machine learning algorithms just yet. There is still a way to adapt linear models to nonlinearity, which is to use polynomial regression. So, in this post, we will discuss these models and how to use them.
Polynomial patterns
Hopefully, you remember that a simple linear regression model fits an equation y = ax + b between the target y and the feature x. The equation is linear and has the form of a straight line when drawn on a two-dimensional coordinate system. Therefore, a straight-line pattern on a scatter plot suggests a linear correlation between the target and the feature.
So what if we see a curvy pattern instead? It means the correlation is probably nonlinear. One easy way to address this is to make the equation nonlinear as well by increasing its polynomial degree. Roughly speaking, a polynomial model of degree k with one feature has terms up to xᵏ in its equation and has the form below:

y = a₀ + a₁x + a₂x² + … + aₖxᵏ
For each additional feature, we have another set of its powers from 1 to k. So, a linear regression model is simply the case of a polynomial with degree 1. For example, with two features x₁ and x₂, a degree-2 polynomial model (before adding interaction terms, which we will discuss shortly) is:

y = a₀ + a₁x₁ + a₂x₁² + a₃x₂ + a₄x₂²
The higher k is, the more complicated the patterns our equation can represent. Below are examples of the patterns that a polynomial model can handle with k = 1, 2, 3, and 4. One thing to note is that, while k is not bounded, we rarely use k higher than 2. If you feel the need for a cubic model or above, you should probably just use a machine learning model instead.
[Figure: example patterns for k=1 (linear), k=2 (quadratic), k=3 (cubic), and k=4 (bi-quadratic)]
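To make the idea concrete, below is a minimal sketch, not part of this post's notebook, that fits a noisy quadratic pattern with numpy's polyfit. The degree-1 fit can only produce a straight line and underfits the curve, while the degree-2 fit recovers coefficients close to the ones used to generate the data.

import numpy as np

#synthetic curvy data: a quadratic pattern plus some noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 1.5*x**2 - 2*x + 0.5 + rng.normal(scale=0.5, size=x.size)

#polyfit returns the coefficients from the highest power down to the intercept
coef_linear = np.polyfit(x, y, deg=1)     #straight line, underfits the curve
coef_quadratic = np.polyfit(x, y, deg=2)  #should come out close to [1.5, -2, 0.5]
print(coef_linear)
print(coef_quadratic)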
Interactions between features
There is another concept in polynomial regression, which is interaction. Simply speaking, an interaction between two features is their product. An interaction of degree k means that the powers of the two features in the product sum to k. For example, the quadratic interaction between x₁ and x₂ is x₁x₂, while a cubic interaction is either x₁²x₂ or x₁x₂². A complete polynomial model of degree k can include up to that degree of interactions for every pair of features. For example, with two features x₁ and x₂, the quadratic and cubic models are as follows:

y = a₀ + a₁x₁ + a₂x₂ + a₃x₁² + a₄x₂² + a₅x₁x₂

y = a₀ + a₁x₁ + a₂x₂ + a₃x₁² + a₄x₂² + a₅x₁x₂ + a₆x₁³ + a₇x₂³ + a₈x₁²x₂ + a₉x₁x₂²

Compared to a linear model with its equation y = a₀ + a₁x₁ + a₂x₂, you can see that the model complexity increases very fast as k grows.
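If you would like to see exactly which terms such a complete expansion contains, scikit-learn's PolynomialFeatures transformer (which we will use properly later in this post) can list them. This is just an illustrative sketch; note that get_feature_names_out() requires a fairly recent scikit-learn version (older versions use get_feature_names() instead).

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_dummy = np.zeros((1, 2))  #two placeholder features, just to fit the transformer
for degree in [2, 3]:
    poly = PolynomialFeatures(degree=degree).fit(X_dummy)
    print(degree, poly.get_feature_names_out(['x1', 'x2']))
#degree 2 gives: 1, x1, x2, x1^2, x1 x2, x2^2
#degree 3 adds: x1^3, x1^2 x2, x1 x2^2, x2^3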
So why do we need such complications? Because sometimes the target is correlated with the interactions but looks random against the individual features. Below is one such example. Looking at the first two scatter plots, we do not see any meaningful patterns; however, with the interaction x₁x₂, y has a very strong correlation. Regardless, as discussed previously, we rarely go higher than quadratic models.
[Figure: scatter plots of y against x₁, x₂, and the interaction x₁x₂]
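Here is a small synthetic sketch, not the data behind the plots above, illustrating the same effect: the target is driven by the product x₁x₂, so neither feature alone shows a meaningful correlation with y, while the interaction does.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
y = x1*x2 + rng.normal(scale=0.1, size=500)  #y depends only on the interaction

df = pd.DataFrame({'x1': x1, 'x2': x2, 'x1*x2': x1*x2, 'y': y})
print(df.corr()['y'])  #near 0 for x1 and x2, near 1 for x1*x2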
The auto-mpg data
The complete notebook for this post is available here. For demonstration, we will use the auto-mpg data. This data set consists of several features of different car models, and the target is to predict their miles-per-gallon (mpg). The original data can be obtained from the UCI machine learning repository. We start with importing, a train-test split, and info(). Nothing seems too unusual here, except for origin, which I believe is a code for the area that produced the car (e.g., Asia, Europe, North America) rather than an actual number, so I will treat it as categorical.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
#load the data and hold out 20% as the test set
data = pd.read_csv('auto-mpg.csv')
train, test = train_test_split(data, test_size=0.2)
train.head(n=3)
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | carname |
|---|---|---|---|---|---|---|---|---|---|
| 284 | 20.6 | 6 | 225.0 | 110.0 | 3360 | 16.6 | 79 | 1 | dodge aspen 6 |
| 18 | 27.0 | 4 | 97.0 | 88.0 | 2130 | 14.5 | 70 | 3 | datsun pl510 |
| 88 | 14.0 | 8 | 302.0 | 137.0 | 4042 | 14.5 | 73 | 1 | ford gran torino |
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 318 entries, 284 to 100
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   mpg           318 non-null    float64
 1   cylinders     318 non-null    int64
 2   displacement  318 non-null    float64
 3   horsepower    314 non-null    float64
 4   weight        318 non-null    int64
 5   acceleration  318 non-null    float64
 6   year          318 non-null    int64
 7   origin        318 non-null    int64
 8   carname       318 non-null    object
dtypes: float64(4), int64(4), object(1)
memory usage: 24.8+ KB
Next, we investigate the distributions of the numeric columns and their correlations using scatter_matrix(). All distributions are skewed; however, I will not perform a log transformation here, since it would just complicate this post. We further observe that the correlations of mpg with the other features are all nonlinear, so this is an ideal case to demonstrate polynomial regression.
import matplotlib.pyplot as plt
#numeric feature columns (also defined again in the modeling section below)
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
pd.plotting.scatter_matrix(train[num_cols+['mpg']], figsize=(10,10))
plt.show()
Linear regression
Let us still start with a linear model and see how it performs. Below is the typical pipeline that we have seen many times: numeric features undergo standardization and then imputation, while categorical features go through one-hot encoding. The linear model is placed at the end of the pipeline. Finally, using cross-validation scoring, we obtain a CV R² of 0.82, meaning the linear model can explain 82% of the variation in mpg, which is not bad.
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
cat_cols = ['origin']
target = 'mpg'
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
#pipeline for numeric features
num_pipeline = Pipeline([
('standardize', StandardScaler()),
('impute', SimpleImputer(strategy='median'))
])
#pipeline for class features
cat_pipeline = Pipeline([
('encoder', OneHotEncoder())
])
#full pipeline - combine numeric and class pipelines
data_pipeline = ColumnTransformer([
('numeric', num_pipeline, num_cols),
('class', cat_pipeline, cat_cols)
])
linear_reg_pipeline = Pipeline([
('processing', data_pipeline),
('modeling', LinearRegression())
])
from sklearn.model_selection import cross_val_score
r2_10cv = cross_val_score(linear_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8202131342038648
Quadratic model
With a linear model to compare against, now let us try degree 2, the quadratic model, which is very easy in SKLearn. We just use the transformer PolynomialFeatures to generate polynomial features, including interactions. PolynomialFeatures takes the degree as its input, which we set to 2 for a quadratic model. In terms of the order of steps in the processing pipeline, I put PolynomialFeatures at the beginning so as to make sure the processed data indeed has means of 0 and standard deviations of 1. We reuse the categorical pipeline from before, so there is no need to redefine it. Finally, we obtain a CV R² for this model of 0.861, which outperforms the linear model quite a bit. So, on this data, polynomial regression is indeed better, as suggested by the scatter plots.
from sklearn.preprocessing import PolynomialFeatures
num_pipeline_poly2 = Pipeline([
('polynomial', PolynomialFeatures(degree=2)),
('standardize', StandardScaler()),
('impute', SimpleImputer(strategy='median'))
])
#full pipeline - combine numeric and class pipelines
data_pipeline_poly2 = ColumnTransformer([
('numeric', num_pipeline_poly2, num_cols),
('class', cat_pipeline, cat_cols)
])
poly2_reg_pipeline = Pipeline([
('processing', data_pipeline_poly2),
('modeling', LinearRegression())
])
r2_10cv = cross_val_score(poly2_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.860756124393966
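Before trying higher degrees, it is worth a quick look, as a rough sketch reusing the num_cols list defined above, at how fast the number of generated columns grows with the degree. The counts include the constant (bias) column that PolynomialFeatures adds by default.

#how many columns does PolynomialFeatures generate from our 6 numeric features?
dummy = np.zeros((1, len(num_cols)))
for degree in [1, 2, 3, 4]:
    n_out = PolynomialFeatures(degree=degree).fit(dummy).n_output_features_
    print(degree, n_out)  #7, 28, 84, and 210 columns respectively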
Higher degrees?
Would using higher-degree polynomial regression give us even better performance? Let us try right now! First, for the cubic model, we change degree to 3 in the pipeline:
num_pipeline_poly3 = Pipeline([
('polynomial', PolynomialFeatures(degree=3)),
('standardize', StandardScaler()),
('impute', SimpleImputer(strategy='median'))
])
#full pipeline - combine numeric and class pipelines
data_pipeline_poly3 = ColumnTransformer([
('numeric', num_pipeline_poly3, num_cols),
('class', cat_pipeline, cat_cols)
])
poly3_reg_pipeline = Pipeline([
('processing', data_pipeline_poly3),
('modeling', LinearRegression())
])
r2_10cv = cross_val_score(poly3_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.5773308094289162
Then, the bi-quadratic model with degree=4:
num_pipeline_poly4 = Pipeline([
('polynomial', PolynomialFeatures(degree=4)),
('standardize', StandardScaler()),
('impute', SimpleImputer(strategy='median'))
])
#full pipeline - combine numeric and class pipelines
data_pipeline_poly4 = ColumnTransformer([
('numeric', num_pipeline_poly4, num_cols),
('class', cat_pipeline, cat_cols)
])
poly4_reg_pipeline = Pipeline([
('processing', data_pipeline_poly4),
('modeling', LinearRegression())
])
r2_10cv = cross_val_score(poly4_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
-125.40707047461794
So, the cubic model gets a CV R² of… 0.577, and the bi-quadratic… -125.4!? What happened? Well, the short answer is that a more complicated model does not always mean higher performance. And the long answer? Let us spend a whole post on that!
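If you want to reproduce this comparison in one place, here is a compact sketch that simply loops over the degrees, reusing the same pipeline pieces and cross-validation setup defined above.

#compare CV R2 across polynomial degrees with the same preprocessing and model
scores = {}
for degree in [1, 2, 3, 4]:
    num_pipe = Pipeline([
        ('polynomial', PolynomialFeatures(degree=degree)),
        ('standardize', StandardScaler()),
        ('impute', SimpleImputer(strategy='median'))
    ])
    full_pipe = Pipeline([
        ('processing', ColumnTransformer([
            ('numeric', num_pipe, num_cols),
            ('class', cat_pipeline, cat_cols)
        ])),
        ('modeling', LinearRegression())
    ])
    scores[degree] = np.mean(cross_val_score(full_pipe, train, train[[target]], cv=10, scoring='r2'))
print(scores)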
Conclusion
In this post, we have gone through the concepts of polynomial regression, which is a more general version of linear regression that can deal with nonlinear patterns or correlations in the data. And, I promise, we will discuss the interesting phenomenon we saw when using cubic and bi-quadratic models on the auto-mpg data. It deserves its own post! So, until next time.