The basic linear regression model is fairly limited in that it requires the features and the target to have a linear correlation to perform well. In practice, though, we commonly observe nonlinear correlations between features and targets, which need more advanced models to be fully exploited. However, let us not jump to the complicated machine learning algorithms just yet. There is still a way to adapt linear models to nonlinearity, which is to use polynomial regression. So, in this post, we will discuss these models and how to use them.
Polynomial patterns
Hopefully, you remember that a simple linear regression model fits an equation y = ax + b between the target y and the feature x. The equation is linear and has the form of a straight line when drawn on a two-dimensional coordinate system. Therefore, a straight-line pattern on a scatter plot suggests a linear correlation between the target and the feature.
So what if we see a curvy pattern instead? It means the correlation is probably nonlinear. One easy way to address this is to make the equation nonlinear as well by increasing its polynomial degree. Roughly speaking, a polynomial model of degree k with one feature has terms up to xᵏ in its equation and has the form below:

y = a₀ + a₁x + a₂x² + … + aₖxᵏ
For each additional feature, we have another set of its powers from 1 to k. So, a linear regression model is simply the case of a polynomial with degree 1. For example, with two features x₁ and x₂, a degree-2 polynomial model (before adding interaction terms, which we will discuss shortly) is:

y = a₀ + a₁x₁ + a₂x₁² + a₃x₂ + a₄x₂²
The higher k is, the more complicated the patterns our equation can represent. Below are examples of the patterns that a polynomial model can handle with k = 1, 2, 3, and 4. One thing to note is that, while k is not bounded, we rarely use k higher than 2. If you feel the need for a cubic model or above, you should probably just use a machine learning model instead.
[Figure: example patterns for k=1 (linear), k=2 (quadratic), k=3 (cubic), and k=4 (bi-quadratic)]
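To make the idea concrete, below is a minimal sketch, not part of this post's notebook, that fits a noisy quadratic pattern with numpy's polyfit. The degree-1 fit can only produce a straight line and underfits the curve, while the degree-2 fit recovers coefficients close to the ones used to generate the data.

import numpy as np

#synthetic curvy data: a quadratic pattern plus some noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 1.5*x**2 - 2*x + 0.5 + rng.normal(scale=0.5, size=x.size)

#polyfit returns the coefficients from the highest power down to the intercept
coef_linear = np.polyfit(x, y, deg=1)     #straight line, underfits the curve
coef_quadratic = np.polyfit(x, y, deg=2)  #should come out close to [1.5, -2, 0.5]
print(coef_linear)
print(coef_quadratic)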
Interactions between features
There is another concept in polynomial regression, which is interaction. Simply speaking, an interaction between two features is their product. An interaction of degree k means that the powers of the two features in the product sum to k. For example, the quadratic interaction between x₁ and x₂ is x₁x₂, while a cubic interaction is either x₁²x₂ or x₁x₂². A complete polynomial model of degree k can include up to that degree of interactions for every pair of features. For example, with two features x₁ and x₂, the quadratic and cubic models are as follows:

y = a₀ + a₁x₁ + a₂x₂ + a₃x₁² + a₄x₂² + a₅x₁x₂

y = a₀ + a₁x₁ + a₂x₂ + a₃x₁² + a₄x₂² + a₅x₁x₂ + a₆x₁³ + a₇x₂³ + a₈x₁²x₂ + a₉x₁x₂²

Compared to a linear model with its equation y = a₀ + a₁x₁ + a₂x₂, you can see that the model complexity increases very fast as k grows.
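If you would like to see exactly which terms such a complete expansion contains, scikit-learn's PolynomialFeatures transformer (which we will use properly later in this post) can list them. This is just an illustrative sketch; note that get_feature_names_out() requires a fairly recent scikit-learn version (older versions use get_feature_names() instead).

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_dummy = np.zeros((1, 2))  #two placeholder features, just to fit the transformer
for degree in [2, 3]:
    poly = PolynomialFeatures(degree=degree).fit(X_dummy)
    print(degree, poly.get_feature_names_out(['x1', 'x2']))
#degree 2 gives: 1, x1, x2, x1^2, x1 x2, x2^2
#degree 3 adds: x1^3, x1^2 x2, x1 x2^2, x2^3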
So why do we need such complications? Because sometimes the target is correlated with the interactions but looks random against the individual features. Below is one such example. Looking at the first two scatter plots, we do not see any meaningful patterns; however, with the interaction x₁x₂, y has a very strong correlation. Regardless, as discussed previously, we rarely go higher than quadratic models.
[Figure: scatter plots of y against x₁, x₂, and the interaction x₁x₂]
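Here is a small synthetic sketch, not the data behind the plots above, illustrating the same effect: the target is driven by the product x₁x₂, so neither feature alone shows a meaningful correlation with y, while the interaction does.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
y = x1*x2 + rng.normal(scale=0.1, size=500)  #y depends only on the interaction

df = pd.DataFrame({'x1': x1, 'x2': x2, 'x1*x2': x1*x2, 'y': y})
print(df.corr()['y'])  #near 0 for x1 and x2, near 1 for x1*x2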
The auto-mpg data
The complete notebook for this post is available here. For demonstration, we will use the auto-mpg data. This data set consists of several features of different car models, and the target is to predict their miles-per-gallon (mpg). The original data can be obtained from the UCI machine learning repository. We start with importing, a train-test split, and info(). Nothing seems too unusual here, except for origin, which I believe is a code for the area that produced the car (e.g., Asia, Europe, North America) rather than an actual number, so I will treat it as categorical.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
#load the data and hold out 20% as the test set
data = pd.read_csv('auto-mpg.csv')
train, test = train_test_split(data, test_size=0.2)
train.head(n=3)
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | carname |
|---|---|---|---|---|---|---|---|---|---|
| 284 | 20.6 | 6 | 225.0 | 110.0 | 3360 | 16.6 | 79 | 1 | dodge aspen 6 |
| 18 | 27.0 | 4 | 97.0 | 88.0 | 2130 | 14.5 | 70 | 3 | datsun pl510 |
| 88 | 14.0 | 8 | 302.0 | 137.0 | 4042 | 14.5 | 73 | 1 | ford gran torino |
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 318 entries, 284 to 100
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   mpg           318 non-null    float64
 1   cylinders     318 non-null    int64
 2   displacement  318 non-null    float64
 3   horsepower    314 non-null    float64
 4   weight        318 non-null    int64
 5   acceleration  318 non-null    float64
 6   year          318 non-null    int64
 7   origin        318 non-null    int64
 8   carname       318 non-null    object
dtypes: float64(4), int64(4), object(1)
memory usage: 24.8+ KB
Next, we investigate the distributions of the numeric columns and their correlations using scatter_matrix(). All distributions are skewed; however, I will not perform a log transformation here, since it would just complicate this post. We further observe that the correlations of mpg with the other features are all nonlinear, so this is an ideal case to demonstrate polynomial regression.
import matplotlib.pyplot as plt
#numeric feature columns (also defined again in the modeling section below)
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
pd.plotting.scatter_matrix(train[num_cols+['mpg']], figsize=(10,10))
plt.show()
Linear regression
Let us still start with a linear model and see how it performs. Below is the typical pipeline that we have seen many times: numeric features undergo standardization and then imputation, while categorical features go through one-hot encoding. The linear model is placed at the end of the pipeline. Finally, using cross-validation scoring, we obtain a CV R² of 0.82, meaning the linear model can explain 82% of the variation in mpg, which is not bad.
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
cat_cols = ['origin']
target = 'mpg'
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
#pipeline for numeric features
num_pipeline = Pipeline([
('standardize', StandardScaler()),
('impute', SimpleImputer(strategy='median'))
])
#pipeline for class features
cat_pipeline = Pipeline([
('encoder', OneHotEncoder())
])
#full pipeline - combine numeric and class pipelines
data_pipeline = ColumnTransformer([
('numeric', num_pipeline, num_cols),
('class', cat_pipeline, cat_cols)
])
linear_reg_pipeline = Pipeline([
('processing', data_pipeline),
('modeling', LinearRegression())
])
from sklearn.model_selection import cross_val_score
r2_10cv = cross_val_score(linear_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8202131342038648
Quadratic model
With a linear model to compare against, now let us try degree 2, the quadratic model, which is very easy in SKLearn. We just use the transformer PolynomialFeatures to generate polynomial features, including interactions. PolynomialFeatures takes the degree as its input, which we set to 2 for a quadratic model. In terms of the order of steps in the processing pipeline, I put PolynomialFeatures at the beginning so as to make sure the processed data indeed has means of 0 and standard deviations of 1. We reuse the categorical pipeline from before, so there is no need to redefine it. Finally, we obtain a CV R² for this model of 0.861, which outperforms the linear model quite a bit. So, on this data, polynomial regression is indeed better, as suggested by the scatter plots.
from sklearn.preprocessing import PolynomialFeatures
num_pipeline_poly2 = Pipeline([
('polynomial', PolynomialFeatures(degree=2)),
('standardize', StandardScaler()),
('impute', SimpleImputer(strategy='median'))
])
#full pipeline - combine numeric and class pipelines
data_pipeline_poly2 = ColumnTransformer([
('numeric', num_pipeline_poly2, num_cols),
('class', cat_pipeline, cat_cols)
])
poly2_reg_pipeline = Pipeline([
('processing', data_pipeline_poly2),
('modeling', LinearRegression())
])
r2_10cv = cross_val_score(poly2_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.860756124393966
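Before trying higher degrees, it is worth a quick look, as a rough sketch reusing the num_cols list defined above, at how fast the number of generated columns grows with the degree. The counts include the constant (bias) column that PolynomialFeatures adds by default.

#how many columns does PolynomialFeatures generate from our 6 numeric features?
dummy = np.zeros((1, len(num_cols)))
for degree in [1, 2, 3, 4]:
    n_out = PolynomialFeatures(degree=degree).fit(dummy).n_output_features_
    print(degree, n_out)  #7, 28, 84, and 210 columns respectively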
Higher degrees?
Would using higher-degree polynomial regression give us even better performance? Let us try right now! First, for the cubic model, we change degree to 3 in the pipeline:
num_pipeline_poly3 = Pipeline([
('polynomial', PolynomialFeatures(degree=3)),
('standardize', StandardScaler()),
('impute', SimpleImputer(strategy='median'))
])
#full pipeline - combine numeric and class pipelines
data_pipeline_poly3 = ColumnTransformer([
('numeric', num_pipeline_poly3, num_cols),
('class', cat_pipeline, cat_cols)
])
poly3_reg_pipeline = Pipeline([
('processing', data_pipeline_poly3),
('modeling', LinearRegression())
])
r2_10cv = cross_val_score(poly3_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.5773308094289162
Then, the bi-quadratic model with degree=4:
num_pipeline_poly4 = Pipeline([
('polynomial', PolynomialFeatures(degree=4)),
('standardize', StandardScaler()),
('impute', SimpleImputer(strategy='median'))
])
#full pipeline - combine numeric and class pipelines
data_pipeline_poly4 = ColumnTransformer([
('numeric', num_pipeline_poly4, num_cols),
('class', cat_pipeline, cat_cols)
])
poly4_reg_pipeline = Pipeline([
('processing', data_pipeline_poly4),
('modeling', LinearRegression())
])
r2_10cv = cross_val_score(poly4_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
-125.40707047461794
So, the cubic model gets a CV R² of… 0.577, and the bi-quadratic… -125.4!? What happened? Well, the short answer is that a more complicated model does not always mean higher performance. And the long answer? Let us spend a whole post on that!
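If you want to reproduce this comparison in one place, here is a compact sketch that simply loops over the degrees, reusing the same pipeline pieces and cross-validation setup defined above.

#compare CV R2 across polynomial degrees with the same preprocessing and model
scores = {}
for degree in [1, 2, 3, 4]:
    num_pipe = Pipeline([
        ('polynomial', PolynomialFeatures(degree=degree)),
        ('standardize', StandardScaler()),
        ('impute', SimpleImputer(strategy='median'))
    ])
    full_pipe = Pipeline([
        ('processing', ColumnTransformer([
            ('numeric', num_pipe, num_cols),
            ('class', cat_pipeline, cat_cols)
        ])),
        ('modeling', LinearRegression())
    ])
    scores[degree] = np.mean(cross_val_score(full_pipe, train, train[[target]], cv=10, scoring='r2'))
print(scores)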
Conclusion
In this post, we have gone through the concepts of polynomial regression, which is a more general version of linear regression that can deal with nonlinear patterns or correlations in the data. And, I promise, we will discuss the interesting phenomenon we saw when using cubic and bi-quadratic models on the auto-mpg data. It deserves its own post! So, until next time.