Polynomial Regression

an illustration of a nonlinear pattern and a polynomial regression model

The basic linear regression model is fairly limited in that it requires the features and the target to be linearly correlated to perform well. In practice, though, we commonly observe nonlinear relationships between features and targets, which need more advanced models to be fully exploited. However, let us not jump to complicated machine learning algorithms just yet. There is still a way to adapt linear models to nonlinearity, which is to use polynomial regression. So, in this post, we will discuss these models and how to use them.

Polynomial patterns

Hopefully, you remember that a simple linear regression model fits an equation y = ax + b between the target y and the feature x. The equation is linear and traces a straight line when drawn on a two-dimensional coordinate system. Therefore, a straight-line pattern on a scatter plot suggests a linear correlation between the target and the feature.

So what if we see a curvy pattern instead? It means the correlation is probably nonlinear. One easy way to address this is to make the equation nonlinear as well by increasing its polynomial degree. Roughly speaking, a polynomial model of degree k with one feature has terms up to x^k in its equation and takes the form below:

y = a_0 + a_1x + a_2x^2 + ... + a_kx^k

For each additional feature, we add another set of its powers from 1 to k. So, a linear regression model is simply the polynomial model of degree 1. For example, with two features x_1 and x_2, a degree 2 polynomial model (ignoring interactions for now) is:

y = a_0 + a_1x_1 + a_2x_1^2 + a_3x_2 + a_4x_2^2

The higher k is, the more complicated patterns our equation can represent. Below are examples of the patterns that a polynomial model can handle with k = 1, 2, 3, and 4; a minimal fitting sketch follows the figures. One thing to note is that, while k is not bounded, we rarely use k higher than 2. If you feel the need for a cubic model or above, you should probably just use a machine learning model instead.

an illustration of a linear correlation

k=1 (linear)

an illustration of a quadratic correlation

k=2 (quadratic)

an illustration of a cubic correlation

k=3 (cubic)

an illustration of a bi-quadratic correlation

k=4 (bi-quadratic)
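To make this concrete, here is a minimal sketch (not from the original notebook) of fitting polynomials of degree 1 through 4 to one-dimensional synthetic data; the data-generating function, noise level, and the use of np.polyfit are my own choices for illustration.

```python
import numpy as np

# Synthetic one-dimensional data with a curvy (cubic-ish) pattern; the
# generating function here is made up purely for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 100)
y = 0.5 * x**3 - x + rng.normal(scale=0.3, size=x.size)

# Fit polynomial models of degree k = 1..4 and compare how well they fit.
for k in range(1, 5):
    coefs = np.polyfit(x, y, deg=k)      # least-squares fit of degree k
    y_hat = np.polyval(coefs, x)         # predictions of the fitted curve
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"degree {k}: R^2 = {r2:.3f}")
```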

Interactions between features

There is another concept in polynomial regression, which is interaction. Simply speaking, an interaction between two features is their product. An interaction of degree k means that the powers of the two features in the product sum to k. For example, the quadratic interaction between x_1 and x_2 is x_1x_2, while their cubic interactions are x_1^2x_2 and x_1x_2^2. A complete polynomial model of degree k can include interactions up to that degree for every pair of features. For example, with two features x_1 and x_2, the quadratic and cubic models are as follows. Compared to a linear model with its equation y = a_0 + a_1x_1 + a_2x_2, you can see that the model complexity increases very fast as k increases.

y = a_0 + a_1x_1 + a_2x_1^2 + a_3x_2 + a_4x_2^2 + a_5x_1x_2

y = a_0 + a_1x_1 + a_2x_1^2 + a_3x_1^3 + a_4x_2 + a_5x_2^2 + a_6x_2^3 + a_7x_1x_2 + a_8x_1^2x_2 + a_9x_1x_2^2
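To get a feel for how fast the term count grows, here is a small sketch (my own addition, not from the original post): a full polynomial model of degree k over d features, with all interactions, has C(d + k, k) terms including the intercept.

```python
from math import comb

# Number of terms (including the intercept) in a full polynomial model of
# degree k over d features, with all interaction terms included.
def n_terms(d: int, k: int) -> int:
    return comb(d + k, k)

for k in range(1, 5):
    print(f"degree {k}: {n_terms(2, k)} terms with 2 features, "
          f"{n_terms(5, k)} with 5, {n_terms(10, k)} with 10")
```

For two features, this gives 3, 6, 10, and 15 terms at degrees 1 through 4, matching the equations above.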

So why do we need such complications? Because sometimes the target is correlated with the interactions but appears random against the individual features. Below is one such example. Looking at the first two scatter plots, we do not see any meaningful pattern; however, against the interaction x1*x2, y shows a very strong correlation. Regardless, we rarely go higher than quadratic models, as discussed previously.

an illustration of no correlation

x1 and y

an illustration of no correlation

x2 and y

an illustration of strong correlation with interaction

x1*x2 and y
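If you want to reproduce this kind of example, here is a minimal sketch with synthetic data; the sample size and noise level are made up and this is not the data behind the plots above.

```python
import numpy as np

# Synthetic example: y depends only on the interaction x1 * x2, so y looks
# random against either feature alone but strongly correlated with the product.
rng = np.random.default_rng(42)
x1 = rng.uniform(-1, 1, size=500)
x2 = rng.uniform(-1, 1, size=500)
y = x1 * x2 + rng.normal(scale=0.05, size=500)

print(np.corrcoef(x1, y)[0, 1])       # near 0
print(np.corrcoef(x2, y)[0, 1])       # near 0
print(np.corrcoef(x1 * x2, y)[0, 1])  # close to 1
```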

The auto-mpg data

The complete notebook for this post is available here. For demonstration, we will use the auto-mpg data in this post. This data set consists of several features of different car models, and the target is to predict their miles per gallon (mpg). The original data can be obtained from the UCI machine learning repository. We start with importing, the train-test split, and info(). Nothing seems too unusual here, except for origin, which I think is a code for the region that produced the car (e.g., Asia, Europe, North America) rather than an actual number, so I will treat it as categorical.
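For reference, the loading step might look roughly like the sketch below; the file name and format are assumptions on my part, so check the linked notebook for the actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The file name and format here are assumptions; the raw UCI file uses '?'
# to mark missing values, hence na_values="?".
df = pd.read_csv("auto-mpg.csv", na_values="?")

# Hold out a test set before exploring, then take a first look at the columns.
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
train_set.info()
```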

Next, we investigate the distributions of the numeric columns and their correlations with scatter_matrix(). All distributions are skewed; however, I will not perform a log transformation since it would just complicate this post. We further observe that the correlations of mpg with the other features are all nonlinear, so this is an ideal case to demonstrate polynomial regression.
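The scatter matrix itself is a one-liner with pandas; the column list below is my assumption about which features to plot.

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Pairwise scatter plots (off-diagonal) and histograms (diagonal) for a
# subset of the numeric columns, target included.
plot_cols = ["mpg", "displacement", "horsepower", "weight", "acceleration"]
scatter_matrix(train_set[plot_cols], figsize=(12, 12))
plt.show()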

Linear regression

Let us still start with a linear model and see how it performs. Below is the typical pipeline that we have seen many times. Numeric features undergo standardization and then imputation, while the categorical one goes through one-hot encoding. The linear model is placed at the end of the pipeline. Finally, using cross-validation scoring, we obtain a CV R2 of 0.82, meaning the linear model can explain 82% of the variation in mpg, which is not bad.
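A sketch of what this pipeline could look like; the column names, the imputation strategy, and cv=5 are assumptions of mine, and the linked notebook holds the actual code.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names are assumptions based on the standard auto-mpg features.
num_cols = ["cylinders", "displacement", "horsepower", "weight",
            "acceleration", "model_year"]
cat_cols = ["origin"]

# Numeric branch: standardize, then impute the (now zero-mean) missing values.
num_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("impute", SimpleImputer(strategy="mean")),
])
cat_pipe = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

lin_model = Pipeline([
    ("preprocess", preprocess),
    ("model", LinearRegression()),
])

X_train = train_set.drop(columns="mpg")
y_train = train_set["mpg"]
scores = cross_val_score(lin_model, X_train, y_train, cv=5, scoring="r2")
print(scores.mean())  # the post reports a CV R2 of about 0.82
```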

Quadratic model

With a linear model to compare against, let us now try degree 2, the quadratic model, which is very easy in SKLearn. We just use the transformer PolynomialFeatures to generate polynomial features, including interactions. PolynomialFeatures takes the degree as its input; we set it to 2 for a quadratic model. In terms of the order of steps in the processing pipeline, I put PolynomialFeatures before the scaler so that the processed data indeed ends up with means of 0 and standard deviations of 1. We reuse the categorical pipeline from before, so there is no need to redefine it. Finally, we obtain a CV R2 of 0.861 for this model, which outperforms the linear model quite a bit. So on this data, polynomial regression is indeed better, as suggested by the scatter plots.
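A sketch of the quadratic pipeline, reusing the names from the sketch above. I moved imputation ahead of PolynomialFeatures because, in the scikit-learn versions I have used, PolynomialFeatures does not accept missing values; the key point from the post still holds: the polynomial expansion happens before the scaler, so the final numeric features have mean 0 and standard deviation 1.

```python
from sklearn.preprocessing import PolynomialFeatures

# Numeric branch: impute first (PolynomialFeatures cannot handle NaN), then
# expand to degree-2 polynomial features with interactions, then standardize.
num_pipe_poly = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
])

preprocess_poly = ColumnTransformer([
    ("num", num_pipe_poly, num_cols),
    ("cat", cat_pipe, cat_cols),  # reuse the categorical encoder from before
])

quad_model = Pipeline([
    ("preprocess", preprocess_poly),
    ("model", LinearRegression()),
])

scores = cross_val_score(quad_model, X_train, y_train, cv=5, scoring="r2")
print(scores.mean())  # the post reports a CV R2 of about 0.861
```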

Higher degrees?

Would using higher-degree polynomial regression give us even better performance? Let us try right now! For the cubic model, we change degree to 3 in the pipeline, and for the bi-quadratic model, we set degree=4, as in the sketch below.
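Continuing the sketch above, the higher degrees only need a parameter change; the nested parameter name assumes the step names from my earlier sketch.

```python
# Change the degree of the PolynomialFeatures step inside the numeric branch
# and re-run cross-validation for the cubic and bi-quadratic models.
for degree in (3, 4):
    quad_model.set_params(preprocess__num__poly__degree=degree)
    scores = cross_val_score(quad_model, X_train, y_train, cv=5, scoring="r2")
    print(f"degree {degree}: CV R2 = {scores.mean():.3f}")
    # the post reports about 0.577 for degree 3 and a large negative value for degree 4
```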

So, the cubic model gets a CV R2 of… 0.577, and the bi-quadratic… -125.4!? What happened? Well, the short answer: more complicated models do not always mean higher performance. And the long answer? Let us spend a whole post on that!

Conclusion

In this post, we have gone through the concepts of polynomial regression, which is a more general version of linear regression that deals with nonlinear patterns or correlations in data. And, I promise, we will discuss the interesting phenomenon we just saw with the cubic and bi-quadratic models on the auto-mpg data. It deserves its own post! So, until next time.
