Model Regularization

an illustration of model regularization, in which models are restricted in how far they can go to ensure generalizability on unseen data

Previously, we discussed the overfitting problem, where a model overlearns the training data and fails to generalize. There are a couple of issues here. First, overfitting is obviously bad because it makes our model… bad, and should be avoided. However, as we saw in our small experiment, a linear model could not handle the nonlinearity and underfitted, while a quadratic model overfitted. There is no 1.5-degree polynomial model for us to use here (well, there are options, but they are manual and hard to automate). So, do we have to choose between underfitting and overfitting? No, we do not! There are ways to let models reach the complexity necessary to deal with nonlinearity while not overfitting the data. In this post, we will discuss one such method: model regularization.

So what causes overfitting, really?

As we learned in the last post, what happens during model training is that the optimization process tries to determine the specific values of the model coefficients that give the lowest training MSE. For example, with a linear model y = ax + b, SKLearn tries to find a and b so that the training MSE is as low as possible. As the polynomial degree grows, we have more coefficients: a quadratic model has three (y = a_0 + a_1x + a_2x^2), a cubic model four (y = a_0 + a_1x + a_2x^2 + a_3x^3). The number of coefficients grows even more rapidly when we have more features.
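To get a feel for how quickly the coefficient count grows, below is a small sketch that counts the columns SKLearn's PolynomialFeatures generates on some made-up data with 5 features (the data, the feature count, and the degrees are arbitrary, purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: 200 rows with 5 numeric features
X = np.random.rand(200, 5)

for degree in (1, 2, 3, 4):
    # Number of columns after expansion = number of coefficients to fit
    n_terms = PolynomialFeatures(degree=degree).fit_transform(X).shape[1]
    print(f"degree {degree}: {n_terms} coefficients (including the intercept)")
```

With just 5 base features, the counts come out to 6, 21, 56, and 126, so the growth really does take off quickly.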

Now, we also discussed how having more complexity means that our models become more flexible. At some point, they become too flexible and try to fit spurious patterns just so the training MSE is the lowest. Below is one example. On the same data, a linear model fits a straight line, a polynomial model at degree 10 fits a very odd curve, and at degree 20, the model straight up creates imaginary patterns by just connecting some instances. (A small sketch after the figures shows how fits like these can be reproduced.)

a linear model with a good fit

linear model

a degree 10 model with a bad fit

degree-10 polynomial

a degree 20 model with an extremely bad fit

degree-20 polynomial
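If you want to recreate this kind of comparison yourself, here is a minimal, self-contained sketch on toy data. The data-generating process, seed, and degrees are made up, so your curves will not match the figures above exactly, but the pattern of increasingly wild fits should be similar:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy, roughly linear toy data (made up for illustration)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = 3 * x.ravel() + rng.normal(0, 0.3, 30)

x_grid = np.linspace(0, 1, 300).reshape(-1, 1)
for degree in (1, 10, 20):
    # Higher degrees give the model enough flexibility to chase the noise
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x, y)
    plt.plot(x_grid, model.predict(x_grid), label=f"degree {degree}")

plt.scatter(x, y, color="black", s=15)
plt.ylim(y.min() - 1, y.max() + 1)  # keep the wild high-degree curves from dominating the axes
plt.legend()
plt.show()
```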

So, the root of the problem here is that the model is given too much freedom to fit anything it sees in the training data. And what is the solution to too much freedom? Of course, take some of it away! Sorry, bad joke, I know. But for real, a strategy to deal with overfitting is to restrain what the model can be, or, in a linear model, restrain the coefficients from growing too extreme. And this is the main idea behind model regularization.

Model regularization

In general, model regularization means controlling the model complexity while balancing it against the training error. The idea is, the model should be just complex enough. Different models have different ways of regularizing. So far, we have only learned linear regression, so let us talk about regularizing this model.

Let us have a general data set of k features (which could be independent or created from polynomial or interaction terms) x_1, x_2, ..., x_k. A linear model fitted on this data has the equation

y = a_0 + a_1x_1 + a_2x_2 + ... + a_kx_k

And, training this model means finding a set of values for a_0, a_1, a_2, ..., a_k so that the training MSE is the lowest. Minimizing the training MSE is called the training objective of this model, which, so far, has not had any control over the values of a_0 to a_k. Incorporating some restrictions on the coefficients is actually not difficult: we can add a penalty term to the training objective

minimize    training_MSE + penalty

where the penalty increases as the model becomes more complex. One way to define the penalty is penalty = a_0^2 + a_1^2 + ... + a_k^2. With this penalty, if any coefficient gets very large in scale (either positive or negative), the training objective significantly increases as well. This increase prevents the current set of coefficient values from yielding the minimum training objective, even if their training MSE is the lowest, so they will not be selected. The optimal solution must therefore balance having a small enough MSE with reasonable coefficient values. Overall, we have added constraints on the model and kept its complexity in check! The method I have just described here is a simple version of Ridge regression.
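To make the objective concrete, here is a minimal NumPy sketch of it. The alpha weight on the penalty is something the full Ridge formulation adds on top of this simplified version (SKLearn's Ridge exposes it as alpha), and the function and variable names here are just illustrative:

```python
import numpy as np

def ridge_objective(coefs, X, y, alpha=1.0):
    # X is assumed to already include a column of ones for the intercept a_0
    predictions = X @ coefs
    training_mse = np.mean((y - predictions) ** 2)
    # Squared-coefficient penalty; note that standard implementations usually
    # leave the intercept out of the penalty, unlike this simplified version
    penalty = alpha * np.sum(coefs ** 2)
    return training_mse + penalty
```

Training then means searching for the set of coefficients that minimizes this quantity instead of the plain training MSE.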

Ridge regression demonstration

I hope my explanation of regularization and Ridge regression is understandable. Now, let us move on to some demonstration. Let us still use the auto-mpg data and see if we can really solve the previous issue now. As usual, I will hide the parts that load the data, split train-test, and build the categorical pipeline, since they are just like before. Also, you can get the complete notebook here.

Linear model

We will start by fitting a regular linear regression model for reference. It gets a CV R2 of 0.806, which we will use as our baseline for the other models.
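For those following along, a sketch of what this step could look like is below. I am assuming the hidden setup produced a preprocessing transformer called preprocess plus X_train and y_train; those names (and the 5-fold CV) are my guesses, not necessarily what the notebook uses:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# preprocess, X_train, y_train come from the hidden setup code (assumed names)
linear_pipe = make_pipeline(preprocess, LinearRegression())
linear_cv_r2 = cross_val_score(linear_pipe, X_train, y_train, cv=5, scoring="r2").mean()
print(linear_cv_r2)  # around 0.806 on the author's split
```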

Cubic model

Remember how the cubic model was really struggling with overfitting? I will redo that here to get the CV R2 on the current split, which is 0.728, significantly lower than the baseline 0.806 and an obvious sign of overfitting.
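Here is a sketch of that cubic fit, with the same assumed names as above. In the actual notebook the polynomial expansion most likely happens inside the preprocessing step so that only the numeric columns get expanded, so treat this pipeline as a simplified stand-in:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Cubic features feeding a plain linear regression (preprocess, X_train, y_train assumed as before)
cubic_pipe = make_pipeline(preprocess, PolynomialFeatures(degree=3), LinearRegression())
cubic_cv_r2 = cross_val_score(cubic_pipe, X_train, y_train, cv=5, scoring="r2").mean()
print(cubic_cv_r2)  # around 0.728 on the author's split: worse than the baseline
```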

Now, we will try using Ridge regression on the same cubic-transformed data. This time, we use the Ridge model class instead of LinearRegression and add it on top of the processing pipeline. Are you amazed by the result? We now get a 0.853 CV R2, way higher than the baseline. The model no longer overfits and can even handle the nonlinearity now!
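The change is literally one estimator swap. In the sketch below, Ridge(alpha=1.0) uses SKLearn's default penalty weight since the post does not state which value was used:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same cubic features as before, but with Ridge replacing LinearRegression
ridge_cubic_pipe = make_pipeline(preprocess, PolynomialFeatures(degree=3), Ridge(alpha=1.0))
ridge_cv_r2 = cross_val_score(ridge_cubic_pipe, X_train, y_train, cv=5, scoring="r2").mean()
print(ridge_cv_r2)  # around 0.853 on the author's split
```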

Quadratic and bi-quadratic models

Just as a sanity check, let us try Ridge regression on the quadratic and bi-quadratic data. The code is identical except for the polynomial degree and variable names, so it is hidden. Anyway, the quadratic model gets a CV R2 of 0.852, and the bi-quadratic 0.856.

Which model should we use?

While the CV R2 seems to increase with the polynomial degree, the differences are fairly negligible. Plus, they could be caused purely by the different CV splits that SKLearn performed. Given the tradeoff in model and data complexity, I would totally stay with the quadratic data and model. For your reference, the quadratic data has 31 features, the cubic data 87 features, and the bi-quadratic… 213 features! Below is the summary table for your convenience.

              Linear   Quadratic   Cubic   bi-Quad
CV R2         0.806    0.852       0.853   0.856
No. features  9        31          87      213

Conclusion

In this post, we have learned more about the cause of overfitting and gotten to know one solution for it: model regularization. Regularization is a very important technique in data analytics and machine learning, so please take your time to understand the concept. Furthermore, we have talked about and showcased one simplified version of Ridge regression. However, there is more to this model than that, so we will continue in the next post. Until next time!
