Previously, we have learned about model tuning as well as discussed more about Ridge regression which is one of the regularized linear models. As it turns out, Ridge regression is not the only way to regularize a linear model. There are also Lasso and Elastic Net models. Since we have known Ridge already, we should just go ahead and learn the rest. I think the approaches these models penalize complexities and their results are very interesting to observe. So, let us wait no longer!

Regularized linear models

Let me just start with saying that, while vary in methods, the three models all follow the same idea of regularizing. They all have the training objective as

minimize $training\_MSE + \alpha\times penalty \qquad (\alpha > 0)$

where α is the term to control the strength of regularization. The only major difference is in how they calculate the penalty. As we have known, in Ridge regression, the penalty is the sum of squared coefficients. More specifically, given a model y = a₀ + a₁x₁ + a₂x₂+ ... + a_kx_k, Ridge has penalty = a₀² + a₁² + a₂²+ ... + a_k².

Along that line, Lasso replaces the sum of squared coefficients in the penalty as sum of absolute coefficients: penalty = |a₀| + |a₁| + |a₂| + ... + |a_k|. Lastly, Elastic Net is pretty much a mixture of Ridge and Lasso. This model calculate the penalty as a hybrid of sum of squared coefficients and sum of absolute coefficients: penalty = ρ(|a₀| + ... + |a_k|) + (1-ρ)(a₀² + ... + a_k²) with ρ being a number between 0 and 1. The closer ρ to 0, the more similar to Ridge this model is. On the other hand, the closer ρ to 1, the more closer to Lasso Elastic net is.

The impact of penalties

What do all of these achieve? In short, they differ the impact of the coefficients’ amplitudes on the penalty. Below are the plots that representing the contribute of a coefficient to the penalty as its value changes across the three models.

Ridge

Lasso

Elastic Net

In Ridge, the penalty is a squared function of coefficients. Therefore, the higher scale a coefficient, the more signified its contribution to the penalty. For example, a coefficient of 2 adds 4 to the penalty, 5 adds 25, and 10 adds 100. On the other hand, this contribution deflates as the coefficient reaches 0: 0.5 adds 0.25, 0.1 adds 0.01, etc. Overall, in Ridge regression, very high coefficients are penalized heavily, whereas small coefficients are pretty much left alone. So, in Ridge regression, we tend to see less high-amplitude coefficients, and more those below 1, compared to the other two methods.

How about Lasso? In this case, the penalty is the absolute of the coefficients. Therefore, the contribution of a coefficient is constant with respect to its amplitude. For this reason, coefficients can grow higher in Lasso. Furthermore, those that are small tend to collapse to 0, which effectively remove their features from the model. This effect gives Lasso the name “sparse model“. Finally, being a hybrid model, Elastic Net can be adjusted to fit into specific data sets by finetuning the ρ term. Overall, it get more flexibility than Ridge and Lasso.

Regularized linear models in SKLearn

Finally, let us get some hands-on in Python with the three models. We will still be working with the auto-mpg data. However, since we need to finetune and apply three models this time, I split the data processing pipeline to avoid retraining and retransforming data multiple time. At the end, we store the processed training data is in train_features, and testing data test_features. Please expand if you would like to view the complete pipeline code. You can also download the notebook here.

Click to expand

In [1]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = pd.read_csv('auto-mpg.csv')

train, test = train_test_split(data, test_size=0.2)

num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
cat_cols = ['origin']
target = 'mpg'

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

#pipeline for class features
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder())
])

num_pipeline = Pipeline([    
    ('impute', SimpleImputer(strategy='median')),
    ('polynomial', PolynomialFeatures(degree=2)),
    ('standardize', StandardScaler()),
])

data_pipeline = ColumnTransformer([
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])

In [ ]:

train_features = data_pipeline.fit_transform(train)
test_features = data_pipeline.transform(test)

Ridge

We have learned to create and tune a Ridge regression model previously. It is just the same here. The only different is that here, we apply GridSearchCV only on the model, not the complete pipeline. Therefore, the hyper-parameter names no longer needs the ridge__ part. After training, we can observe that the best alpha is 1, the CV R2 of the model is 0.846, and testing R2 0.892.

In [2]:

from sklearn.model_selection import GridSearchCV

param_grid = [{'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100, 500, 1000]}]

grid_search = GridSearchCV(Ridge(), param_grid, cv=10, scoring='r2', return_train_score=True)

grid_search.fit(train_features,train[[target]])

Out[2]:

GridSearchCV(cv=10, estimator=Ridge(),
             param_grid=[{'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5,
                                    10, 50, 100, 500, 1000]}],
             return_train_score=True, scoring='r2')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [3]:

grid_search.best_params_

Out[3]:

{'alpha': 1}

In [4]:

grid_search.best_score_

Out[4]:

0.8463973774175086

In [5]:

best_ridge_reg = grid_search.best_estimator_

best_ridge_reg.score(test_features, test[[target]])

Out[5]:

0.8921331386167156

Lasso

Next, we perform Lasso model on the same processed data. The code is pretty much the same as in Ridge. We only replace the model with Lasso. In terms of result, the best alpha is 0.01, training CV R2 is 0.848, and testing R2 0.889. So, in this split, Lasso is pretty much the same as Ridge. It is slightly better in training CV, but a touch worse in testing data.

In [8]:

from sklearn.linear_model import Lasso

param_grid = [{'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]

grid_search = GridSearchCV(Lasso(), param_grid, cv=10, scoring='r2', return_train_score=True)

grid_search.fit(train_features, train[[target]])

Out[8]:

GridSearchCV(cv=10, estimator=Lasso(),
             param_grid=[{'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5,
                                    10, 50, 100]}],
             return_train_score=True, scoring='r2')

In [9]:

grid_search.best_params_

Out[9]:

{'alpha': 0.01}

In [10]:

grid_search.best_score_

Out[10]:

0.8479689185491377

In [11]:

best_lasso_reg = grid_search.best_estimator_

best_lasso_reg.score(test_features, test[[target]])

Out[11]:

0.8888602375125443

Elastic Net

Finally, let try fitting Elastic Net on this data. As we discuss, the penalty term in Elastic net is a mixture of Ridge and Lasso, with the mixing proportion decided by ρ. In this case, ρ is also a hyper-parameter to finetune which is the l1_ratio term in SKLearn. Therefore, our param_grid now has two hyper-paramters, alpha and l1_ratio. Outside of this, we return to the same code as Ridge and Lasso. After tuning, the best model has alpha of 0.001 and l1_ratio of 0.2 which means this model behaves more closely to Ridge. It gets the training CV R2 of 0.853, and testing R2 of 0.895, and is the best among three.

In [14]:

from sklearn.linear_model import ElasticNet

param_grid = [{
    'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100],
    'l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
}]

grid_search = GridSearchCV(ElasticNet(max_iter=5000), param_grid, cv=5, scoring='r2', return_train_score=True)

grid_search.fit(train_features, train[[target]])

Out[14]:

GridSearchCV(cv=5, estimator=ElasticNet(max_iter=5000),
             param_grid=[{'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5,
                                    10, 50, 100],
                          'l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
                                       0.9]}],
             return_train_score=True, scoring='r2')

In [15]:

grid_search.best_params_

Out[15]:

{'alpha': 0.001, 'l1_ratio': 0.2}

In [16]:

grid_search.best_score_

Out[16]:

0.8534936887179224

In [17]:

best_enet_reg = grid_search.best_estimator_

best_enet_reg.score(test_features, test[[target]])

Out[17]:

0.8945114660945874

Conclusion

After this post, hopefully you have had a good idea on the concepts as well as the implementations of regularized linear models including Ridge, Lasso, and Elastic Net. Personally, I think that, regardless of regularization methods, these three are still linear models. So, their performances would come pretty close to each other most of the times. Regardless, if your whole intention is the best performance possible, then it is still worth trying all and pick the best one. With that, I will conclude this post. See you again next time!