In the last post, we talked about model regularization and demonstrated the concept with a model called Ridge regression. However, the Ridge model we used back then was actually a simplified version. The full Ridge regression model not only incorporates regularization but also lets us control its strength. As it turns out, different regularization strengths impact the model's performance, sometimes quite significantly, so choosing a good strength level is usually necessary when using Ridge regression. We refer to the strength of regularization as a hyper-parameter of Ridge regression, and the process of selecting hyper-parameters is called model tuning. In general, most models have some type of hyper-parameters that need tuning. So, in this post, we will discuss the concepts of hyper-parameters and model tuning, with a demonstration on Ridge regression.
Ridge regression and hyper-parameters
Let us first use Ridge regression to understand the concept of hyper-parameters. With an equation of $y = a_0 + a_1x_1 + a_2x_2 + \dots + a_kx_k$, the training objective of this model is as follows:

$$\text{minimize} \quad \mathrm{MSE} + \alpha \sum_{j=1}^{k} a_j^2, \qquad \text{with } \alpha > 0$$
So what is different from before? If you notice, this time there is a term α > 0 multiplying the penalty term. The idea is that α determines the level of regularization in this model. More specifically, a small α value, such as 0.001, diminishes the contribution of the penalty to the training objective. In this case, the coefficients can have higher scales while still not increasing the training objective too much. On the other hand, a big α value, like 1000, amplifies the contribution of the penalty to the training objective. Now, a small increase in the coefficients' scale may lead to a big increase in the training objective. Overall, a small α means weaker regularization, which allows more complex models, whereas a high α means stronger regularization and keeps models simpler.
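To make the role of α concrete, here is a minimal NumPy sketch of the training objective above. The function name ridge_objective and the toy numbers are just for illustration, and for simplicity this sketch penalizes every coefficient, including the intercept.

import numpy as np

# Minimal sketch of the Ridge training objective:
# mean squared error plus alpha times the sum of squared coefficients.
def ridge_objective(coefs, X, y, alpha):
    mse = np.mean((y - X @ coefs) ** 2)
    penalty = np.sum(coefs ** 2)
    return mse + alpha * penalty

# The same coefficients are penalized very differently under small vs. large alpha.
coefs = np.array([0.5, 4.0])
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])  # first column carries the intercept
y = np.array([9.0, 13.0, 20.0])
print(ridge_objective(coefs, X, y, alpha=0.001))  # ~0.27: penalty barely matters
print(ridge_objective(coefs, X, y, alpha=1000))   # ~16250: penalty dominates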
Below is an example of fitting Ridge regression models with different α values on the same data. We can see that the fitted curves become less complex as α grows. However, when α is too high, the curve fails to represent the underlying pattern in the data because it is regularized too strictly. This is not always the case though. In reality, there are no definite rules for picking the best α, which leads us to the concept of model tuning.
[Figure: Ridge regression fits on the same data with α = 0, 0.001, 1, 10, 100, and 1000]
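A comparison like the one in the figure could be generated with a loop along these lines; X and y here are placeholders for the plotted training data, not variables defined in this post:

from sklearn.linear_model import Ridge

# Fit one Ridge model per regularization strength and compare the results.
for alpha in [0, 0.001, 1, 10, 100, 1000]:
    model = Ridge(alpha=alpha)
    model.fit(X, y)            # X, y: the training data shown in the figure
    print(alpha, model.coef_)  # coefficients shrink toward zero as alpha grows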
Model tuning and Grid-Search
Model tuning
In a Ridge model, the coefficients a0, a1, …, ak are called trainable parameters because they are estimated from the data. α, on the other hand, must be set before the training starts. Parameters like α are called hyper-parameters, and they must be set through model tuning. While it may sound complicated, tuning a model is literally trying a bunch of different values for its hyper-parameters and selecting those that yield the best performance.
Let us use Ridge regression as an example. Tuning a Ridge model means that we fit it multiple times on the same training data, each time with a different value of α, e.g., 0.01, 0.1, 1, 10, 100. Next, we perform some kind of evaluation, for example, calculating the cross-validation R² for each model. Say the model with α = 0.01 gets a CV R² of 0.8, α = 0.1 gets 0.81, α = 1 gets 0.83, α = 10 gets 0.82, and α = 100 gets 0.79. We then conclude that 1 is the best value for α and finish the tuning process.
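In code, this manual tuning loop might look like the sketch below. Note that the CV scores above are made-up numbers for the example, and X_train, y_train are placeholders for a training data set:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Evaluate each candidate alpha with 10-fold cross-validated R2.
for alpha in [0.01, 0.1, 1, 10, 100]:
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=10, scoring='r2')
    print(alpha, scores.mean())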
Grid-search
The tuning process that I just described in the previous example is called grid-search cross-validation. To formalize: in a grid-search tuning process, we first define a grid of candidate values for each hyper-parameter. The process then fits a model for each combination of hyper-parameter values in the grid, calculates their CV performances, and returns the best combination (the one with the highest CV performance). For example, say we have a model with two hyper-parameters a and b, and we want to perform a grid-search CV with a in (0.1, 1) and b in (1, 10). In this case, the process will fit and evaluate four models, (a=0.1, b=1), (a=0.1, b=10), (a=1, b=1), and (a=1, b=10), on the training data. Lastly, it selects the combination, for example (a=0.1, b=10), that gives the best CV R² to build the final model.
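To see the cross-product nature of a grid concretely, here is a small sketch using a and b, the two hypothetical hyper-parameters from the example:

from itertools import product

grid = {'a': [0.1, 1], 'b': [1, 10]}

# Grid-search evaluates every combination of hyper-parameter values.
for a, b in product(grid['a'], grid['b']):
    print(f"fit and cross-validate a model with a={a}, b={b}")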
Grid-search CV is not the only way to tune a model. However, it is among the easiest to understand and to use, so let us stick with it for now.
Model tuning with SKLearn
Now, let me demonstrate the tuning process using grid-search CV in SKLearn. We still stick with the auto-mpg data and will just use quadratic features this time. You can expand the code that loads the data and builds the processing pipeline below if interested. The complete notebook is available here.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = pd.read_csv('auto-mpg.csv')
train, test = train_test_split(data, test_size=0.2)
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
cat_cols = ['origin']
target = 'mpg'
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
# pipeline for class (categorical) features
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder())
])
# pipeline for numeric features: impute, expand to quadratic terms, standardize
num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('polynomial', PolynomialFeatures(degree=2)),
    ('standardize', StandardScaler()),
])
# combine both pipelines into a single column-wise transformer
data_pipeline = ColumnTransformer([
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])
Setting up the grid search
To perform a grid-search tuning, we use the GridSearchCV model in SKLearn, which takes two main inputs: the model (or model pipeline) and the hyper-parameter grid. First, the model pipeline is easy enough; it is just like any other model we have worked with so far. Next, we define the hyper-parameter grid using the syntax below: a dictionary whose keys are the hyper-parameter names and whose values are lists of candidate values. GridSearchCV expects this dictionary wrapped in a list (passing several dictionaries would define several grids to search, but one is enough for us).

<grid name> = [{
    '<hyper-parameter-1 name>': [<value1>, <value2>, ...],
    '<hyper-parameter-2 name>': [<value1>, <value2>, ...],
    ...
}]
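For instance, a grid over two actual Ridge hyper-parameters could look like the following; the candidate values are arbitrary examples:

param_grid = [{
    'alpha': [0.1, 1, 10],          # regularization strength
    'fit_intercept': [True, False]  # whether to fit an intercept term
}]
# This grid describes 3 x 2 = 6 combinations to evaluate.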
In the code below, we first create a regular model pipeline with Ridge regression as the output step and name that step 'ridge'. A grid search can involve hyper-parameters from different steps of a pipeline, so we indicate which step each one belongs to by using the syntax <step name>__<hyper-parameter> for the keys of the dictionary. That is why the hyper-parameter in the grid below is named ridge__alpha. Next, we build the GridSearchCV with the optional inputs cv as the number of CV folds, scoring to select the metric to use, and return_train_score=True so that we can observe the performances of all hyper-parameter combinations if needed. Finally, we call fit() to start the tuning process.
from sklearn.model_selection import GridSearchCV
ridge_reg_pipeline = Pipeline([
    ('processing', data_pipeline),
    ('ridge', Ridge())
])
param_grid = [{'ridge__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]}]
grid_search = GridSearchCV(ridge_reg_pipeline, param_grid, cv=10, scoring='r2', return_train_score=True)
grid_search.fit(train, train[[target]])
GridSearchCV(cv=10, estimator=Pipeline(steps=[('processing', ColumnTransformer(transformers=[('numeric', Pipeline(steps=[('impute', SimpleImputer(strategy='median')), ('polynomial', PolynomialFeatures()), ('standardize', StandardScaler())]), ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']), ('class', Pipeline(steps=[('encoder', OneHotEncoder())]), ['origin'])])), ('ridge', Ridge())]), param_grid=[{'ridge__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]}], return_train_score=True, scoring='r2')
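Since we passed return_train_score=True, the scores of every combination are stored in the cv_results_ attribute. One convenient way to inspect them, sketched below, is to load that dictionary into a DataFrame (pandas was imported above as pd); the column selection is just one reasonable choice:

results = pd.DataFrame(grid_search.cv_results_)
results[['param_ridge__alpha', 'mean_train_score', 'mean_test_score']]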
Examining the search result
After tuning, we use the best_estimator_ property of the grid search object to obtain the best model, which has α = 0.05 (you can see this in the Ridge step of the output below). Additionally, the best_score_ property records the best CV performance, a CV R² of 0.863 in this case. This equals the CV R² we get by manually fitting a Ridge model with α = 0.05, as shown below. The default Ridge model (α = 1) on this split of the data yields a CV R² of 0.86, so tuning gives us a slight improvement. Note that this is only the case for this particular data set; on others, not tuning may give you very poor model performance.
grid_search.best_estimator_
Pipeline(steps=[('processing', ColumnTransformer(transformers=[('numeric', Pipeline(steps=[('impute', SimpleImputer(strategy='median')), ('polynomial', PolynomialFeatures()), ('standardize', StandardScaler())]), ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']), ('class', Pipeline(steps=[('encoder', OneHotEncoder())]), ['origin'])])), ('ridge', Ridge(alpha=0.05))])
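If you only need the winning hyper-parameter values rather than the whole fitted pipeline, the best_params_ property returns them as a dictionary:

grid_search.best_params_
# {'ridge__alpha': 0.05}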
grid_search.best_score_
0.8633998630137383
from sklearn.model_selection import cross_val_score

best_ridge_reg = grid_search.best_estimator_
r2_10cv = cross_val_score(best_ridge_reg, train, train[[target]], cv=10, scoring='r2')
np.mean(r2_10cv)
0.8633998630137383
Inference with the best model
best_estimator_ gives us the model with the best set of hyper-parameters. Therefore, we can use it for inference, like making predictions on new data or getting the final model evaluation on the test data. Below are two examples of doing that. As you can see, the R² on the testing data is 0.856, which is not far from that of the training data, so it is safe to say that our model is not overfitting.
best_ridge_reg = grid_search.best_estimator_
testY_pred = best_ridge_reg.predict(test)
from sklearn.metrics import r2_score
r2_score(test[[target]], testY_pred)
0.8558428749586002
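For genuinely new observations, the same predict() call works as long as the input carries the original raw columns, since the pipeline handles all preprocessing. The feature values below are hypothetical:

# a hypothetical new car with the raw auto-mpg columns
new_car = pd.DataFrame([{
    'cylinders': 4, 'displacement': 120.0, 'horsepower': 90.0,
    'weight': 2500.0, 'acceleration': 15.0, 'year': 78, 'origin': 1
}])
best_ridge_reg.predict(new_car)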
Conclusion
In this post, we discussed and got some hands-on experience with model tuning. This is a very important process in data analytics and machine learning, so please take your time to really understand the related concepts like hyper-parameters and grid-search. For now though, I will conclude this post here. See you next time!