Remember how I said a while ago that we had not yet learned to properly evaluate models? Well, let us fix that today! Recall that the point of predictive analysis is to learn from historical data so we can make predictions on new data. To simulate that situation, we use train-test splitting, where the training portion represents historical data and the testing portion represents new data. Naturally, we can use the testing data for evaluation. However, evaluating models with a single testing set is not very reliable due to the randomness of the split. Furthermore, the testing data is more like a “graduation” exam: we only let models try it once they perform well enough on the training data. To evaluate models for adjustment mid-training, we need a technique called cross-validation.
Data for demonstration
The complete notebook for this post is available here. As this is a continuation of the linear regression post, we use the same data set and processing pipeline. For completeness, I include the code to load and preprocess the data below; however, please refer to the old post for the full discussion of data processing.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
data = pd.read_csv('students1000.csv')
log_cols = ['FamilyIncome']
num_cols = ['HighSchoolGPA','AvgDailyStudyTime','TotalAbsence']
cat_cols = ['Major','State']
target = 'FirstYearGPA'
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
#regular pipeline for HighSchoolGPA, AvgDailyStudyTime, and TotalAbsence
num_pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
#log pipeline with log transformation added for FamilyIncome
def log_transform(data):
    return np.log(data)

log_pipeline = Pipeline([
    ('log transform', FunctionTransformer(log_transform)),
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
#categorical pipeline for Major and State
cat_pipeline = Pipeline([
    ('encode', OneHotEncoder(max_categories=5, handle_unknown='infrequent_if_exist'))
])

processing_pipeline = ColumnTransformer([
    ('log trans', log_pipeline, log_cols),
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])
Throughout the post, we will work with the linear regression pipeline linear_reg_pipeline created below.
from sklearn.linear_model import LinearRegression
linear_reg_pipeline = Pipeline([
    ('processing', processing_pipeline),
    ('modeling', LinearRegression())
])
The randomness in train-test split
Let us first investigate the motivation for cross-validation. Below are two cells with identical code: in both, I split the data into 80% training and 20% testing, fit the model, and compute its R2 on both sets. You can clearly see that, while the code is the same, the results are not. The training R2's differ by about 0.012, and the testing R2's by a whopping 0.043! Of course, you can rerun these cells and observe how the values keep changing, not just these two. The reason is the randomness in the train-test split. While this is not the end of the world, it can cause issues. For example, comparing two models to pick the better one is not reliable when the results fluctuate like this.
train, test = train_test_split(data, test_size=0.2)
linear_reg_pipeline.fit(train, train[[target]])
print('training r2:', linear_reg_pipeline.score(train,train[[target]]))
print('testing r2:', linear_reg_pipeline.score(test,test[[target]]))
training r2: 0.8471151657163838
testing r2: 0.8837116449064917
train, test = train_test_split(data, test_size=0.2)
linear_reg_pipeline.fit(train, train[[target]])
print('training r2:', linear_reg_pipeline.score(train,train[[target]]))
print('testing r2:', linear_reg_pipeline.score(test,test[[target]]))
training r2: 0.8592036109478167
testing r2: 0.840746003829508
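To see just how much single-split results can swing, here is a minimal sketch that repeats the split-fit-score loop 20 times and reports the spread of the testing R2 (your exact numbers will differ on every run):
#repeat the random split several times and record the testing R2 each time
test_scores = []
for i in range(20):
    train_i, test_i = train_test_split(data, test_size=0.2)
    linear_reg_pipeline.fit(train_i, train_i[[target]])
    test_scores.append(linear_reg_pipeline.score(test_i, test_i[[target]]))
print('lowest testing r2:', np.min(test_scores))
print('highest testing r2:', np.max(test_scores))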
Cross-validation
Since a single train-test split can be too random, leading to unreliable comparisons, an easy fix is to split multiple times. One of the most common techniques for doing this is k-fold cross-validation. In short, we first randomly split the dataset into k subsets. Then, we repeat the following k times: rotating through the subsets, we use (k-1) of them for training and the remaining one for testing, also called validating. In each round, we evaluate our model, and at the end we aggregate the results.
Below is an illustrative example with k=4. The data is first split into four portions. Then, over four rounds, we rotate the subsets so that three portions are used for training and the remaining one for validation. After each round, we obtain one evaluation measure, for example, MSE or R2. At the end, we average the results of the four rounds to conclude the model's performance.
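If you want to see the rotation mechanics yourself, SKLearn provides the KFold class, which generates the train/validation indices for each round. Here is a small sketch with k=4 on a toy array of 8 samples, purely for illustration:
from sklearn.model_selection import KFold
#illustrate the 4-fold rotation on a tiny toy dataset (8 samples, one feature)
toy = np.arange(8).reshape(-1, 1)
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for round_number, (train_idx, val_idx) in enumerate(kf.split(toy), start=1):
    print('round', round_number, '- train on', train_idx, ', validate on', val_idx)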
Cross-validation is better than a single train-test split simply because it repeats the process multiple times, which offsets the effects of randomness. Of course, we will still obtain different results each time we run the process; however, with a high enough k, the variation should be negligible. You should also be careful when selecting k so that the validation fold does not become too small. You will probably see most people use 10-fold cross-validation.
In SKLearn
In SKLearn, we use the cross_val_score() function to perform cross-validation evaluation on a given model. The function takes three main inputs: the model (or model pipeline), the data, and the target. Optional inputs are cv, the number of folds, and scoring, the measurement we want to obtain. In the code below, we perform 5-fold cross-validation for our linear regression pipeline using neg_mean_squared_error, which is simply the negative MSE. It is negated so that it becomes a score measurement rather than an error measurement, i.e., higher means better. cross_val_score() performs fit() internally, so we do not need to do that manually. The direct output of the function is the list of scores from each round, as you can see. To obtain the final evaluation, we can average them with numpy.mean(). Finally, to get the cross-validation R2, we simply change scoring to r2 (the last cell below also bumps cv to 10 folds).
from sklearn.model_selection import cross_val_score
#get the MSE
mse_lr_cv = cross_val_score(linear_reg_pipeline, train, train[[target]], cv=5, scoring='neg_mean_squared_error')
print(mse_lr_cv)
[-0.04016476 -0.04822953 -0.03718822 -0.04652823 -0.04540672]
mse_lr_cv = cross_val_score(linear_reg_pipeline, train, train[[target]], cv=5, scoring='neg_mean_squared_error')
print(-np.mean(mse_lr_cv))
0.04350349447495508
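As a side note, because these scores are negative MSEs, you can recover a more interpretable RMSE (in the original GPA units) by negating the mean and taking a square root, as sketched below:
#convert the averaged negative MSE into an RMSE on the original GPA scale
print('cv rmse:', np.sqrt(-np.mean(mse_lr_cv)))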
#get the R2
r2_lr_cv = cross_val_score(linear_reg_pipeline, train, train[[target]], cv=10, scoring='r2')
print(np.mean(r2_lr_cv))
0.8595072792118769
If our only purpose is to evaluate a model, we do not really need a train-test split at all. However, when we want to evaluate and adjust a model mid-training, we split the data and perform cross-validation on the training set only. This is called fine-tuning or model tuning, which we will discuss very soon.
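To give a flavor of what that workflow looks like, the sketch below compares our linear regression pipeline against a second candidate (a Ridge regression here, chosen purely as an example of a competing model) using cross-validation on the training set only, keeping the testing set untouched for the final evaluation:
from sklearn.linear_model import Ridge
#a second candidate pipeline that reuses the same preprocessing
ridge_pipeline = Pipeline([
    ('processing', processing_pipeline),
    ('modeling', Ridge(alpha=1.0))
])
#compare the two candidates with 5-fold cross-validation on the training set only
train, test = train_test_split(data, test_size=0.2)
r2_linear = cross_val_score(linear_reg_pipeline, train, train[[target]], cv=5, scoring='r2')
r2_ridge = cross_val_score(ridge_pipeline, train, train[[target]], cv=5, scoring='r2')
print('linear regression cv r2:', np.mean(r2_linear))
print('ridge cv r2:', np.mean(r2_ridge))
#whichever candidate scores higher here would then get the one-time check on the testing set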
Wrapping up
In this post, we have gone through the concept of and hands-on practice with cross-validation. This is a very important technique in data analytics for model evaluation and model tuning, so you really should take the time to fully understand it. Next, we will return to regression and explore more interesting things to do. See you again then!