After our first look at the most basic regression problem, hopefully you now have a good understanding of this type of analysis and its approach. It is time for us to take a further step into the world of predictive analysis. In this post, we will still be working with a linear regression model, but it is neither simple nor basic. Specifically, we will talk about building a linear regression pipeline on a complete data set (with more than one feature!). So, let us jump in!
Data exploration
The complete notebook is available here. We will continue with the students1000.csv file we used in the showcase of the complete data pipeline. In fact, the processing part is almost identical to what we did back then. First, we import the libraries and immediately split the data into a training set train and a testing set test. Then, we perform some exploratory analysis on train, starting with info(), which shows some missing data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
#load the data and split it into training and testing sets
data = pd.read_csv('students1000.csv')
train, test = train_test_split(data, test_size=0.2)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 762 to 648
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   StudentID          800 non-null    int64
 1   FirstName          800 non-null    object
 2   LastName           800 non-null    object
 3   Major              800 non-null    object
 4   HighSchoolGPA      800 non-null    float64
 5   FamilyIncome       800 non-null    int64
 6   State              788 non-null    object
 7   AvgDailyStudyTime  786 non-null    float64
 8   TotalAbsence       793 non-null    float64
 9   FirstYearGPA       800 non-null    float64
dtypes: float64(4), int64(2), object(4)
memory usage: 68.8+ KB
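If you prefer the missing counts spelled out explicitly rather than read off the info() summary, a quick check like the following (my addition, not in the original notebook) does the trick:
#count missing values per column; State, AvgDailyStudyTime, and TotalAbsence have gaps
print(train.isna().sum())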
Next, for convenience, we create one list for the numeric columns and one for the categorical columns. As we discussed some time ago, StudentID, FirstName, and LastName are unlikely to be related to students' performance, so we will not investigate them further. We then check the histograms of the numeric columns and the bar charts of the categorical ones. There are not too many issues here besides the skewness in FamilyIncome and a few rare classes in State. We now have enough information to move on to preprocessing.
num_cols = ['HighSchoolGPA','FamilyIncome','AvgDailyStudyTime','TotalAbsence']
cat_cols = ['Major', 'State']
target = 'FirstYearGPA'
#histograms for the numeric columns
train[num_cols].hist(bins=20, figsize=(8,8))
plt.show()
#bar charts for the categorical columns
for col in cat_cols:
    print(col)
    train[col].value_counts().plot.bar(rot=30, figsize=(4,4))
    plt.show()
(The code above displays a histogram for each numeric column and, for Major and State, prints the column name and shows its bar chart.)
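If you would rather quantify those two observations than eyeball the plots, a quick check along these lines (my own addition, not in the original notebook) works:
#skewness of FamilyIncome: a value well above 0 confirms the right skew seen in the histogram
print(train['FamilyIncome'].skew())
#category counts for State: the smallest counts are the rare classes
print(train['State'].value_counts().tail())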
Processing pipeline
Based on what we learned from the small exploratory analysis, we will build the pipeline as follows:
– FamilyIncome: log transform -> standardization -> imputation
– HighSchoolGPA, AvgDailyStudyTime, and TotalAbsence: standardization -> imputation
– Major and State: one-hot encoding
The pipeline construction in SKLearn is as below. This time, we will stop at building the pipeline without fitting or transforming with it. The reason is that we will add our linear regression model on top of this pipeline next. The final pipeline in this step is processing_pipeline.
log_cols = ['FamilyIncome']
num_cols = ['HighSchoolGPA','AvgDailyStudyTime','TotalAbsence']
cat_cols = ['Major','State']
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
#regular pipeline for HighSchoolGPA, AvgDailyStudyTime, and TotalAbsence
num_pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
#log pipeline with log transformation added for FamilyIncome
def log_transform(data):
    return np.log(data)

log_pipeline = Pipeline([
    ('log transform', FunctionTransformer(log_transform)),
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
#categorical pipeline for Major and State
cat_pipeline = Pipeline([
    ('encode', OneHotEncoder(max_categories=5, handle_unknown='infrequent_if_exist'))
])
#combine all in a single pipeline
processing_pipeline = ColumnTransformer([
    ('log trans', log_pipeline, log_cols),
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])
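As mentioned, we hold off on fitting this pipeline because the model will be stacked on top of it next. If you do want to sanity-check what it produces, you could fit and transform it on the training data as a side experiment (this snippet is my addition and is not part of the main flow):
#side check only: fit the processing pipeline on train and inspect the transformed matrix
#expect 4 scaled numeric columns plus the one-hot columns for Major and State
X_check = processing_pipeline.fit_transform(train)
print(X_check.shape)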
Linear regression model
Building pipeline
Given how much we discussed linear regression previously, you may be surprised here: it only takes three more Python statements to add our model on top of the processing pipeline and train it. Specifically, after importing the model, we create another Pipeline with step one being the processing_pipeline and step two being a new LinearRegression model. Finally, we call fit() and give it the train data as well as the target to train the pipeline and the model. And that is it. We now have a fully trained pipeline that preprocesses any input data, then performs linear regression and gives us predictions. SKLearn even visualizes the complete modeling pipeline for us, as you can see below!
from sklearn.linear_model import LinearRegression
linear_reg_pipeline = Pipeline([
    ('processing', processing_pipeline),
    ('modeling', LinearRegression())
])
linear_reg_pipeline.fit(train, train[[target]])
Pipeline(steps=[('processing',
                 ColumnTransformer(transformers=[('log trans',
                                                  Pipeline(steps=[('log transform', FunctionTransformer(func=<function log_transform at 0x0000017905B373A0>)),
                                                                  ('standardize', StandardScaler()),
                                                                  ('impute', SimpleImputer(strategy='median'))]),
                                                  ['FamilyIncome']),
                                                 ('numeric',
                                                  Pipeline(steps=[('standardize', StandardScaler()),
                                                                  ('impute', SimpleImputer(strategy='median'))]),
                                                  ['HighSchoolGPA', 'AvgDailyStudyTime', 'TotalAbsence']),
                                                 ('class',
                                                  Pipeline(steps=[('encode', OneHotEncoder(handle_unknown='infrequent_if_exist', max_categories=5))]),
                                                  ['Major', 'State'])])),
                ('modeling', LinearRegression())])
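Before moving to evaluation, one quick aside (my addition): since the fitted model lives inside the pipeline, you can reach into the modeling step to look at the learned coefficients if you are curious.
#peek at the fitted LinearRegression inside the pipeline
lin_model = linear_reg_pipeline.named_steps['modeling']
print(lin_model.intercept_)
print(lin_model.coef_)
Note that matching each coefficient back to a feature name would require the processing step to expose output feature names, which the plain FunctionTransformer used here does not do out of the box, so I keep it to the raw arrays.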
Model evaluation
What do we do after training a model? Evaluate it, of course. I do want to emphasize that, at this point, we still have not learned how to properly evaluate predictive models, so do not stop here and try to predict stock prices with linear regression! Instead, we will perform a basic evaluation on the training data. If you remember, we can calculate the training MSE, which is 0.043, and the RMSE, which is 0.208. So, on average, this model's predictions are off by about 0.208 points from the true GPAs of the students in the training data. Not too bad, right? As it turns out, it is rather difficult to draw conclusions from MSE and RMSE alone because they scale with the target. For example, if we try to predict incomes, MSE and RMSE can reach the tens of thousands, which is just difficult to interpret.
from sklearn.metrics import mean_squared_error
#get the prediction
trainY_pred = linear_reg_pipeline.predict(train)
#get the MSE
mse_lr = mean_squared_error(train[[target]], trainY_pred)
print(mse_lr)
0.04336001595092426
#get the RMSE by taking the square root of the MSE
np.sqrt(mse_lr)
0.20823067965822006
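As a side note (my addition): if you are on a recent scikit-learn release (1.4 or newer), there is a root_mean_squared_error function that computes the RMSE in one call, so the manual square root is not needed:
#requires scikit-learn >= 1.4
from sklearn.metrics import root_mean_squared_error
print(root_mean_squared_error(train[[target]], trainY_pred))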
That brings us to another very commonly used metric for evaluating regression models: R2 (r-squared). The calculation of R2 is a bit more involved than MSE and RMSE, so I will not go into the details here. In short, R2 represents the percentage of variation in the target that a model can explain, and it caps at 1.0. For example, an R2 of 0.9 means the model can explain 90% of the variation in the target, while an R2 of 0.1 means the model can only explain 10%. On the training data, our model gets an R2 of 0.858, meaning it can explain 85.8% of the variation in GPA. Now we can be fairly confident that this model is pretty good!
from sklearn.metrics import r2_score
r2_lr = r2_score(train[[target]], trainY_pred)
print(r2_lr)
0.8578342151552371
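For the curious, the idea behind R2 is simple even though we skip the derivation: it is one minus the ratio between the model's squared error and the squared error of a baseline that always predicts the mean. A quick manual check (my addition) that should reproduce the r2_score value above:
#R2 = 1 - SS_res / SS_tot
y_true = train[target].to_numpy()
y_pred = trainY_pred.ravel()
ss_res = np.sum((y_true - y_pred) ** 2)           #squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    #squared error of always predicting the mean
print(1 - ss_res / ss_tot)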
Inference
If we like our model, we can now use it to make inferences on new data, which in this case is simulated by our test set. Since we already have a pipeline, this turns out to be extremely easy: just call predict() on our linear_reg_pipeline and we get predictions for any new data! Since we still have the true target in the testing data, we can evaluate the model there too. The MSE and R2 on the testing data are 0.042 and 0.846, which are still very good. As we will learn a bit later, results on the testing data are the more appropriate ones for evaluating a model, not those from the training data, but let us leave that for later.
#get the MSE
testY_pred = linear_reg_pipeline.predict(test)
mse_lr_test = mean_squared_error(test[[target]], testY_pred)
print(mse_lr_test)
0.041509744923245606
#get the R2
r2_lr_test = r2_score(test[[target]], testY_pred)
print(r2_lr_test)
0.8464113906421408
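To make the inference point a bit more concrete, here is a sketch of scoring a single brand-new record. The student below is entirely made up (the values, major, and state are hypothetical), and passing only the columns the pipeline actually uses works in recent scikit-learn versions because the ColumnTransformer drops the unused ones:
#hypothetical new student; all values are made up for illustration
new_student = pd.DataFrame([{
    'HighSchoolGPA': 3.6,
    'FamilyIncome': 55000,
    'AvgDailyStudyTime': 2.5,
    'TotalAbsence': 3,
    'Major': 'Business',
    'State': 'CA'
}])
#unseen Major/State values are handled gracefully by handle_unknown='infrequent_if_exist'
print(linear_reg_pipeline.predict(new_student))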
Conclusion
This is quite a long post, so I should probably stop here. To sum up, we have gone through the process of building a complete pipeline, from data preprocessing to applying a linear regression model at the end. If you have a good understanding of pipelines and linear regression, you should have had no trouble following along. Anyway, see you again soon!