After our first look at the most basic regression problem, hopefully you now have a good understanding of this type of analysis and its approach. It is time for us to take a further step into the world of predictive analysis. In this post, we will still be working with a linear regression model, but it is neither simple nor basic. Specifically, we will talk about building a linear regression pipeline on a complete data set (with more than one feature!). So, let us jump in!
Data exploration
The complete notebook is available here. We will continue with the students1000.csv file we used in the showcase of the complete data pipeline. In fact, the processing part is almost identical to what we did back then. First, we import the libraries and immediately split the data into a training set train and a testing set test. Then, we perform some exploratory analysis on train, starting with info(), which shows some missing data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
#load the data and split it into training and testing sets
data = pd.read_csv('students1000.csv')
train, test = train_test_split(data, test_size=0.2)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 762 to 648
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   StudentID          800 non-null    int64
 1   FirstName          800 non-null    object
 2   LastName           800 non-null    object
 3   Major              800 non-null    object
 4   HighSchoolGPA      800 non-null    float64
 5   FamilyIncome       800 non-null    int64
 6   State              788 non-null    object
 7   AvgDailyStudyTime  786 non-null    float64
 8   TotalAbsence       793 non-null    float64
 9   FirstYearGPA       800 non-null    float64
dtypes: float64(4), int64(2), object(4)
memory usage: 68.8+ KB
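If you prefer the missing counts spelled out explicitly rather than read off the info() summary, a quick check like the following (my addition, not in the original notebook) does the trick:
#count missing values per column; State, AvgDailyStudyTime, and TotalAbsence have gaps
print(train.isna().sum())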
Next, for convenience, we create one list for the numeric columns and one for the categorical columns. As we discussed some time ago, StudentID, FirstName, and LastName are unlikely to be related to students' performance, so we will not investigate them further. We then check the histograms of the numeric columns and the bar charts of the categorical ones. There are not too many issues here besides the skewness in FamilyIncome and a few rare classes in State. We now have enough information to move on to preprocessing.
num_cols = ['HighSchoolGPA','FamilyIncome','AvgDailyStudyTime','TotalAbsence']
cat_cols = ['Major', 'State']
target = 'FirstYearGPA'
#histograms for the numeric columns
train[num_cols].hist(bins=20, figsize=(8,8))
plt.show()
#bar charts for the categorical columns
for col in cat_cols:
    print(col)
    train[col].value_counts().plot.bar(rot=30, figsize=(4,4))
    plt.show()
(The code above displays a histogram for each numeric column and, for Major and State, prints the column name and shows its bar chart.)
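If you would rather quantify those two observations than eyeball the plots, a quick check along these lines (my own addition, not in the original notebook) works:
#skewness of FamilyIncome: a value well above 0 confirms the right skew seen in the histogram
print(train['FamilyIncome'].skew())
#category counts for State: the smallest counts are the rare classes
print(train['State'].value_counts().tail())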
Processing pipeline
Based on what we learned from the small exploratory analysis, we will build the pipeline as follows:
– FamilyIncome: log transform -> standardization -> imputation
– HighSchoolGPA, AvgDailyStudyTime, and TotalAbsence: standardization -> imputation
– Major and State: one-hot encoding
The pipeline construction in SKLearn is as below. This time, we will stop at building the pipeline without fitting or transforming with it. The reason is that we will add our linear regression model on top of this pipeline next. The final pipeline in this step is processing_pipeline.
log_cols = ['FamilyIncome']
num_cols = ['HighSchoolGPA','AvgDailyStudyTime','TotalAbsence']
cat_cols = ['Major','State']
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
#regular pipeline for HighSchoolGPA, AvgDailyStudyTime, and TotalAbsence
num_pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
#log pipeline with log transformation added for FamilyIncome
def log_transform(data):
    return np.log(data)

log_pipeline = Pipeline([
    ('log transform', FunctionTransformer(log_transform)),
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
#categorical pipeline for Major and State
cat_pipeline = Pipeline([
    ('encode', OneHotEncoder(max_categories=5, handle_unknown='infrequent_if_exist'))
])
#combine all in a single pipeline
processing_pipeline = ColumnTransformer([
    ('log trans', log_pipeline, log_cols),
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])
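As mentioned, we hold off on fitting this pipeline because the model will be stacked on top of it next. If you do want to sanity-check what it produces, you could fit and transform it on the training data as a side experiment (this snippet is my addition and is not part of the main flow):
#side check only: fit the processing pipeline on train and inspect the transformed matrix
#expect 4 scaled numeric columns plus the one-hot columns for Major and State
X_check = processing_pipeline.fit_transform(train)
print(X_check.shape)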
Linear regression model
Building pipeline
Given how much we discussed linear regression previously, you may be surprised here: it only takes three more Python statements to add our model on top of the processing pipeline and train it. Specifically, after importing the model, we create another Pipeline with step one being the processing_pipeline and step two being a new LinearRegression model. Finally, we call fit() and give it the train data as well as the target to train the pipeline and the model. And that is it. We now have a fully trained pipeline that preprocesses any input data, then performs linear regression and gives us predictions. SKLearn even visualizes the complete modeling pipeline for us, as you can see below!
from sklearn.linear_model import LinearRegression
linear_reg_pipeline = Pipeline([
    ('processing', processing_pipeline),
    ('modeling', LinearRegression())
])
linear_reg_pipeline.fit(train, train[[target]])
Pipeline(steps=[('processing',
                 ColumnTransformer(transformers=[('log trans',
                                                  Pipeline(steps=[('log transform', FunctionTransformer(func=<function log_transform at 0x0000017905B373A0>)),
                                                                  ('standardize', StandardScaler()),
                                                                  ('impute', SimpleImputer(strategy='median'))]),
                                                  ['FamilyIncome']),
                                                 ('numeric',
                                                  Pipeline(steps=[('standardize', StandardScaler()),
                                                                  ('impute', SimpleImputer(strategy='median'))]),
                                                  ['HighSchoolGPA', 'AvgDailyStudyTime', 'TotalAbsence']),
                                                 ('class',
                                                  Pipeline(steps=[('encode', OneHotEncoder(handle_unknown='infrequent_if_exist', max_categories=5))]),
                                                  ['Major', 'State'])])),
                ('modeling', LinearRegression())])
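Before moving to evaluation, one quick aside (my addition): since the fitted model lives inside the pipeline, you can reach into the modeling step to look at the learned coefficients if you are curious.
#peek at the fitted LinearRegression inside the pipeline
lin_model = linear_reg_pipeline.named_steps['modeling']
print(lin_model.intercept_)
print(lin_model.coef_)
Note that matching each coefficient back to a feature name would require the processing step to expose output feature names, which the plain FunctionTransformer used here does not do out of the box, so I keep it to the raw arrays.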
Model evaluation
What do we do after training a model? Evaluate it, of course. I do want to emphasize that, at this point, we still have not learned how to properly evaluate predictive models, so do not stop here and try to predict stock prices with linear regression! Instead, we will perform a basic evaluation on the training data. If you remember, we can calculate the training MSE, which is 0.043, and the RMSE, which is 0.208. So, on average, this model's predictions are off by about 0.208 points from the true GPAs of the students in the training data. Not too bad, right? As it turns out, it is rather difficult to draw conclusions from MSE and RMSE alone because they scale with the target. For example, if we try to predict incomes, MSE and RMSE can reach the tens of thousands, which is just difficult to interpret.
from sklearn.metrics import mean_squared_error
#get the prediction
trainY_pred = linear_reg_pipeline.predict(train)
#get the MSE
mse_lr = mean_squared_error(train[[target]], trainY_pred)
print(mse_lr)
0.04336001595092426
#get the RMSE by taking the square root of the MSE
np.sqrt(mse_lr)
0.20823067965822006
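As a side note (my addition): if you are on a recent scikit-learn release (1.4 or newer), there is a root_mean_squared_error function that computes the RMSE in one call, so the manual square root is not needed:
#requires scikit-learn >= 1.4
from sklearn.metrics import root_mean_squared_error
print(root_mean_squared_error(train[[target]], trainY_pred))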
That brings us to another very commonly used metric for evaluating regression models: R2 (r-squared). The calculation of R2 is a bit more involved than MSE and RMSE, so I will not go into the details here. In short, R2 represents the percentage of variation in the target that a model can explain, and it caps at 1.0. For example, an R2 of 0.9 means the model can explain 90% of the variation in the target, while an R2 of 0.1 means the model can only explain 10%. On the training data, our model gets an R2 of 0.858, meaning it can explain 85.8% of the variation in GPA. Now we can be fairly confident that this model is pretty good!
from sklearn.metrics import r2_score
r2_lr = r2_score(train[[target]], trainY_pred)
print(r2_lr)
0.8578342151552371
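For the curious, the idea behind R2 is simple even though we skip the derivation: it is one minus the ratio between the model's squared error and the squared error of a baseline that always predicts the mean. A quick manual check (my addition) that should reproduce the r2_score value above:
#R2 = 1 - SS_res / SS_tot
y_true = train[target].to_numpy()
y_pred = trainY_pred.ravel()
ss_res = np.sum((y_true - y_pred) ** 2)           #squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    #squared error of always predicting the mean
print(1 - ss_res / ss_tot)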
Inference
If we like our model, we can now use it to make inferences on new data, which in this case is simulated by our test set. Since we already have a pipeline, this turns out to be extremely easy: just call predict() on our linear_reg_pipeline and we get predictions for any new data! Since we still have the true target in the testing data, we can evaluate the model there too. The MSE and R2 on the testing data are 0.042 and 0.846, which are still very good. As we will learn a bit later, results on the testing data are the more appropriate ones for evaluating a model, not those from the training data, but let us leave that for later.
#get the MSE
testY_pred = linear_reg_pipeline.predict(test)
mse_lr_test = mean_squared_error(test[[target]], testY_pred)
print(mse_lr_test)
0.041509744923245606
#get the R2
r2_lr_test = r2_score(test[[target]], testY_pred)
print(r2_lr_test)
0.8464113906421408
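To make the inference point a bit more concrete, here is a sketch of scoring a single brand-new record. The student below is entirely made up (the values, major, and state are hypothetical), and passing only the columns the pipeline actually uses works in recent scikit-learn versions because the ColumnTransformer drops the unused ones:
#hypothetical new student; all values are made up for illustration
new_student = pd.DataFrame([{
    'HighSchoolGPA': 3.6,
    'FamilyIncome': 55000,
    'AvgDailyStudyTime': 2.5,
    'TotalAbsence': 3,
    'Major': 'Business',
    'State': 'CA'
}])
#unseen Major/State values are handled gracefully by handle_unknown='infrequent_if_exist'
print(linear_reg_pipeline.predict(new_student))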
Conclusion
This is quite a long post, so I should probably stop here. To sum up, we have gone through the process of building a complete pipeline, from data preprocessing to applying a linear regression model at the end. If you have a good understanding of pipelines and linear regression, you should have had no trouble following along. Anyway, see you again soon!