Linear Regression Model

An illustration of training and inference with a linear regression model

After the first step of looking at the most basic regression problem, hopefully you now have a good understanding of this type of analysis and its approach. It is time for us to take a further step into the world of predictive analysis. In this post, we will still be working with a linear regression model, but it is neither simple nor basic. Specifically, we will build a linear regression pipeline on a complete data set (with more than one feature!). So, let us jump in!

Data exploration

The complete notebook is available here. We will continue with the students1000.csv file we used in the showcase of the complete data pipeline. In fact, the processing part is almost identical to what we did back then. First, we import the libraries and immediately split the data into a training set train and a testing set test. Then, we perform some exploratory analysis on train, starting with info(), which shows some missing data.
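If you do not have the notebook open, the setup would look roughly like this. The file path, split ratio, and random seed below are my own assumptions, not necessarily what the notebook uses:

# load the data and split it into training and testing sets
import pandas as pd
from sklearn.model_selection import train_test_split

students = pd.read_csv('students1000.csv')
train, test = train_test_split(students, test_size=0.2, random_state=42)

# info() shows the column types and reveals missing values in the training set
train.info()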

Next, for convenience, we create one list of numeric columns and one of categorical columns. As we discussed quite some time ago, StudentID, FirstName, and LastName are not likely related to students' performance, so we will not investigate them further. We then check the histograms of the numeric columns and the bar charts of the categorical ones. There are not too many issues here besides the skewness in FamilyIncome and a few rare classes in State. We now have enough information to move on to preprocessing.
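Here is one possible sketch of this step; the column names come from the discussion above, but the exact plotting calls are my own choice:

# group the columns we will actually use
import matplotlib.pyplot as plt

numeric_columns = ['HighSchoolGPA', 'AvgDailyStudyTime', 'TotalAbsence', 'FamilyIncome']
categorical_columns = ['Major', 'State']

# histograms for the numeric features
train[numeric_columns].hist(bins=30, figsize=(10, 6))
plt.show()

# bar charts for the categorical features
for col in categorical_columns:
    train[col].value_counts().plot(kind='bar', title=col)
    plt.show()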

Processing pipeline

Based on what we learned from this small exploratory analysis, we will build the pipeline as follows:
FamilyIncome: log transform -> standardization -> imputation
HighSchoolGPA, AvgDailyStudyTime, and TotalAbsence: standardization -> imputation
Major and State: one hot encoder
The pipeline construction in SKLearn is shown below. This time, we will stop at building the pipeline without fitting or transforming with it. The reason is that we will add our linear regression model on top of this pipeline next. The final pipeline in this step is processing_pipeline.
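Here is a sketch of what processing_pipeline could look like. The transformer order follows the steps listed above, while the specific log function (log1p), the imputation strategy, and the step names are my assumptions:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# FamilyIncome: log transform -> standardization -> imputation
income_pipeline = Pipeline([
    ('log', FunctionTransformer(np.log1p)),   # assuming log1p for the log transform
    ('scale', StandardScaler()),
    ('impute', SimpleImputer(strategy='mean')),
])

# other numeric columns: standardization -> imputation
numeric_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('impute', SimpleImputer(strategy='mean')),
])

# put everything together, with one-hot encoding for the categorical columns
processing_pipeline = ColumnTransformer([
    ('income', income_pipeline, ['FamilyIncome']),
    ('numeric', numeric_pipeline, ['HighSchoolGPA', 'AvgDailyStudyTime', 'TotalAbsence']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['Major', 'State']),
])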

Linear regression model

Building the pipeline

Given how much we discussed linear regression previously, you may be surprised here: it takes only three more Python statements to add our model on top of the processing pipeline and train it. Specifically, after importing the model, we create another Pipeline with step one being the processing_pipeline and step two being a new LinearRegression model. Finally, we call fit() and give it the train data as well as the target to train the pipeline and the model. And that is it. Now we have a fully trained pipeline that preprocesses any input data, then performs linear regression and gives us the predictions. SKLearn even visualizes the complete modeling pipeline for us, as you can see below!
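A sketch of those three statements, assuming the target column is named GPA (the actual column name may differ in the notebook):

from sklearn.linear_model import LinearRegression

# stack the linear regression model on top of the processing pipeline
linear_reg_pipeline = Pipeline([
    ('processing', processing_pipeline),
    ('model', LinearRegression()),
])

# fit() runs the preprocessing and trains the model in one call
linear_reg_pipeline.fit(train, train['GPA'])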

Model evaluation

What do we do after training a model? Evaluate it, of course. I do want to emphasize that, at this point, we still have not learned how to properly evaluate predictive models, so do not just stop here and try to predict stock prices with linear regression! Instead, we will perform a basic evaluation on the training data. If you remember, we can calculate the training MSE, which is 0.043, and the RMSE, 0.208. So, on average, this model's predictions are off by 0.208 points from the true GPAs of the students in the training data. Not too bad, right? As it turns out, it is rather difficult to draw conclusions from MSE and RMSE alone because they scale with the target. For example, if we try to predict incomes, MSE and RMSE can reach the tens of thousands, which is just difficult to interpret.
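For reference, the training metrics can be computed roughly like this (again assuming a GPA target column):

from sklearn.metrics import mean_squared_error

# predictions and errors on the training data
train_predictions = linear_reg_pipeline.predict(train)
train_mse = mean_squared_error(train['GPA'], train_predictions)
train_rmse = np.sqrt(train_mse)
print(train_mse, train_rmse)   # roughly 0.043 and 0.208 in this post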

That brings us to another very commonly used metric for evaluating regression models: R2 (r-squared). The calculation of R2 is a bit more involved than MSE and RMSE, so I will not discuss it here. In short, R2 represents the proportion of variation in the target that a model can explain, and it caps at 1.0. For example, an R2 of 0.9 means the model can explain 90% of the variation in the target, while an R2 of 0.1 means the model can only explain 10%. On the training data, our model gets an R2 of 0.858, meaning it can explain 85.8% of the variation in GPA. Now we know for sure that this model is fairly good!
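R2 on the training data can be computed with r2_score from sklearn.metrics, for example:

from sklearn.metrics import r2_score

# r-squared on the training data, using the predictions computed above
train_r2 = r2_score(train['GPA'], train_predictions)
print(train_r2)   # about 0.858 in this post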

Inference

If we like our model, we can now use it to make inferences on new data, which in this case is simulated by our test set. Since we already have a pipeline, this turns out to be extremely easy: just call predict() on our linear_reg_pipeline, and we get predictions for any new data! Since we still have the true target in the testing data, we can perform an evaluation there too. The model's MSE and R2 on the testing data are 0.042 and 0.846, which are still very good. As we will learn a bit later, results on the testing data are more appropriate for evaluating models than those on the training data, but let us leave that for later.
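The inference and evaluation step could look like this (still assuming the GPA target column):

from sklearn.metrics import mean_squared_error, r2_score

# predict() runs the preprocessing and the linear regression model in one call
test_predictions = linear_reg_pipeline.predict(test)

# evaluate against the true targets in the testing data
test_mse = mean_squared_error(test['GPA'], test_predictions)
test_r2 = r2_score(test['GPA'], test_predictions)
print(test_mse, test_r2)   # about 0.042 and 0.846 in this post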

Conclusion

This is quite a long post, so I should probably stop here. To sum up, we have gone through the process of building a pipeline from data preprocessing to applying a linear regression model at the end. If you have a good understanding of pipelines and linear regression, you should have no problems following along. Anyway, see you again soon!
