Hopefully, after the previous post, you have a good idea of what regression analysis is. With that, we will start looking into the big collection of regression models. And like learning anything else, we start as basic as possible, to get a grasp on the problem first. So, in this post, I will introduce you to the simple linear regression model, the most elementary case and method in regression analysis. Let us start right away!
Simple Linear Regression
As we now know, regression is the problem of analyzing the relationship between numeric targets and other features in data. Narrowing it down to the most basic case, simple means that we analyze data with a single feature and a single target. And, linear means that we assume that the relationship between the feature and the target is linear. In other words, the relationship is representable by a linear function y = ax + b, with y being the target, x being the feature, and a and b being constant numbers.
To solve a simple linear regression task, we train a linear regression model. Long story short, it means we ask SKLearn to learn the constants a and b in the equation y = ax + b for us. The learned equation is the representation of the linear model that we obtain from the analysis. We usually refer to a as the model coefficient, and b as the model intercept.
If you remember, in the previous post, we illustrated a regression analysis with the problem of predicting test scores using study times. This is a simple linear regression task. First, there is a single feature, studytime, and a single target, testscore. Furthermore, the equation in our example is testscore = 53 + 5*studytime, which is a linear equation. So, testscore = 53 + 5*studytime is the linear regression model for the given problem.
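To make this concrete, here is a tiny sketch that plugs a study time into the example equation above. The numbers come from the previous post's example, and the study time of 4 is purely hypothetical; this is not a trained model, just arithmetic.
intercept = 53    # b: the expected score with zero study time
coefficient = 5   # a: points gained per extra unit of study time
studytime = 4     # a hypothetical study time
testscore = intercept + coefficient * studytime
print(testscore)  # 53 + 5*4 = 73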
Data for demonstration
In this post, I will use the study-score.csv file. It is tiny, with only 19 rows and two columns, studytime and testscore. And, we again try to build a linear regression model that estimates testscore based on studytime. The complete Jupyter notebook for this post is available here. Now let us start.
Just like usual, we begin by importing libraries and then loading the data. Do you remember that train-test splitting is pretty much mandatory in predictive analysis? Well, we will not do that here, because this is just a small illustrative analysis. Anyway, back to our task. Next, we check info() and draw histograms of the columns. Because this is a regression task, we also draw a scatter plot using plot.scatter() with two options: x= for the feature and y= for the target of our analysis. The scatter plot lets us examine the correlation between our target and feature.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('study-score.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   studytime  19 non-null     float64
 1   testscore  19 non-null     int64
dtypes: float64(1), int64(1)
memory usage: 432.0 bytes
data.hist(bins=6, figsize=(7,3))
plt.show()
data.plot.scatter(x='studytime', y='testscore', figsize=(3,3))
plt.show()
In terms of results, info() shows no issues with missing data. The histograms are quite jagged because of the tiny data size, but they still suggest enough symmetry. Overall, we do not have to perform any processing for this data. Now, let us look at the scatter plot. The pattern resembles a straight line, so we do have a strong linear correlation here. This is an indicator that using linear regression is a good idea. So, let us move on and do that.
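Before that, if you want a number to back up the visual impression, pandas can compute the correlation between the two columns. This is an optional sanity check, not part of the original walkthrough; a value close to 1 (or -1) confirms a strong linear relationship.
# optional: Pearson correlation between studytime and testscore
print(data.corr())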
Fitting a simple linear regression model
How do we do that? Of course, we ask SKLearn! The model class to import is LinearRegression, which is simple enough. But we have to do one thing first, and that is to split the feature and the target into separate variables. While doing this is not strictly mandatory, it will ease our job a lot later on. Plus, it is super easy to do, so let us just do that now. I will call the feature dataX and the target dataY. After splitting the feature and the target, we can create our model linear_reg and train it with fit(). For a linear model, fit() takes two inputs: the feature dataX and the target dataY. After the call to fit(), we have successfully trained our linear_reg model!
from sklearn.linear_model import LinearRegression
dataX = data[['studytime']]
dataY = data[['testscore']]
linear_reg = LinearRegression()
linear_reg.fit(dataX, dataY)
LinearRegression()
So where is the equation y = ax + b that we talked about? It is inside our linear_reg model. We can get a and b using the properties coef_ and intercept_ like below. Rounding the numbers, we can now write our equation as testscore = 7.37*studytime + 23.17.
linear_reg.intercept_
array([23.17244975])
linear_reg.coef_
array([[7.37410507]])
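As a quick sanity check (just a sketch, not part of the original notebook), we can plug a hypothetical study time into the learned equation by hand and compare it with what predict() returns. The two numbers should match.
new_point = pd.DataFrame({'studytime': [5]})            # a hypothetical study time of 5
print(linear_reg.intercept_ + linear_reg.coef_ * 5)     # by hand: b + a*5
print(linear_reg.predict(new_point))                    # the model gives the same value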
Model evaluation
What comes after training? Evaluating! However, model evaluation is not really a simple task and will require a few posts to discuss its basics. So, for now, we will settle for evaluating this model by checking whether it makes good predictions on the given data.
And, that brings us to an evaluation metric called Mean Squared Error (MSE). It is the average of the squared errors between the true target and the predicted target. Calculating this is fairly easy. First, we use the trained model to make predictions on the data. Next, we take the differences between the true targets and the predicted ones, square them, and then average them. The final result is the MSE of the model on the given data. Below is one illustrative example.
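The sketch below walks through those steps with made-up numbers; the values are purely hypothetical and only meant to illustrate the calculation.
true_scores = np.array([70, 55, 90])   # hypothetical true test scores
pred_scores = np.array([65, 60, 88])   # hypothetical predicted test scores
errors = true_scores - pred_scores     # differences: [5, -5, 2]
squared = errors ** 2                  # squared errors: [25, 25, 4]
mse = squared.mean()                   # average of the squared errors
print(mse)                             # 18.0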
If you are wondering why we square the errors, it is because errors can be either positive or negative. So, averaging them directly will, most of the time, yield a result very close to 0, which is misleading. Therefore, we square them before taking their mean. MSE is an error metric, so a lower MSE means better predictions. However, you can only compare the MSE of models on similar data!
In SKLearn
As easy as it is, SKLearn still has a function for us! We will use the mean_squared_error() function imported from sklearn.metrics. This function takes two inputs: the true target and the predicted version. Therefore, we first generate the predicted data predY using predict() from linear_reg, with the feature dataX as input. And finally, we can print the MSE of this model on the given data, which is about 64.4 in this case.
predY = linear_reg.predict(dataX)
from sklearn.metrics import mean_squared_error, r2_score
print('Mean squared error:', mean_squared_error(dataY, predY))
Mean squared error: 64.36932608588066
For easier interpretation, we can take its square root, which is around 8.02. This is called the Root Mean Squared Error (RMSE). An RMSE of 8.02 means that, on average, this model's predictions are off by about 8.02 points from the true test scores. Do you think that is good? Probably; given that test scores range from 20 to 100, predictions with an average error of 8.02 points are quite okay!
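In code, the RMSE is just one extra step on top of the MSE we already computed. Here is a small sketch using NumPy's square root.
rmse = np.sqrt(mean_squared_error(dataY, predY))   # square root of the MSE
print('Root mean squared error:', rmse)            # about 8.02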
Scatter plot with the regression line
I hope you remember that a linear equation y = ax + b can be drawn as a straight line on a two-axis coordinate system. Why does that matter? Because it means we can visualize this model on the same scatter plot of the feature and the target. As we have already created the predictions predY, we simply use pyplot.plot() to draw the prediction line. We also draw the scatter plot using pyplot.scatter(). Both functions take the feature as their first input and the true or predicted target as their second. As you can see, our prediction line lies very nicely in the middle of the points and captures the most dominant pattern. This is also why a straight-line pattern in a scatter plot indicates a strong linear correlation.
plt.figure(figsize=(3,3))
plt.scatter(dataX, dataY)
plt.plot(dataX, predY, c='red')
plt.show()
Wrapping up
In this post, we have discussed the concept of simple linear regression, as well as attempted a small analysis with a linear model. So, what else? A lot! And a lot more! There are so many things to explore in data analytics, and we have just taken a tiny step. Regardless, let us stop here for now, since even that tiny step takes some time to absorb. Let us return next time and continue with linear regression. See you again!