At this point, we have spent several posts talking about regression analysis. While that is surely not everything there is to regression, let us change the air a little bit and discuss classification analysis. Classification also belongs to supervised learning, a branch of analytics that quantifies the associations of features with a target. However, unlike regression, where targets are truly numeric, classification deals with targets that are categories or classes. So, let us jump in and learn this interesting task right away!
Illustrative example of classification analysis
We will start the discussion with a very simple example of classification analysis. I will use a small data set, testresult.csv, that has two columns, StudyTime and Passed. Here, our task is to build a model that predicts whether a student passed the test from their study time. You can access the complete notebook here. First, let us take a quick look at the data.
The data is clean, so I will not use info(). Instead, we look at histograms. First, StudyTime ranges from just under one hour to a bit over nine hours. Passed is coded as numbers, 0 for failed and 1 for passed, with a proportion of about 60-40, respectively. This is a typical case of binary classification, where the target has exactly two unique classes. It is also very typical to encode the binary classes as 0 and 1, which are actually meaningful numbers. So, we can further look at the scatter plot of StudyTime and Passed below. There is surely a strong correlation here: very low StudyTime almost certainly leads to a fail, and very high StudyTime all but guarantees a pass.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
data = pd.read_csv('testresult.csv')
data.head(n=2)
| | StudyTime | Passed |
|---|---|---|
| 0 | 1.000000 | 0 |
| 1 | 1.517177 | 0 |
data.hist(figsize=(10,4))
plt.show()
plt.scatter(data['StudyTime'],data['Passed'])
plt.show()
Attempting linear regression
Since the target is somewhat numeric, we can absolutely try linear regression, and the model fits just fine. However, there are several reasons why linear regression is not good for binary classification. Some of them are very technical, and I will not discuss them here. Instead, let us observe its practical issues.
First, the output of a linear model is unbounded, meaning it can range from very negative numbers to very positive ones. With our target being either 0 or 1, predictions like -140 or 331 are not too meaningful. Now, we can manually process the outputs, for example, setting everything below 0.5 to 0 and everything above 0.5 to 1 to get suitable predictions. Still, the transition of values between 0 and 1 is too "rough" or "fast". From the scatter plot, we can observe that the model outputs values up to 0.4, too close to the 0.5 threshold, when StudyTime has just reached 4 hours. This is too soon, because all students with these study times still failed.
linear_reg = LinearRegression()
linear_reg.fit(data[['StudyTime']], data['Passed'])
LinearRegression()
line_x = np.linspace(1,9,500)
line_y = linear_reg.predict(line_x.reshape(-1,1))
plt.scatter(data['StudyTime'],data['Passed'])
plt.plot(line_x,line_y,c='red',linewidth=3)
plt.show()
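To turn the unbounded linear outputs into class labels, we can apply the manual 0.5 cutoff described above. Here is a minimal sketch of that idea, using a small hypothetical stand-in for the StudyTime/Passed data rather than testresult.csv itself:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical study times and pass/fail labels standing in for testresult.csv
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.5], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

linear_reg = LinearRegression().fit(X, y)
raw = linear_reg.predict(X)          # unbounded real-valued outputs
labels = (raw >= 0.5).astype(int)    # manual 0.5 cutoff -> 0/1 class labels
print(labels)
```

This makes the predictions usable as classes, but it does nothing about the "rough" transition of the underlying line, which is the problem logistic regression addresses next.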
So, let us try logistic regression, a dedicated classification model, instead.
Logistic regression
Logistic regression is developed for binary classification and is the go-to model for this task. We will spend another post discussing the technical details of this model. Here, let us just observe how it differs from linear models.
As you can see from the scatter plot, logistic regression fits a much smoother transition from 0 to 1. At around 4 hours of study time, the predictions are still very close to 0, and they approach 1 very quickly around 7 hours. Furthermore, the output of a logistic model is bounded between 0 and 1, giving it an interpretation as a probability. Specifically, the direct output of a logistic model is the probability of an instance belonging to class 1. So, we can assign class 1 to those with probability above 0.5, and class 0 to those under. So, what is next? Of course, evaluating the model that we have just trained.
logistic_reg = LogisticRegression()
logistic_reg.fit(data[['StudyTime']], data['Passed'])
LogisticRegression()
line_x = np.linspace(1,9,500)
line_y = logistic_reg.predict_proba(line_x.reshape(-1,1))[:,1]
plt.scatter(data['StudyTime'],data['Passed'])
plt.plot(line_x,line_y,c='red',linewidth=3)
plt.show()
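The 0.5 rule on probabilities is exactly what SKLearn's predict() does for us. The sketch below, again on hypothetical stand-in data rather than testresult.csv, checks that thresholding predict_proba() ourselves gives the same labels as predict():

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical study times and pass/fail labels standing in for testresult.csv
X = np.array([[1.0], [2.5], [4.0], [5.5], [7.0], [8.5]])
y = np.array([0, 0, 0, 1, 1, 1])

logistic_reg = LogisticRegression().fit(X, y)

proba = logistic_reg.predict_proba(X)[:, 1]   # P(class 1), bounded in (0, 1)
manual = (proba > 0.5).astype(int)            # apply the 0.5 rule ourselves
builtin = logistic_reg.predict(X)             # sklearn applies the same rule
print(np.array_equal(manual, builtin))
```

Keeping the probabilities around, instead of only the hard labels, is also useful later when we want finer-grained evaluation than a single cutoff.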
Evaluation metrics
Classification has arguably more evaluation metrics than regression. Let us just get familiar with the most common one, accuracy score.
Accuracy
The accuracy score is the proportion of instances with correct predictions. Yes, literally, we count their number and then divide by the total number of instances, and that is the accuracy score. Of course, you do not have to count it yourself, because SKLearn has a function for it – accuracy_score(). We feed the actual target and its predictions to the function to get the accuracy value, which is 0.925, or 92.5%, in this case. Accuracy is between 0 and 1, and values closer to 1 mean better predictions.
y_pred = logistic_reg.predict(data[['StudyTime']])
from sklearn.metrics import accuracy_score
accuracy_score(data['Passed'], y_pred)
0.925
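To make the count-and-divide definition concrete, here is a small sketch that computes accuracy by hand on some hypothetical labels and predictions (not the testresult.csv results above):

```python
import numpy as np

# Hypothetical actual labels and predictions, just to illustrate the formula
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# accuracy = (number of correct predictions) / (total number of instances)
correct = (y_true == y_pred).sum()   # 7 predictions match here
accuracy = correct / len(y_true)     # 7 / 8 = 0.875
print(accuracy)
```

This hand computation agrees with what accuracy_score(y_true, y_pred) returns on the same arrays.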
Conclusion
In this post, we got exposed to classification analysis, logistic regression, and the most common evaluation metric, accuracy. In general, classification is just as important as regression in data analytics, so a good understanding of this subject is very important. Next, we will discuss logistic regression in more technical detail. So, see you again in the next one!