an illustration on evaluation metrics for binary classification models

So far, we have discussed the task of classification analysis along with a few version of logistic regression. We have also talked a bit about evaluating a classification task with accuracy score. However, classification is a bit more complicated than regression in that accuracy sometimes cannot completely reflect whether a model is good or bad. For such reason, I will spend this post discussing more evaluation metrics for classification. As usual, we start simple with evaluating models in binary problems.

Data in demonstration

I will use a modified version of the maternal health risk data set which is available on the UCI machine learning repository. I did a bit changes so that we have a binary classification problem that shows the disadvantage of accuracy score. The complete notebook is available in my GitHub.

A first look into the data is as below. We have Age, SystolicBP, DiastolicBP, BS, BodyTemp, and HeartRate of the mothers at labor, and AtRisk as the binary target indicating whether they were at risk of maternal mortality (1) or not (0). All columns are numeric.

In [1]:

import pandas as pd
import numpy as np

data = pd.read_csv('Maternal Health Risk Data Set.csv')
data.head(n=2)

Out[1]:

	Age	SystolicBP	DiastolicBP	BS	BodyTemp	HeartRate	AtRisk
0	16	120	75	7.9	98.0	70	0
1	16	120	75	7.9	98.0	70	0

In [2]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1014 entries, 0 to 1013
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          1014 non-null   int64  
 1   SystolicBP   1014 non-null   int64  
 2   DiastolicBP  1014 non-null   int64  
 3   BS           1014 non-null   float64
 4   BodyTemp     1014 non-null   float64
 5   HeartRate    1014 non-null   int64  
 6   AtRisk       1014 non-null   int64  
dtypes: float64(2), int64(5)
memory usage: 55.6 KB

Next, we perform a train-test split and examine the histograms. Overall, there are some skewness but nothing too serious. Because we are not optimizing any models in this post, let us just perform a scaling on all features for preprocessing. After this point, we have the data ready for modeling.

In [3]:

from sklearn.model_selection import train_test_split

target = 'AtRisk'
X = data.drop(target, axis=1)
y = data[target]
trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2)

In [4]:

import matplotlib.pyplot as plt
trainX.hist(figsize=(6,8), bins=10)
plt.show()

In [5]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

trainX_prc = scaler.fit_transform(trainX)
testX_prc = scaler.transform(testX)  

Logistic model

To simplify the discussion on evaluation metrics, I am just gonna use a L2-regularized logistic regression. Though, let us still finetune it to avoid a very bad model. The progress here is pretty standard – create a parameter grid, a model, a grid search which is then fitted. Next, we extract the best logistic model with the best_estimator_ property from the grid search, and make predictions for the training and testing data as trainY_pred and testY_pred. For the training data, we use cross_val_predict to avoid overfitted results. You may also notice that I create a trainY_0. This is an array of 0 , a dummy prediction that assumes everything to belong to class 0.

In [6]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]
logistic = LogisticRegression(penalty='l2', max_iter=5000)

grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

Out[6]:

GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=5000),
             param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10,
                                50, 100]}],
             return_train_score=True, scoring='accuracy')

In [7]:

best_logistic = grid_search.best_estimator_

In [8]:

from sklearn.model_selection import cross_val_predict

trainY_pred = cross_val_predict(best_logistic, trainX_prc, trainY, cv=5)
trainY_0 = np.zeros(trainY.shape[0])
testY_pred = best_logistic.predict(testX_prc)

Evaluation metrics for binary classification

At this point, we are ready to discuss some evaluation metrics for binary classification. I will import all the metrics first for convenience.

In [9]:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

Accuracy score

Let us review the accuracy score. The CV accuracy in the training data is 85.45% and testing data 82.76%, not bad right? Maybe, but also, maybe not.

In [10]:

accuracy_score(trainY, trainY_pred)

Out[10]:

0.8545006165228114

In [12]:

accuracy_score(testY, testY_pred)

Out[12]:

0.8275862068965517

The problem with accuracy is that, in this data, the proportion of classes in the target is unbalanced. More specifically, the rate of one class in the target is much higher/lower than that of the other. In this training data, 73.24% has class 0. This means, I can predict that all patients belong to class 0 and get that 73.24% accuracy without any models. This is evidenced in the test below.

In [11]:

accuracy_score(trainY, trainY_0)

Out[11]:

0.7324290998766955

Is accuracy still a good measurement to evaluate our model? We went through all the troubles to get only about 12% better accuracy than a random guess. That certainly does not sound very good. So, let dig a bit deeper into the predictions and the true values with the confusion table.

Confusion table

In binary classification, we usually refer to the class of interest in the target as positive, and the other class negative. In this case, we can consider AtRisk values of 1 as positive, and 0 negative. A confusion table (also called confusion matrix) provides these following four numbers:
1. True Positive (TP): number of rows of which actual values are positive and also get predicted as positive
2. True Negative (TN): number of rows of which actual values are negative and also get predicted as negative
3. False Positive (FP): number of rows of which actual values are negative but get predicted as positive
4. False Negative (FN): number of rows of which actual values are positive but get predicted as negative

In SKLearn, we use the confusion_matrix() function to obtain these numbers like the example below. The location for each category is as on the right side (only in SKLearn, other tools may order them differently). Overall, in the training data, 569 rows are correctly predicted as negative, 124 rows are correctly predicted as positive, 25 misclassified from negative to positive, and 93 positive to negative.

In [13]:

confusion_matrix(trainY, trainY_pred)

Out[13]:

array([[569,  25],
       [ 93, 124]], dtype=int64)

True Negative	False Positive
False Negative	True Positive

How about the confusion table of the dummy prediction? As everything gets predicted as 0, we have 594 TN, 217 FN, 0 TP, and 0 FP. Now, we start seeing some issues with this type of predictions. Specifically, there are 0 rows detected as positive which is the class of interests. Next, let us discuss two measurements that really signify this problem.

In [14]:

confusion_matrix(trainY, trainY_0)

Out[14]:

array([[594,   0],
       [217,   0]], dtype=int64)

Precision and Recall

These are the two measurements focusing on evaluate the prediction quality for the positive class. Precision measures the rate of rows predicted as positive are indeed positive, and Recall the rate of positive rows are predicted as positive. Their formulas are as follows.

$precision = \dfrac{TP}{TP + FP}$

$recall= \dfrac{TP}{TP + FN}$

In SKLearn, we use precision_score() and recall_score() to obtain these two measurements. The precision and recall of our L2 logistic model in the training data are 0.832 and 0.571, respectively. The dummy prediction, however, gets a big warning about precision not defined (since its denominator TP+FP is 0) and a 0 recall. Now, we see a very big difference between having and not having a model. However, examining two metrics could be inconvenient at times. So, let us talk about a last measurement, the F1 score.

In [16]:

precision_score(trainY, trainY_pred)

Out[16]:

0.8322147651006712

In [19]:

recall_score(trainY, trainY_pred)

Out[19]:

0.5714285714285714

In [17]:

precision_score(trainY, trainY_0)

C:\Python\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Out[17]:

0.0

In [20]:

recall_score(trainY, trainY_0)

Out[20]:

0.0

F1 score

F1 score is a combination of precision and recall. It ranges from 0 to 1, the closer to 1, the better a model, vice versa, the closer to 0, the worse a model. Its formula is as follows.

$F1 = \dfrac{2\times precision \times recall}{precision + recall}$

In SKLearn, we use f1_score() to obtain this measurement. Below are the some examples. The F1 scores of our model in the training data is 0.678 and testing data 0.607. In contrast, the dummy prediction gets a 0 F1 in the training data, correctly reflecting that it does nothing useful.

In [22]:

f1_score(trainY, trainY_pred)

Out[22]:

0.6775956284153005

In [23]:

f1_score(trainY, trainY_0)

Out[23]:

0.0

In [24]:

f1_score(testY, testY_pred)

Out[24]:

0.6067415730337079

So why don’t we just use F1 instead of accuracy all the times? Well, you certainly can. It is really up to your preferences. However, accuracy is not that useless. For one thing, it is a lot easier to understand than F1 score. An accuracy of 90% means that your model assigned the correct class to 90% of the instances. On the other hand, a F1 of 0.9 is more difficult to interpret other than you have a very good model. Furthermore, in data that is more balanced, accuracy is totally okay to use.

Conclusion

In this post, we have discussed several evaluation metrics for binary classification models. There are still a lot more, but these are fairly common that you are more likely to see when browsing around. We will probably explore more metrics in the future, but for now, accuracy and F1 are plenty to use. So, I will conclude this post here. See you next time!