So far, we have discussed the task of classification along with a few versions of logistic regression. We have also talked a bit about evaluating a classification task with the accuracy score. However, classification is a bit more complicated than regression in that accuracy sometimes cannot fully reflect whether a model is good or bad. For this reason, I will spend this post discussing more evaluation metrics for classification. As usual, we start simple by evaluating models on binary problems.
Data for demonstration
I will use a modified version of the maternal health risk data set, which is available on the UCI machine learning repository. I made a few changes so that we have a binary classification problem that shows the disadvantage of the accuracy score. The complete notebook is available on my GitHub.
A first look at the data is below. We have Age, SystolicBP, DiastolicBP, BS, BodyTemp, and HeartRate of the mothers at labor, and AtRisk as the binary target indicating whether they were at risk of maternal mortality (1) or not (0). All columns are numeric.
import pandas as pd
import numpy as np
data = pd.read_csv('Maternal Health Risk Data Set.csv')
data.head(n=2)
| | Age | SystolicBP | DiastolicBP | BS | BodyTemp | HeartRate | AtRisk |
|---|---|---|---|---|---|---|---|
| 0 | 16 | 120 | 75 | 7.9 | 98.0 | 70 | 0 |
| 1 | 16 | 120 | 75 | 7.9 | 98.0 | 70 | 0 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1014 entries, 0 to 1013
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Age          1014 non-null   int64
 1   SystolicBP   1014 non-null   int64
 2   DiastolicBP  1014 non-null   int64
 3   BS           1014 non-null   float64
 4   BodyTemp     1014 non-null   float64
 5   HeartRate    1014 non-null   int64
 6   AtRisk       1014 non-null   int64
dtypes: float64(2), int64(5)
memory usage: 55.6 KB
Next, we perform a train-test split and examine the histograms. Overall, there is some skewness but nothing too serious. Because we are not heavily optimizing any models in this post, let us just apply standard scaling to all features as preprocessing. Note that the scaler is fitted on the training data only and then applied to the test data, so no information leaks from the test set. After this point, we have the data ready for modeling.
from sklearn.model_selection import train_test_split
target = 'AtRisk'
X = data.drop(target, axis=1)
y = data[target]
trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2)
import matplotlib.pyplot as plt
trainX.hist(figsize=(6,8), bins=10)
plt.show()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
trainX_prc = scaler.fit_transform(trainX)
testX_prc = scaler.transform(testX)
Logistic model
To simplify the discussion on evaluation metrics, I am just going to use an L2-regularized logistic regression. Though, let us still fine-tune it to avoid a very bad model. The process here is pretty standard: create a parameter grid, a model, and a grid search, which is then fitted. Next, we extract the best logistic model with the best_estimator_ property from the grid search, and make predictions for the training and testing data as trainY_pred and testY_pred. For the training data, we use cross_val_predict to avoid overfitted results. You may also notice that I create a trainY_0. This is an array of zeros, a dummy prediction that assumes everything belongs to class 0.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]
logistic = LogisticRegression(penalty='l2', max_iter=5000)
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)
grid_search.fit(trainX_prc,trainY)
GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=5000), param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]}], return_train_score=True, scoring='accuracy')
best_logistic = grid_search.best_estimator_
from sklearn.model_selection import cross_val_predict
trainY_pred = cross_val_predict(best_logistic, trainX_prc, trainY, cv=5)
trainY_0 = np.zeros(trainY.shape[0])
testY_pred = best_logistic.predict(testX_prc)
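As a side note, if you are curious which regularization strength C the grid search picked, you can inspect it with the snippet below. I will not quote the value here since it depends on the random train-test split.

grid_search.best_params_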
Evaluation metrics for binary classification
At this point, we are ready to discuss some evaluation metrics for binary classification. I will import all the metrics first for convenience.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
Accuracy score
Let us review the accuracy score. The CV accuracy in the training data is 85.45% and in the testing data 82.76%. Not bad, right? Maybe, but also, maybe not.
accuracy_score(trainY, trainY_pred)
0.8545006165228114
accuracy_score(testY, testY_pred)
0.8275862068965517
The problem with accuracy is that, in this data, the proportion of classes in the target is imbalanced. More specifically, the rate of one class in the target is much higher/lower than that of the other. In this training data, 73.24% of the rows have class 0. This means I can predict that all patients belong to class 0 and get that 73.24% accuracy without any model. This is evidenced in the test below.
accuracy_score(trainY, trainY_0)
0.7324290998766955
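If you want to verify the class balance directly, pandas can show the class proportions of the training target; a snippet like the one below should confirm that roughly 73% of the rows belong to class 0.

trainY.value_counts(normalize=True)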
Is accuracy still a good measurement to evaluate our model? We went through all that trouble to get only about 12 percentage points better accuracy than simply predicting the majority class every time. That certainly does not sound very good. So, let us dig a bit deeper into the predictions and the true values with the confusion table.
Confusion table
In binary classification, we usually refer to the class of interest in the target as positive, and the other class as negative. In this case, we consider AtRisk values of 1 as positive and 0 as negative. A confusion table (also called a confusion matrix) provides the following four numbers:
1. True Positive (TP): number of rows whose actual value is positive and which are also predicted as positive
2. True Negative (TN): number of rows whose actual value is negative and which are also predicted as negative
3. False Positive (FP): number of rows whose actual value is negative but which are predicted as positive
4. False Negative (FN): number of rows whose actual value is positive but which are predicted as negative
In SKLearn, we use the confusion_matrix() function to obtain these numbers, as in the example below. The position of each category in the output is shown in the layout table right after it (this layout is specific to SKLearn; other tools may order the categories differently). Overall, in the training data, 569 rows are correctly predicted as negative, 124 rows are correctly predicted as positive, 25 are misclassified from negative to positive, and 93 from positive to negative.
confusion_matrix(trainY, trainY_pred)
array([[569,  25],
       [ 93, 124]], dtype=int64)
| | Predicted 0 (negative) | Predicted 1 (positive) |
|---|---|---|
| Actual 0 (negative) | True Negative | False Positive |
| Actual 1 (positive) | False Negative | True Positive |
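If you ever need these four counts as separate numbers, a common way (sketched below) is to flatten SKLearn's matrix with ravel(), which returns them in the TN, FP, FN, TP order of the layout above. This also lets us recompute accuracy from the confusion table, since accuracy is simply (TP + TN) divided by the total number of rows.

tn, fp, fn, tp = confusion_matrix(trainY, trainY_pred).ravel()
# (TP + TN) / total should recover the 85.45% CV accuracy from earlier
(tp + tn) / (tp + tn + fp + fn)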
How about the confusion table of the dummy prediction? As everything gets predicted as 0, we have 594 TN, 217 FN, 0 TP, and 0 FP. Now, we start seeing some issues with this type of prediction. Specifically, 0 rows are detected as positive, which is the class of interest. Next, let us discuss two measurements that really highlight this problem.
confusion_matrix(trainY, trainY_0)
array([[594,   0],
       [217,   0]], dtype=int64)
Precision and Recall
These are two measurements that focus on evaluating the prediction quality for the positive class. Precision measures the proportion of rows predicted as positive that are indeed positive, and recall the proportion of positive rows that are predicted as positive. Their formulas are as follows.
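$$\text{Precision} = \frac{TP}{TP + FP} \qquad\qquad \text{Recall} = \frac{TP}{TP + FN}$$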
In SKLearn, we use precision_score() and recall_score() to obtain these two measurements. The precision and recall of our L2 logistic model in the training data are 0.832 and 0.571, respectively. The dummy prediction, however, gets a big warning about precision not being defined (since its denominator TP + FP is 0) and a recall of 0. Now, we see a very big difference between having and not having a model. However, examining two metrics could be inconvenient at times. So, let us talk about one last measurement, the F1 score.
precision_score(trainY, trainY_pred)
0.8322147651006712
recall_score(trainY, trainY_pred)
0.5714285714285714
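As a quick sanity check, these values match the confusion-matrix counts from earlier: precision is 124 / (124 + 25) ≈ 0.832 and recall is 124 / (124 + 93) ≈ 0.571.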
precision_score(trainY, trainY_0)
C:\Python\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result))
0.0
recall_score(trainY, trainY_0)
0.0
F1 score
The F1 score is a combination of precision and recall. It ranges from 0 to 1: the closer to 1, the better the model, and vice versa, the closer to 0, the worse the model. Its formula is as follows.
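$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$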
In SKLearn, we use f1_score() to obtain this measurement. Below are some examples. The F1 score of our model is 0.678 in the training data and 0.607 in the testing data. In contrast, the dummy prediction gets an F1 of 0 in the training data, correctly reflecting that it does nothing useful.
f1_score(trainY, trainY_pred)
0.6775956284153005
f1_score(trainY, trainY_0)
0.0
f1_score(testY, testY_pred)
0.6067415730337079
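As another sanity check, F1 can be computed directly from the confusion-table counts as 2TP / (2TP + FP + FN); for the training data that is 248 / 366 ≈ 0.678, matching the value above.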
So why don’t we just use F1 instead of accuracy all the time? Well, you certainly can. It is really up to your preference. However, accuracy is not that useless. For one thing, it is a lot easier to understand than the F1 score. An accuracy of 90% means that your model assigned the correct class to 90% of the instances. On the other hand, an F1 of 0.9 is harder to interpret beyond saying you have a very good model. Furthermore, in data that is more balanced, accuracy is totally okay to use.
Conclusion
In this post, we have discussed several evaluation metrics for binary classification models. There are still a lot more, but these are the fairly common ones that you are most likely to see when browsing around. We will probably explore more metrics in the future, but for now, accuracy and F1 are plenty. So, I will conclude this post here. See you next time!