With the previous introduction to classification analysis, we can now discuss classification models in more detail. Since we had some exposure to logistic regression last time, let us continue its story. Logistic regression is pretty much the linear model of classification: at its core it is indeed linear, and it has a similar complexity to linear regression. So, without further ado, let us jump right in!
Logistic regression
First, let us recall that the equation of a simple linear model is y = ax + b, with y being the target, x being the single feature, and a and b being the coefficient and intercept, respectively. This equation gives us the straight-line pattern in a 2-dimensional visualization as in the left figure below. Furthermore, y is unbounded in linear regression. Previously, we have seen that logistic regression predictions instead form an S curve like in the right figure below.
Prediction of a linear regression model
Prediction of a logistic regression model
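If you would like to reproduce plots in this style yourself, here is a minimal sketch with an arbitrary coefficient and intercept (a = 1, b = 0), purely for illustration:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-10, 10, 200)
a, b = 1.0, 0.0  # arbitrary coefficient and intercept, just for illustration
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(x, a * x + b)  # linear regression: an unbounded straight line
axes[0].set_title('Linear regression')
axes[1].plot(x, 1 / (1 + np.exp(-(a * x + b))))  # logistic regression: an S curve between 0 and 1
axes[1].set_title('Logistic regression')
plt.show()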
So where does that S curve come from? We just wrap the ax + b part in a function called the Sigmoid. Furthermore, we know that logistic regression predicts the probability of an instance belonging to the positive class, which is denoted P(y=1) in general. Overall, we obtain the equation below for single-feature logistic regression:
P(y=1) = 1 / (1 + e^-(ax + b))
You can easily verify that, with very extreme values of x, (ax + b) becomes either very positive or very negative, which makes e^-(ax + b) either very close to 0 or very large, and in turn, P(y=1) approaches but never passes 1 or 0. The exponential function also makes the pattern curvy like in the figure. Finally, with more features, we just add more coefficients, still inside the Sigmoid function. The general equation of logistic regression is:
P(y=1) = 1 / (1 + e^-(a1x1 + a2x2 + … + anxn + b))
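To make this concrete, here is a tiny sketch (with made-up coefficients, purely for illustration) of how a logistic model turns a feature vector into a probability:
import numpy as np
a = np.array([0.8, -1.2, 0.3])  # made-up coefficients, one per feature
b = 0.5                         # made-up intercept
x = np.array([1.0, 2.0, -0.5])  # one instance with three features
z = np.dot(a, x) + b            # the linear part, a1*x1 + a2*x2 + a3*x3 + b
p = 1 / (1 + np.exp(-z))        # the Sigmoid squashes z into (0, 1)
print(p)                        # predicted P(y=1) for this instance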
Regularized logistic regression
Like linear models, logistic models without any constraints can overfit data and learn fake patterns. To avoid this, we also regularize them. At a high level, logistic models are trained by minimizing the training classification error, and we regularize them by adding a penalty term to that objective:
minimize (training error + penalty on coefficients)
Again, exactly like linear models, there are three types of penalties: the sum of squared coefficients, the sum of absolute coefficients, and a mixture of the two. In terms of names, we call them L2 regularization, L1 regularization, and elastic-net regularization, respectively. In terms of behavior, these three act just like ridge regression, lasso, and elastic-net, but for logistic models.
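As a quick illustration, the three penalty terms are easy to write down in numpy (coef is a made-up coefficient vector; scikit-learn scales the elastic-net mixture slightly differently, but the idea is the same):
import numpy as np
coef = np.array([0.8, -1.2, 0.3])    # made-up coefficients
l2_penalty = np.sum(coef ** 2)       # ridge-style: sum of squared coefficients
l1_penalty = np.sum(np.abs(coef))    # lasso-style: sum of absolute coefficients
l1_ratio = 0.5                       # elastic-net mixing weight between the two
enet_penalty = l1_ratio * l1_penalty + (1 - l1_ratio) * l2_penalty
print(l2_penalty, l1_penalty, enet_penalty)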
Demonstration
Loading data and preliminary analysis
You can access the complete notebook on my GitHub. I will be using the heart_disease.csv data set, which originally comes from Kaggle. It consists of data from 918 patients, including demographics and some medical measurements. The target column is HeartDisease, which indicates whether the patient had heart failure or not. After loading the data, info() shows no issues with missing values or data types.
import pandas as pd
import numpy as np
data = pd.read_csv('heart_disease.csv')
data.head(n=2)
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             918 non-null    int64
 1   Sex             918 non-null    object
 2   ChestPainType   918 non-null    object
 3   RestingBP       918 non-null    int64
 4   Cholesterol     918 non-null    int64
 5   FastingBS       918 non-null    int64
 6   RestingECG      918 non-null    object
 7   MaxHR           918 non-null    int64
 8   ExerciseAngina  918 non-null    object
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object
 11  HeartDisease    918 non-null    int64
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
Next, we perform a train-test split, then investigate histograms and bar charts. At a quick look, there are no serious issues. However, upon closer investigation, we can see some 0 values in Cholesterol and RestingBP. These numbers do not make sense medically, so I will turn them into missing values and then impute with the column medians (a quick count of these zeros is shown after the plots below).
from sklearn.model_selection import train_test_split
X = data.drop('HeartDisease', axis=1)
y = data['HeartDisease']
trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2)
num_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']
import matplotlib.pyplot as plt
trainX.hist(figsize=(6,8), bins=20)
plt.show()
cat_cols = ['Sex', 'ChestPainType','RestingECG', 'ExerciseAngina', 'ST_Slope']
for col in cat_cols:
print(col)
data[col].value_counts().plot(kind='bar', figsize=(4,2))
plt.show()
(Bar charts of Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope are shown here.)
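As a quick check on the zeros mentioned above, we can count them directly in the training features:
# number of (medically implausible) zero entries in each of the two columns
print((trainX[['Cholesterol', 'RestingBP']] == 0).sum())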
Processing
We only detect the issue with 0 values in Cholesterol and RestingBP, so my pipeline is as follows
– Numeric columns: 1) turn the 0 values in Cholesterol and RestingBP into missing values => 2) impute with the median => 3) standardization
– Categorical columns: one-hot encoding
In code, the full pipeline looks like this.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
def remove_0(X):
    # work on a copy so the original dataframe is not modified in place
    X = X.copy()
    X.loc[X['Cholesterol']==0, 'Cholesterol'] = np.nan
    X.loc[X['RestingBP']==0, 'RestingBP'] = np.nan
    return X
num_pipeline = Pipeline([
('remove 0', FunctionTransformer(remove_0, validate=False)),
('impute', SimpleImputer(strategy='median')),
('standardize', StandardScaler())
])
cat_pipeline = Pipeline([
('encode', OneHotEncoder())
])
full_pipeline = ColumnTransformer([
('numeric', num_pipeline, num_cols),
('class', cat_pipeline, cat_cols)
])
trainX_prc = full_pipeline.fit_transform(trainX)
testX_prc = full_pipeline.transform(testX)
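If you want to verify that the pipeline behaves as expected, a quick optional sanity check on the transformed arrays is enough:
# both sets should end up with the same number of processed feature columns after encoding
print(trainX_prc.shape, testX_prc.shape)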
Modeling
L2 regularized logistic regression
First, let us try the L2 regularized model. We simply add penalty='l2' to get this model. Remember that this method is similar to Ridge regression, so we need to fine-tune the regularization strength parameter, which is now called C (note that C is the inverse of the strength, so smaller values mean stronger regularization). Other than that, the code is pretty much the same: create a parameter grid, then create and fit a grid search. After fitting, we can obtain the selected C from best_params_, which is 0.05. The training CV accuracy is 85.56% from best_score_, and the testing accuracy is 88.04%, obtained by calling score with the testing features and labels.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]
#the default penalty is l2 anyway, but we specify it so that the code is clear
#logistic regression is also trained iteratively, so increase max_iter if you see convergence warnings from sklearn
logistic = LogisticRegression(penalty='l2', max_iter=5000)
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)
grid_search.fit(trainX_prc,trainY)
GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=5000), param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]}], return_train_score=True, scoring='accuracy')
print(grid_search.best_params_)
print(grid_search.best_score_) #accuracy
{'C': 0.05} 0.8556425309849967
grid_search.score(testX_prc, testY) #accuracy
0.8804347826086957
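If you are curious about the fitted model itself, the refitted best estimator is stored in best_estimator_, so you can inspect its coefficients (the exact values depend on your random train-test split):
best_l2 = grid_search.best_estimator_  # the L2 model refitted on the full training set with the selected C
print(best_l2.coef_)                   # one coefficient per processed feature
print(best_l2.intercept_)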
L1 regularized logistic regression
If L2 regularization is similar to Ridge, then L1 is similar to Lasso. And that is pretty much all there is to remember about this model. In terms of usage, we change penalty to 'l1' and add solver='liblinear', since the default solver in scikit-learn does not support the L1 penalty. The rest is the same as the L2 model. In this case, our L1 model gets a training CV accuracy of 85.97% and a testing accuracy of 85.87%.
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]
#now we need to specify penalty to l1
#also, we need to set solver to 'liblinear' because the default solver doesn't support l1
logistic = LogisticRegression(penalty='l1', max_iter=5000, solver='liblinear')
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)
grid_search.fit(trainX_prc,trainY)
GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=5000, penalty='l1', solver='liblinear'), param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]}], return_train_score=True, scoring='accuracy')
print(grid_search.best_params_)
print(grid_search.best_score_) #accuracy
{'C': 0.1} 0.859742801230081
grid_search.score(testX_prc, testY) #accuracy
0.8586956521739131
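Since the L1 penalty behaves like Lasso, it tends to push some coefficients to exactly zero. You can verify this on the fitted model (the exact count depends on your split):
best_l1 = grid_search.best_estimator_
print(np.sum(best_l1.coef_ == 0))  # number of coefficients zeroed out by the L1 penalty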
Elastic-net logistic regression
Finally, elastic-net logistic regression is the same as its counterpart in linear models. It uses a mixture of the L1 and L2 penalties, which means we also need to tune the l1_ratio parameter. To use this method, we set penalty='elasticnet' and solver='saga'. This one gets a training CV accuracy of 86.24% and a testing accuracy of 85.87%.
#now penalty is changed to elasticnet
#and we need to change solver to saga
logistic = LogisticRegression(penalty='elasticnet', max_iter=5000, solver='saga')
param_grid = [{
'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100],
'l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
}]
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)
grid_search.fit(trainX_prc,trainY)
GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=5000, penalty='elasticnet', solver='saga'), param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100], 'l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}], return_train_score=True, scoring='accuracy')
print(grid_search.best_params_)
print(grid_search.best_score_) #accuracy
{'C': 0.05, 'l1_ratio': 0.4} 0.8624452520734321
grid_search.score(testX_prc, testY)
0.8586956521739131
Conclusion
As you can see from these tests, the three types of regularized logistic regression behave quite similarly in terms of performance. So, just use whichever you prefer, unless you are after the absolute best performance, in which case try fine-tuning all three. Anyway, I hope you now have a good understanding of logistic regression after this post. This is probably my longest post so far, so it is time to stop. See you again next time!