With the previous introduction to classification analysis, we can now discuss classification models in more detail. Since we had some exposure to logistic regression last time, let us continue its story. Logistic regression is pretty much the linear model of classification: at its core it is indeed linear, and it has a similar complexity to linear regression. So, without further ado, let us jump right in!
Logistic regression
First, let us recall that the equation of a simple linear model is y = ax + b, with y being the target, x being the single feature, and a and b being the coefficient and intercept, respectively. This equation gives us the straight-line pattern in a 2-dimensional visualization as in the left figure below. Furthermore, y is unbounded in linear regression. Previously, we have seen that logistic regression predictions instead form an S curve like in the right figure below.
Prediction of a linear regression model
Prediction of a logistic regression model
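If you would like to reproduce plots in this style yourself, here is a minimal sketch with an arbitrary coefficient and intercept (a = 1, b = 0), purely for illustration:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-10, 10, 200)
a, b = 1.0, 0.0  # arbitrary coefficient and intercept, just for illustration
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(x, a * x + b)  # linear regression: an unbounded straight line
axes[0].set_title('Linear regression')
axes[1].plot(x, 1 / (1 + np.exp(-(a * x + b))))  # logistic regression: an S curve between 0 and 1
axes[1].set_title('Logistic regression')
plt.show()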
So where does that S curve come from? We just wrap the ax + b part in a function called the Sigmoid. Furthermore, we know that logistic regression predicts the probability of an instance belonging to the positive class, which is denoted P(y=1) in general. Overall, we obtain the equation below for single-feature logistic regression:
P(y=1) = 1 / (1 + e^-(ax + b))
You can easily verify that, with very extreme values of x, (ax + b) becomes either very positive or very negative, which makes e^-(ax + b) either very close to 0 or very large, and in turn, P(y=1) approaches but never passes 1 or 0. The exponential function also makes the pattern curvy like in the figure. Finally, with more features, we just add more coefficients, still inside the Sigmoid function. The general equation of logistic regression is:
P(y=1) = 1 / (1 + e^-(a1x1 + a2x2 + … + anxn + b))
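To make this concrete, here is a tiny sketch (with made-up coefficients, purely for illustration) of how a logistic model turns a feature vector into a probability:
import numpy as np
a = np.array([0.8, -1.2, 0.3])  # made-up coefficients, one per feature
b = 0.5                         # made-up intercept
x = np.array([1.0, 2.0, -0.5])  # one instance with three features
z = np.dot(a, x) + b            # the linear part, a1*x1 + a2*x2 + a3*x3 + b
p = 1 / (1 + np.exp(-z))        # the Sigmoid squashes z into (0, 1)
print(p)                        # predicted P(y=1) for this instance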
Regularized logistic regression
Like linear models, logistic models without any constraints can overfit data and learn fake patterns. To avoid this, we also regularize them. At a high level, logistic models are trained by minimizing the training classification error, and we regularize them by adding a penalty term to that objective:
minimize (training error + penalty on coefficients)
Again, exactly like linear models, there are three types of penalties: the sum of squared coefficients, the sum of absolute coefficients, and a mixture of the two. In terms of names, we call them L2 regularization, L1 regularization, and elastic-net regularization, respectively. In terms of behavior, these three act just like ridge regression, lasso, and elastic-net, but for logistic models.
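As a quick illustration, the three penalty terms are easy to write down in numpy (coef is a made-up coefficient vector; scikit-learn scales the elastic-net mixture slightly differently, but the idea is the same):
import numpy as np
coef = np.array([0.8, -1.2, 0.3])    # made-up coefficients
l2_penalty = np.sum(coef ** 2)       # ridge-style: sum of squared coefficients
l1_penalty = np.sum(np.abs(coef))    # lasso-style: sum of absolute coefficients
l1_ratio = 0.5                       # elastic-net mixing weight between the two
enet_penalty = l1_ratio * l1_penalty + (1 - l1_ratio) * l2_penalty
print(l2_penalty, l1_penalty, enet_penalty)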
Demonstration
Loading data and preliminary analysis
You can access the complete notebook on my GitHub. I will be using the heart_disease.csv data set, which originally comes from Kaggle. It consists of data from 918 patients, including demographics and some medical measurements. The target column is HeartDisease, which indicates whether the patient had heart failure or not. After loading the data, info() shows no issues with missing values or data types.
import pandas as pd
import numpy as np
data = pd.read_csv('heart_disease.csv')
data.head(n=2)
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             918 non-null    int64
 1   Sex             918 non-null    object
 2   ChestPainType   918 non-null    object
 3   RestingBP       918 non-null    int64
 4   Cholesterol     918 non-null    int64
 5   FastingBS       918 non-null    int64
 6   RestingECG      918 non-null    object
 7   MaxHR           918 non-null    int64
 8   ExerciseAngina  918 non-null    object
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object
 11  HeartDisease    918 non-null    int64
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
Next, we perform a train-test split, then investigate histograms and bar charts. At a quick look, there are no serious issues. However, upon closer investigation, we can see some 0 values in Cholesterol and RestingBP. These numbers do not make sense medically, so I will turn them into missing values and then impute with the column medians (a quick count of these zeros is shown after the plots below).
from sklearn.model_selection import train_test_split
X = data.drop('HeartDisease', axis=1)
y = data['HeartDisease']
trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2)
num_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']
import matplotlib.pyplot as plt
trainX.hist(figsize=(6,8), bins=20)
plt.show()
cat_cols = ['Sex', 'ChestPainType','RestingECG', 'ExerciseAngina', 'ST_Slope']
for col in cat_cols:
print(col)
data[col].value_counts().plot(kind='bar', figsize=(4,2))
plt.show()
(Bar charts of Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope are shown here.)
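As a quick check on the zeros mentioned above, we can count them directly in the training features:
# number of (medically implausible) zero entries in each of the two columns
print((trainX[['Cholesterol', 'RestingBP']] == 0).sum())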
Processing
We only detect the issue with 0 values in Cholesterol and RestingBP, so my pipeline is as follows
– Numeric columns: 1) turn the 0 values in Cholesterol and RestingBP into missing values => 2) impute with the median => 3) standardization
– Categorical columns: one-hot encoding
In code, the full pipeline looks like this.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
def remove_0(X):
    # work on a copy so the original dataframe is not modified in place
    X = X.copy()
    X.loc[X['Cholesterol']==0, 'Cholesterol'] = np.nan
    X.loc[X['RestingBP']==0, 'RestingBP'] = np.nan
    return X
num_pipeline = Pipeline([
('remove 0', FunctionTransformer(remove_0, validate=False)),
('impute', SimpleImputer(strategy='median')),
('standardize', StandardScaler())
])
cat_pipeline = Pipeline([
('encode', OneHotEncoder())
])
full_pipeline = ColumnTransformer([
('numeric', num_pipeline, num_cols),
('class', cat_pipeline, cat_cols)
])
trainX_prc = full_pipeline.fit_transform(trainX)
testX_prc = full_pipeline.transform(testX)
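If you want to verify that the pipeline behaves as expected, a quick optional sanity check on the transformed arrays is enough:
# both sets should end up with the same number of processed feature columns after encoding
print(trainX_prc.shape, testX_prc.shape)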
Modeling
L2 regularized logistic regression
First, let us try the L2 regularized model. We simply add penalty='l2' to get this model. Remember that this method is similar to Ridge regression, so we need to fine-tune the regularization strength parameter, which is now called C (note that C is the inverse of the strength, so smaller values mean stronger regularization). Other than that, the code is pretty much the same: create a parameter grid, then create and fit a grid search. After fitting, we can obtain the selected C from best_params_, which is 0.05. The training CV accuracy is 85.56% from best_score_, and the testing accuracy is 88.04%, obtained by calling score with the testing features and labels.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]
#the default penalty is l2 anyway, but we specify it so that the code is clear
#logistic regression is also trained iteratively, so increase max_iter if you see convergence warnings from sklearn
logistic = LogisticRegression(penalty='l2', max_iter=5000)
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)
grid_search.fit(trainX_prc,trainY)
GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=5000), param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]}], return_train_score=True, scoring='accuracy')
print(grid_search.best_params_)
print(grid_search.best_score_) #accuracy
{'C': 0.05} 0.8556425309849967
grid_search.score(testX_prc, testY) #accuracy
0.8804347826086957
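If you are curious about the fitted model itself, the refitted best estimator is stored in best_estimator_, so you can inspect its coefficients (the exact values depend on your random train-test split):
best_l2 = grid_search.best_estimator_  # the L2 model refitted on the full training set with the selected C
print(best_l2.coef_)                   # one coefficient per processed feature
print(best_l2.intercept_)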
L1 regularized logistic regression
If L2 regularization is similar to Ridge, then L1 is similar to Lasso. And that is pretty much all there is to remember about this model. In terms of usage, we change penalty to 'l1' and add solver='liblinear', since the default solver in scikit-learn does not support the L1 penalty. The rest is the same as the L2 model. In this case, our L1 model gets a training CV accuracy of 85.97% and a testing accuracy of 85.87%.
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]
#now we need to specify penalty to l1
#also, we need to set solver to 'liblinear' because the default solver doesn't support l1
logistic = LogisticRegression(penalty='l1', max_iter=5000, solver='liblinear')
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)
grid_search.fit(trainX_prc,trainY)
GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=5000, penalty='l1', solver='liblinear'), param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]}], return_train_score=True, scoring='accuracy')
print(grid_search.best_params_)
print(grid_search.best_score_) #accuracy
{'C': 0.1} 0.859742801230081
grid_search.score(testX_prc, testY) #accuracy
0.8586956521739131
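Since the L1 penalty behaves like Lasso, it tends to push some coefficients to exactly zero. You can verify this on the fitted model (the exact count depends on your split):
best_l1 = grid_search.best_estimator_
print(np.sum(best_l1.coef_ == 0))  # number of coefficients zeroed out by the L1 penalty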
Elastic-net logistic regression
Finally, elastic-net logistic regression is the same as its counterpart in linear models. It uses a mixture of the L1 and L2 penalties, which means we also need to tune the l1_ratio parameter. To use this method, we set penalty='elasticnet' and solver='saga'. This one gets a training CV accuracy of 86.24% and a testing accuracy of 85.87%.
#now penalty is changed to elasticnet
#and we need to change solver to saga
logistic = LogisticRegression(penalty='elasticnet', max_iter=5000, solver='saga')
param_grid = [{
'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100],
'l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
}]
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)
grid_search.fit(trainX_prc,trainY)
GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=5000, penalty='elasticnet', solver='saga'), param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100], 'l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}], return_train_score=True, scoring='accuracy')
print(grid_search.best_params_)
print(grid_search.best_score_) #accuracy
{'C': 0.05, 'l1_ratio': 0.4} 0.8624452520734321
grid_search.score(testX_prc, testY)
0.8586956521739131
Conclusion
As you can see from these tests, the three types of regularized logistic regression behave quite similarly in terms of performance. So, just use whichever you prefer, unless you are after the absolute best performance, in which case try fine-tuning all three. Anyway, I hope you now have a good understanding of logistic regression after this post. This is probably my longest post so far, so it is time to stop. See you again next time!