Surely after discussing distribution analysis on numeric data, we will move on to categorical data, right? Of course! In this post, we will discuss tools for analyzing categorical distributions using Pandas and Matplotlib. Compared to their numeric counterparts, categorical columns have fewer tools, and those tools are also easier to understand. So, let us not waste any more time and start right away!
Data to demonstrate
In this post, I will use a data set with more categorical columns – the loans data, which is available on Kaggle. The data consists of information on bank loans such as loan amounts, types, statuses, and customers' financial states. This is an interesting data set with a lot to look at; however, I will focus only on its categorical columns for now. You can access the complete notebook here. First, we load the data in and check a few rows as well as its info().
import pandas as pd
loans = pd.read_csv('loans.csv')
loans.head(n=2)
  | Loan ID | Customer ID | Loan Status | Current Loan Amount | Term | Credit Score | Annual Income | Years in current job | Home Ownership | Purpose | Monthly Debt | Years of Credit History | Months since last delinquent | Number of Open Accounts | Number of Credit Problems | Current Credit Balance | Maximum Open Credit | Bankruptcies | Tax Liens |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14dd8831-6af5-400b-83ec-68e61888a048 | 981165ec-3274-42f5-a3b4-d104041a9ca9 | Fully Paid | 445412 | Short Term | 709.0 | 1167493.0 | 8 years | Home Mortgage | Home Improvements | 5214.74 | 17.2 | NaN | 6 | 1 | 228190 | 416746.0 | 1.0 | 0.0 |
1 | 4771cc26-131a-45db-b5aa-537ea4ba5342 | 2de017a3-2e01-49cb-a581-08169e83be29 | Fully Paid | 262328 | Short Term | NaN | NaN | 10+ years | Home Mortgage | Debt Consolidation | 33295.98 | 21.1 | 8.0 | 35 | 0 | 229976 | 850784.0 | 0.0 | 0.0 |
loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   Loan ID                       100000 non-null  object
 1   Customer ID                   100000 non-null  object
 2   Loan Status                   100000 non-null  object
 3   Current Loan Amount           100000 non-null  int64
 4   Term                          100000 non-null  object
 5   Credit Score                  80846 non-null   float64
 6   Annual Income                 80846 non-null   float64
 7   Years in current job          95778 non-null   object
 8   Home Ownership                99786 non-null   object
 9   Purpose                       99790 non-null   object
 10  Monthly Debt                  100000 non-null  float64
 11  Years of Credit History       100000 non-null  float64
 12  Months since last delinquent  46859 non-null   float64
 13  Number of Open Accounts       100000 non-null  int64
 14  Number of Credit Problems     100000 non-null  int64
 15  Current Credit Balance       100000 non-null  int64
 16  Maximum Open Credit           99998 non-null   float64
 17  Bankruptcies                  99796 non-null   float64
 18  Tax Liens                     99990 non-null   float64
dtypes: float64(8), int64(4), object(7)
memory usage: 14.5+ MB
All columns are in their correct data types. The object columns are Loan ID, Customer ID, Loan Status, Term, Years in current job, Home Ownership, and Purpose. Next, I will create a slice with only the categorical columns for ease of access later on. Of course, Loan ID and Customer ID are not in this slice, since they are identifiers rather than categories.
loans_cat = loans[['Loan Status', 'Term', 'Years in current job', 'Home Ownership', 'Purpose']]
loans_cat.head(n=3)
  | Loan Status | Term | Years in current job | Home Ownership | Purpose |
---|---|---|---|---|---|
0 | Fully Paid | Short Term | 8 years | Home Mortgage | Home Improvements |
1 | Fully Paid | Short Term | 10+ years | Home Mortgage | Debt Consolidation |
2 | Fully Paid | Short Term | 8 years | Own Home | Debt Consolidation |
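As a side note, if you would rather not type the column names out, a minimal alternative (just a sketch, under the assumption that the two ID columns are the only non-categorical object columns) is to let Pandas select the object columns and then drop the IDs:

# Select all object (text) columns, then drop the identifier columns
loans_cat = loans.select_dtypes(include='object').drop(columns=['Loan ID', 'Customer ID'])
loans_cat.columns.tolist()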
Tools for analyzing categorical distributions
For analyzing categorical distributions, we have a few tools: frequency tables, bar charts, and pie charts. First, frequency tables are exactly what their name suggests: they list all the unique classes in a column along with their frequencies. Next, bar charts are simply visualizations of frequency tables and are similar to histograms. In a bar chart, there is one bar for each class, and the height of the bar represents the frequency of that class. Finally, pie charts illustrate the proportions of the classes. A pie chart has a slice for each class, and the slices' areas represent the classes' ratios in the data. Below is an example of a frequency table, a bar chart, and a pie chart for the same data.
Category | Count |
---|---|
Strongly Disagree | 32 |
Disagree | 43 |
Neutral | 75 |
Agree | 135 |
Strongly Agree | 96 |
Frequency table
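As a quick illustration, here is a minimal sketch (the counts are just the example numbers from the table above, stored in a hypothetical Series called survey) showing how all three views come from the same data:

import pandas as pd
import matplotlib.pyplot as plt

# The example counts from the table above, as a Pandas Series
survey = pd.Series({'Strongly Disagree': 32, 'Disagree': 43, 'Neutral': 75,
                    'Agree': 135, 'Strongly Agree': 96}, name='Count')

print(survey)              # frequency table: classes and their counts
survey.plot.bar(rot=30)    # bar chart: one bar per class, height = count
plt.show()
survey.plot.pie()          # pie chart: one slice per class, area = proportion
plt.show()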
Now let us create the three tools in Python. For frequency tables, we use the value_counts() function. However, it is a bit unfortunate that we need to apply value_counts() to each column individually: calling it on the whole dataframe would instead count unique row combinations (a cross-frequency table). To automate this, we can use a for loop with a variable iterating through the columns. The code is as follows. You can see that I added a print() for the column names, and another print() that adds a separator line between two tables, all for readability.
for col in loans_cat:
    print(col)
    print(loans_cat[col].value_counts())
    print("----------------------------")
Loan Status
Fully Paid     77361
Charged Off    22639
Name: Loan Status, dtype: int64
----------------------------
Term
Short Term    72208
Long Term     27792
Name: Term, dtype: int64
----------------------------
Years in current job
10+ years    31121
2 years       9134
3 years       8169
< 1 year      8164
5 years       6787
1 year        6460
4 years       6143
6 years       5686
7 years       5577
8 years       4582
9 years       3955
Name: Years in current job, dtype: int64
----------------------------
Home Ownership
Home Mortgage    48410
Rent             42194
Own Home          9182
Name: Home Ownership, dtype: int64
----------------------------
Purpose
Debt Consolidation    78552
other                  6037
Home Improvements      5839
Other                  3250
Business Loan          1569
Buy a Car              1265
Medical Bills          1127
Buy House               678
Take a Trip             573
major_purchase          352
small_business          283
moving                  150
wedding                 115
Name: Purpose, dtype: int64
----------------------------
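As an aside, the cross-frequency table mentioned above is what you get when value_counts() is called on several columns at once: it counts unique combinations of classes. A quick sketch, assuming pandas 1.1 or newer (which added DataFrame.value_counts()):

# Frequency of each (Loan Status, Term) combination
loans_cat[['Loan Status', 'Term']].value_counts()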
Reading frequency tables is nothing complicated – they are literally classes and their counts. Still, we should pay attention to classes with very low counts (compared to the rest), like the last four or five in Purpose. These rare classes may need special treatment in later processing.
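To make "very low" more concrete, value_counts() can also report proportions instead of raw counts via the normalize= option; a small sketch:

# Share of each class in Purpose; the rare classes show up as tiny fractions
loans_cat['Purpose'].value_counts(normalize=True)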
Bar charts
Now, let us move on to bar charts. To draw a bar chart, we add .plot.bar() after the call to value_counts(). This again means drawing one chart per column, so we use another for loop. Since the code now involves plotting, we import matplotlib just like before. The rot= option rotates the class labels so that they do not overlap each other. However, that alone is not enough for columns with long class names, so I added two more lines of code (the ones starting with ax) to trim the labels to 11 characters. You can remove those lines if your data does not have such columns.
import matplotlib.pyplot as plt
for col in loans_cat:
    print(col)
    plt.figure(figsize=(9, 5))
    loans_cat[col].value_counts().plot.bar(rot=30)
    ax = plt.gca()
    ax.set_xticklabels([item.get_text()[:11] for item in ax.get_xticklabels()])
    plt.show()
    print("----------------------------")
[Bar charts of the value counts for Loan Status, Term, Years in current job, Home Ownership, and Purpose]
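If trimming the labels feels too aggressive, another option (just a sketch, not what the notebook above does) is a horizontal bar chart, where the class names sit on the y-axis and never overlap:

import matplotlib.pyplot as plt

# Horizontal bars give long class names like those in Purpose room to breathe
plt.figure(figsize=(9, 5))
loans_cat['Purpose'].value_counts().plot.barh()
plt.show()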
Pie charts
Drawing pie charts is similar to drawing bar charts: we simply replace the bar() function with pie(). And that is actually it! In terms of visualization, we can see some issues, like the labels being a bit small, or labels overlapping each other when their slices are too small. Surely, we can adjust all of this; however, this is not a post on visualization (yet!), so I will leave that for later.
import matplotlib.pyplot as plt
for col in loans_cat:
    print(col)
    plt.figure(figsize=(10, 10))
    loans_cat[col].value_counts().plot.pie()
    plt.show()
    print("----------------------------")
[Pie charts of the value counts for Loan Status, Term, Years in current job, Home Ownership, and Purpose]
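That said, if you want a quick taste of the adjustments, here is a small sketch of two options I am not covering in detail yet: autopct= prints each slice's percentage, and fontsize= enlarges the labels.

import matplotlib.pyplot as plt

# One pie chart with percentage labels and larger text
plt.figure(figsize=(10, 10))
loans_cat['Home Ownership'].value_counts().plot.pie(autopct='%1.1f%%', fontsize=14)
plt.show()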
Conclusion
Analyzing categorical distributions is somewhat easier than analyzing numeric ones because the tools are less complicated. Regardless, it is important to do, as it can reveal potential issues like the rare classes in Purpose. Now that we know how to perform distribution analysis, we will move on to correlation analysis in the next post. So, stay tuned, and see you!