Surely after discussing distribution analysis on numeric data, we will move on to categorical data, right? Of course! In this post, we will discuss tools for analyzing categorical distributions using Pandas and Matplotlib. Compared to their numeric counterparts, categorical columns have fewer tools, and those tools are also easier to understand. So, let us not waste any more time and start right away!
Data to demonstrate
In this post, I will use a data set with more categorical columns – the loans data, which is available on Kaggle. The data consists of information on bank loans such as loan amounts, types, statuses, and customers' financial states. This is an interesting data set with a lot to look at; however, I will focus only on its categorical columns for now. You can access the complete notebook here. First, we load the data in and check a few rows as well as its info().
import pandas as pd
loans = pd.read_csv('loans.csv')
loans.head(n=2)
  | Loan ID | Customer ID | Loan Status | Current Loan Amount | Term | Credit Score | Annual Income | Years in current job | Home Ownership | Purpose | Monthly Debt | Years of Credit History | Months since last delinquent | Number of Open Accounts | Number of Credit Problems | Current Credit Balance | Maximum Open Credit | Bankruptcies | Tax Liens |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14dd8831-6af5-400b-83ec-68e61888a048 | 981165ec-3274-42f5-a3b4-d104041a9ca9 | Fully Paid | 445412 | Short Term | 709.0 | 1167493.0 | 8 years | Home Mortgage | Home Improvements | 5214.74 | 17.2 | NaN | 6 | 1 | 228190 | 416746.0 | 1.0 | 0.0 |
1 | 4771cc26-131a-45db-b5aa-537ea4ba5342 | 2de017a3-2e01-49cb-a581-08169e83be29 | Fully Paid | 262328 | Short Term | NaN | NaN | 10+ years | Home Mortgage | Debt Consolidation | 33295.98 | 21.1 | 8.0 | 35 | 0 | 229976 | 850784.0 | 0.0 | 0.0 |
loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   Loan ID                       100000 non-null  object
 1   Customer ID                   100000 non-null  object
 2   Loan Status                   100000 non-null  object
 3   Current Loan Amount           100000 non-null  int64
 4   Term                          100000 non-null  object
 5   Credit Score                  80846 non-null   float64
 6   Annual Income                 80846 non-null   float64
 7   Years in current job          95778 non-null   object
 8   Home Ownership                99786 non-null   object
 9   Purpose                       99790 non-null   object
 10  Monthly Debt                  100000 non-null  float64
 11  Years of Credit History       100000 non-null  float64
 12  Months since last delinquent  46859 non-null   float64
 13  Number of Open Accounts       100000 non-null  int64
 14  Number of Credit Problems     100000 non-null  int64
 15  Current Credit Balance       100000 non-null  int64
 16  Maximum Open Credit           99998 non-null   float64
 17  Bankruptcies                  99796 non-null   float64
 18  Tax Liens                     99990 non-null   float64
dtypes: float64(8), int64(4), object(7)
memory usage: 14.5+ MB
All columns are in their correct data types. The object columns are Loan ID, Customer ID, Loan Status, Term, Years in current job, Home Ownership, and Purpose. Next, I will create a slice with only the categorical columns for ease of access later on. Of course, Loan ID and Customer ID are not in this slice, since they are identifiers rather than categories.
loans_cat = loans[['Loan Status', 'Term', 'Years in current job', 'Home Ownership', 'Purpose']]
loans_cat.head(n=3)
  | Loan Status | Term | Years in current job | Home Ownership | Purpose |
---|---|---|---|---|---|
0 | Fully Paid | Short Term | 8 years | Home Mortgage | Home Improvements |
1 | Fully Paid | Short Term | 10+ years | Home Mortgage | Debt Consolidation |
2 | Fully Paid | Short Term | 8 years | Own Home | Debt Consolidation |
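As a side note, if you would rather not type the column names out, a minimal alternative (just a sketch, under the assumption that the two ID columns are the only non-categorical object columns) is to let Pandas select the object columns and then drop the IDs:

# Select all object (text) columns, then drop the identifier columns
loans_cat = loans.select_dtypes(include='object').drop(columns=['Loan ID', 'Customer ID'])
loans_cat.columns.tolist()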
Tools for analyzing categorical distributions
For analyzing categorical distributions, we have a few tools: frequency tables, bar charts, and pie charts. First, frequency tables are exactly what their name suggests: they list all the unique classes in a column along with their frequencies. Next, bar charts are simply visualizations of frequency tables and are similar to histograms. In a bar chart, there is one bar for each class, and the height of the bar represents the frequency of that class. Finally, pie charts illustrate the proportions of the classes. A pie chart has a slice for each class, and the slices' areas represent the classes' ratios in the data. Below is an example of a frequency table, a bar chart, and a pie chart for the same data.
Category | Count |
---|---|
Strongly Disagree | 32 |
Disagree | 43 |
Neutral | 75 |
Agree | 135 |
Strongly Agree | 96 |
Frequency table
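As a quick illustration, here is a minimal sketch (the counts are just the example numbers from the table above, stored in a hypothetical Series called survey) showing how all three views come from the same data:

import pandas as pd
import matplotlib.pyplot as plt

# The example counts from the table above, as a Pandas Series
survey = pd.Series({'Strongly Disagree': 32, 'Disagree': 43, 'Neutral': 75,
                    'Agree': 135, 'Strongly Agree': 96}, name='Count')

print(survey)              # frequency table: classes and their counts
survey.plot.bar(rot=30)    # bar chart: one bar per class, height = count
plt.show()
survey.plot.pie()          # pie chart: one slice per class, area = proportion
plt.show()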
Now let us create the three tools in Python. For frequency tables, we use the value_counts() function. However, it is a bit unfortunate that we need to apply value_counts() to each column individually: calling it on the whole dataframe would instead count unique row combinations (a cross-frequency table). To automate this, we can use a for loop with a variable iterating through the columns. The code is as follows. You can see that I added a print() for the column names, and another print() that adds a separator line between two tables, all for readability.
for col in loans_cat:
    print(col)
    print(loans_cat[col].value_counts())
    print("----------------------------")
Loan Status
Fully Paid     77361
Charged Off    22639
Name: Loan Status, dtype: int64
----------------------------
Term
Short Term    72208
Long Term     27792
Name: Term, dtype: int64
----------------------------
Years in current job
10+ years    31121
2 years       9134
3 years       8169
< 1 year      8164
5 years       6787
1 year        6460
4 years       6143
6 years       5686
7 years       5577
8 years       4582
9 years       3955
Name: Years in current job, dtype: int64
----------------------------
Home Ownership
Home Mortgage    48410
Rent             42194
Own Home          9182
Name: Home Ownership, dtype: int64
----------------------------
Purpose
Debt Consolidation    78552
other                  6037
Home Improvements      5839
Other                  3250
Business Loan          1569
Buy a Car              1265
Medical Bills          1127
Buy House               678
Take a Trip             573
major_purchase          352
small_business          283
moving                  150
wedding                 115
Name: Purpose, dtype: int64
----------------------------
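As an aside, the cross-frequency table mentioned above is what you get when value_counts() is called on several columns at once: it counts unique combinations of classes. A quick sketch, assuming pandas 1.1 or newer (which added DataFrame.value_counts()):

# Frequency of each (Loan Status, Term) combination
loans_cat[['Loan Status', 'Term']].value_counts()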
Reading frequency tables is nothing complicated – they are literally classes and their counts. Still, we should pay attention to classes with very low counts (compared to the rest), like the last four or five in Purpose. These rare classes may need special treatment in later processing.
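To make "very low" more concrete, value_counts() can also report proportions instead of raw counts via the normalize= option; a small sketch:

# Share of each class in Purpose; the rare classes show up as tiny fractions
loans_cat['Purpose'].value_counts(normalize=True)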
Bar charts
Now, let us move on to bar charts. To draw a bar chart, we add .plot.bar() after the call to value_counts(). This again means drawing one chart per column, so we use another for loop. Since the code now involves plotting, we import matplotlib just like before. The rot= option rotates the class labels so that they do not overlap each other. However, that alone is not enough for columns with long class names, so I added two more lines of code (the ones starting with ax) to trim the labels to 11 characters. You can remove those lines if your data does not have such columns.
import matplotlib.pyplot as plt
for col in loans_cat:
    print(col)
    plt.figure(figsize=(9, 5))
    loans_cat[col].value_counts().plot.bar(rot=30)
    ax = plt.gca()
    ax.set_xticklabels([item.get_text()[:11] for item in ax.get_xticklabels()])
    plt.show()
    print("----------------------------")
[Bar charts of the value counts for Loan Status, Term, Years in current job, Home Ownership, and Purpose]
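If trimming the labels feels too aggressive, another option (just a sketch, not what the notebook above does) is a horizontal bar chart, where the class names sit on the y-axis and never overlap:

import matplotlib.pyplot as plt

# Horizontal bars give long class names like those in Purpose room to breathe
plt.figure(figsize=(9, 5))
loans_cat['Purpose'].value_counts().plot.barh()
plt.show()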
Pie charts
Drawing pie charts is similar to drawing bar charts: we simply replace the bar() function with pie(). And that is actually it! In terms of visualization, we can see some issues, like the labels being a bit small, or labels overlapping each other when their slices are too small. Surely, we can adjust all of this; however, this is not a post on visualization (yet!), so I will leave that for later.
import matplotlib.pyplot as plt
for col in loans_cat:
    print(col)
    plt.figure(figsize=(10, 10))
    loans_cat[col].value_counts().plot.pie()
    plt.show()
    print("----------------------------")
[Pie charts of the value counts for Loan Status, Term, Years in current job, Home Ownership, and Purpose]
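That said, if you want a quick taste of the adjustments, here is a small sketch of two options I am not covering in detail yet: autopct= prints each slice's percentage, and fontsize= enlarges the labels.

import matplotlib.pyplot as plt

# One pie chart with percentage labels and larger text
plt.figure(figsize=(10, 10))
loans_cat['Home Ownership'].value_counts().plot.pie(autopct='%1.1f%%', fontsize=14)
plt.show()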
Conclusion
Analyzing categorical distributions is somewhat easier than analyzing numeric ones because the tools are less complicated. Regardless, it is important to do, as it can reveal potential issues like the rare classes in Purpose. Now that we know how to perform distribution analysis, we will move on to correlation analysis in the next post. So, stay tuned, and see you!