Exploring Categorical Distributions

An illustration of tools for analyzing categorical distributions, such as frequency tables, bar charts, and pie charts

Surely, after discussing distribution analysis on numeric data, we will move on to categorical data, right? Of course! In this post, we will discuss tools for analyzing categorical distributions using Pandas and Matplotlib. Compared with numeric data, categorical data has fewer tools, and they are easier to understand. So, let us not waste any more time and start right away!

Data to demonstrate

In this post, I will use a data set with more categorical columns – the loans data, which is available on Kaggle. The data consists of information on bank loans, such as loan amounts, types, and statuses, as well as the customers’ financial states. This is an interesting data set with a lot to look at; however, I will focus only on its categorical columns for now. You can access the complete notebook here. First, we load the data and check a few rows as well as the output of info().
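As a rough sketch of this loading step: the file name of the Kaggle CSV is not given here, so the example below reads a tiny in-memory stand-in instead. The rows are made up for illustration, and "Current Loan Amount" is an assumed name for the numeric loan-amount column.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV: the rows are made up, and
# "Current Loan Amount" is an assumed name for the numeric loan-amount column.
csv_text = """Loan ID,Customer ID,Loan Status,Current Loan Amount,Term,Years in current job,Home Ownership,Purpose
L001,C001,Fully Paid,12000,Short Term,10+ years,Home Mortgage,Debt Consolidation
L002,C002,Charged Off,8500,Long Term,2 years,Rent,Home Improvements
L003,C003,Fully Paid,23000,Short Term,5 years,Own Home,Debt Consolidation
L004,C004,Fully Paid,4000,Short Term,< 1 year,Rent,Buy a Car
"""

# With the real file, pass the path of the downloaded CSV to pd.read_csv instead.
loans = pd.read_csv(io.StringIO(csv_text))

print(loans.head())  # first few rows
loans.info()         # column names, non-null counts, and data types
```

With the real data, info() is where you would confirm that the text columns and the numeric columns were read with the types you expect.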

All columns are in their correct data types. The object columns are Loan ID, Customer ID, Loan Status, Term, Years in current job, Home Ownership, and Purpose. Next, I will create a slice with only the categorical data for ease of access later on. Loan ID and Customer ID are identifiers rather than categories, so they are not in this slice.
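One way to build such a slice is to list the categorical columns explicitly, as sketched below. The rows are made up; only the column names come from the list above, and the variable names loans and loans_cat are assumptions for this illustration.

```python
import pandas as pd

# Made-up rows standing in for the loans data; only the column names come from the post.
loans = pd.DataFrame({
    "Loan ID": ["L001", "L002", "L003"],
    "Customer ID": ["C001", "C002", "C003"],
    "Loan Status": ["Fully Paid", "Charged Off", "Fully Paid"],
    "Term": ["Short Term", "Long Term", "Short Term"],
    "Years in current job": ["10+ years", "2 years", "5 years"],
    "Home Ownership": ["Home Mortgage", "Rent", "Own Home"],
    "Purpose": ["Debt Consolidation", "Home Improvements", "Debt Consolidation"],
})

# The two ID columns are text as well, but they identify rows rather than classes,
# so they are left out of the list.
cat_cols = ["Loan Status", "Term", "Years in current job", "Home Ownership", "Purpose"]

# .copy() keeps later changes to the slice from touching the original dataframe.
loans_cat = loans[cat_cols].copy()
print(loans_cat.columns.tolist())
```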

Tools for analyzing categorical distributions

For analysis on categorical distributions, we have a few tools: frequency tables, bar charts, and pie charts. First, frequency tables are just what their name says: they list all unique classes in a column together with their frequencies. Next, bar charts are visualizations of frequency tables and are similar to histograms. In a bar chart, there is one bar for each class, and the height of the bar represents the frequency of the class. Finally, pie charts illustrate the proportions of the classes. A pie chart has one slice for each class, and the area of each slice represents the proportion of that class in the data. Below is an example of a frequency table, a bar chart, and a pie chart for the same data.

Category             Count
Strongly Disagree    32
Disagree             43
Neutral              75
Agree                135
Strongly Agree       96

[Figure: bar chart illustrating the categorical distribution]
[Figure: pie chart illustrating the categorical distribution]
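To make the link between the three views concrete, the example counts above can be put in a pandas Series. The bar heights are the counts themselves, while the pie slices correspond to each count divided by the total. This is only a small sketch; the variable names are chosen for illustration.

```python
import pandas as pd

# The example frequency table from above.
counts = pd.Series(
    {
        "Strongly Disagree": 32,
        "Disagree": 43,
        "Neutral": 75,
        "Agree": 135,
        "Strongly Agree": 96,
    },
    name="Count",
)

# Each class's share of the total is what a pie slice represents.
shares = counts / counts.sum()

print(pd.DataFrame({"Count": counts, "Share": shares.round(3)}))
```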

Frequency table

Now let us create the three tools in Python. For frequency tables, we use the value_counts() method. Unfortunately, we need to apply value_counts() to each column instead of the whole dataframe (calling it on the whole dataframe counts combinations of values across columns, which gives a cross-frequency table). To automate this, we can use a for loop with a variable iterating through the columns. In the code, I added a print() for the column names and another print() that adds a line between two tables, all for readability.
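A minimal sketch of such a loop is shown here. The dataframe is a made-up stand-in with three of the categorical columns, and the freq_tables dictionary is an extra added for this illustration so the tables can be reused later.

```python
import pandas as pd

# Made-up stand-in for the categorical slice of the loans data.
loans_cat = pd.DataFrame({
    "Term": ["Short Term", "Long Term", "Short Term", "Short Term", "Long Term"],
    "Home Ownership": ["Home Mortgage", "Rent", "Rent", "Own Home", "Rent"],
    "Purpose": [
        "Debt Consolidation", "Home Improvements", "Debt Consolidation",
        "Buy a Car", "Debt Consolidation",
    ],
})

freq_tables = {}  # optional: keep each table for later use
for col in loans_cat.columns:
    table = loans_cat[col].value_counts()  # classes and counts, most frequent first
    freq_tables[col] = table
    print(col)       # column name as a header
    print(table)
    print("-" * 30)  # separator line between two tables
```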

Reading frequency tables is nothing complicated – they are literally classes and their counts. Still, we should pay attention to classes with very low counts compared to the rest, like the last four or five in Purpose. These may need special treatment in later processing.
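If you would rather not spot rare classes by eye, one possible approach is to compare each class's share with a cut-off. The sketch below uses made-up Purpose values, and the 5% threshold is an arbitrary assumption, not a rule from this post.

```python
import pandas as pd

# Made-up Purpose values; only the column name comes from the post.
purpose = pd.Series(
    ["Debt Consolidation"] * 90
    + ["Home Improvements"] * 8
    + ["Buy a Car"] * 1
    + ["Vacation"] * 1,
    name="Purpose",
)

min_share = 0.05  # hypothetical cut-off; pick one that suits your data

# normalize=True returns proportions instead of raw counts.
shares = purpose.value_counts(normalize=True)
rare = shares[shares < min_share]
print(rare)
```

What to do with the flagged classes (for example, merging them into a single group) depends on the later processing steps.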

Bar charts

Now, let us move on to bar charts. To draw a bar chart, we add .plot.bar() after the call to value_counts(). This means that we have to draw a chart for each column separately, so we use another for loop. Since the code now involves plotting, we import matplotlib just like before. The rot= option rotates the class labels so that they do not overlap each other. However, the option is not enough for columns with long class names, so I added another two lines of code (those that start with ax) to trim the labels to 11 characters. You can remove this code if your data does not have such columns.
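The sketch below shows one way this loop could look; it is not necessarily identical to the notebook's code. The data is a made-up stand-in, the 45-degree rotation is an assumed value for rot=, and the bar_axes dictionary is only there so the axes can be inspected afterwards.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the categorical slice of the loans data.
loans_cat = pd.DataFrame({
    "Term": ["Short Term", "Long Term", "Short Term", "Short Term", "Long Term"],
    "Home Ownership": ["Home Mortgage", "Rent", "Rent", "Own Home", "Rent"],
    "Purpose": [
        "Debt Consolidation", "Home Improvements", "Debt Consolidation",
        "Buy a Car", "Debt Consolidation",
    ],
})

bar_axes = {}
for col in loans_cat.columns:
    fig, ax = plt.subplots()  # a separate figure for each column
    loans_cat[col].value_counts().plot.bar(ax=ax, rot=45, title=col)
    # Trim long class names to 11 characters so the tick labels stay readable.
    short_labels = [tick.get_text()[:11] for tick in ax.get_xticklabels()]
    ax.set_xticklabels(short_labels, rotation=45)
    bar_axes[col] = ax
```

In a notebook the figures are displayed at the end of the cell; in a plain script, call plt.show() after the loop.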

Pie charts

Drawing pie charts is similar to drawing bar charts: we simply replace the bar() function with pie(). And that is actually it! In terms of visualization, we can see some issues: the text is a bit small, and some labels overlap each other when their slices are too small. Surely, we can adjust all this. However, this is not a post on visualization (yet!), so I will leave that for later.
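As a sketch under the same assumptions as before (made-up data, illustrative variable names), the pie chart loop could look like this. Clearing the y-axis label is an optional extra that is not described in this post.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the categorical slice of the loans data.
loans_cat = pd.DataFrame({
    "Term": ["Short Term", "Long Term", "Short Term", "Short Term", "Long Term"],
    "Home Ownership": ["Home Mortgage", "Rent", "Rent", "Own Home", "Rent"],
    "Purpose": [
        "Debt Consolidation", "Home Improvements", "Debt Consolidation",
        "Buy a Car", "Debt Consolidation",
    ],
})

pie_axes = {}
for col in loans_cat.columns:
    fig, ax = plt.subplots()  # a separate figure for each column
    # Same pattern as the bar chart, with pie() in place of bar().
    loans_cat[col].value_counts().plot.pie(ax=ax, title=col)
    ax.set_ylabel("")  # optional: drop the default y-axis label next to the pie
    pie_axes[col] = ax
```

As with the bar charts, call plt.show() after the loop when running this outside a notebook.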

Conclusion

Analyzing categorical distributions is somewhat easier than analyzing numeric ones because the tools are fewer and less complicated. Regardless, it is important to do, as it can reveal potential issues like the rare classes in Purpose. Now that we know how to perform distribution analysis, we will move on to correlation analysis in the next post. So, stay tuned, and see you!
