Correlation with Categorical Data

an illustration of different tools for analysis of correlation with categorical data like side-by-side boxplots and stacked bar charts

Previously, we learned about tools for correlation analysis on numeric columns. By now, though, you should know that numeric is not the only type of data. In fact, most tabular data you find out there has a mixture of numeric and categorical columns. Therefore, analyzing columns’ correlation with categorical data is also important. With that motivation, in this post, we will discuss ways to examine correlations in which at least one column is categorical. This means that we will go through numeric vs. categorical and categorical vs. categorical. Like the previous post, we will focus more on visualization tools than on statistical measurements and tests. So, let us start right away!

Data in demonstration

Once again, we will use the students1000.csv data; the complete notebook is available here. We load the data and drop StudentID from the dataframe, as it is not of interest at this point. This requires the dataframe function drop(), with the inputs being the column to drop and axis=1 to specify that the removed object is a column.
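As a minimal sketch of this step, the snippet below loads a tiny made-up stand-in for the data and drops StudentID. Only the column names come from this post; the rows and the variable name students are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up rows standing in for students1000.csv (only the column names are from the post)
csv_text = """StudentID,Major,State,FirstYearGPA
1,Math,NY,3.1
2,Biology,WA,2.8
3,Math,CA,3.6
"""

# With the real file, this would be pd.read_csv("students1000.csv")
students = pd.read_csv(io.StringIO(csv_text))

# axis=1 tells drop() that "StudentID" is a column label, not a row label
students = students.drop("StudentID", axis=1)
print(students.columns.tolist())
```

Note that drop() returns a new dataframe, so we assign the result back to keep the change.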

Numeric columns’ correlation with categorical data

Measuring the correlation between a categorical column and a numeric column is different from doing so for two numeric ones, simply because there are no exact increases or decreases in value among the categories. Of course, if the categories are ordinal, you can transform them into numbers like 1, 2, 3, and the analysis still works somewhat. Even then, the gaps between the ordinal values may not be equal in the way that 2 - 1 and 3 - 2 are, so correlation coefficients and scatter plots are not as useful. In general, we evaluate the correlation of a categorical and a numeric column by checking whether the distribution of the numeric column differs across the classes of the categorical one.
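To make "the distribution differs by class" a bit more concrete, here is a small sketch that summarizes a numeric column separately for each class. The rows are made up for illustration; only the column names Major and FirstYearGPA come from the data used in this post.

```python
import pandas as pd

# Made-up rows; only the column names are from the post
students = pd.DataFrame({
    "Major": ["Math", "Math", "Math", "Biology", "Biology", "Biology"],
    "FirstYearGPA": [3.0, 3.4, 3.8, 2.0, 2.4, 2.8],
})

# One row of summary statistics per class of the categorical column
summary = students.groupby("Major")["FirstYearGPA"].describe()
print(summary[["count", "mean", "25%", "50%", "75%"]])
```

If the quartiles and medians look alike across the classes, the two columns are probably not strongly correlated. The boxplots discussed next show the same information visually.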

While there are analysis tools that provide exact measurements, we will stick to a more practical tool: the side-by-side boxplot. We have discussed the boxplot previously as a tool to describe individual columns. Going further than that, a side-by-side boxplot splits the values of the numeric column into groups that share the same class in a categorical column. The data in each group is then drawn as one box in the plot. So, by looking at all the boxes, we can compare how the distribution of the numeric column varies with the categorical column. Below is one example of a side-by-side boxplot where GPA is split into four study years: Freshman, Sophomore, Junior, and Senior.

a side-by-side boxplot that illustrates a numeric column's correlation with categorical columns

The more strongly the two columns are correlated, the more the boxes differ across the classes. In the example below, from left to right, we have a strongly correlated case where the boxes cover very different ranges, and a no-correlation case where the two boxes largely overlap each other.

a side-by-side boxplot that illustrates a numeric column's strong correlation with categorical columns
a side-by-side boxplot that illustrates a numeric column's weak correlation with categorical columns

In Pandas

With Pandas, we can draw side-by-side boxplots very easily. First, we slice the two columns of interest from the dataframe, then we call the boxplot() function with the option by='cat_col' to specify the stratifying column. In short, the statement is dataframe[['cat_col','num_col']].boxplot(by='cat_col'). We can add other options like rot= for label rotation and figsize= to set the figure size. Below is an example with Major and FirstYearGPA. We can see that there is barely any correlation between these two columns, since the four boxes overlap each other by a large margin.
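The statement above can be sketched as follows. The rows here are made up so that the snippet runs on its own (with only three majors), so the resulting plot will not match the one from the real data. The Agg backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; only the column names are from the post
students = pd.DataFrame({
    "Major": ["Math", "Math", "Math",
              "Biology", "Biology", "Biology",
              "History", "History", "History"],
    "FirstYearGPA": [3.0, 3.4, 3.8, 2.9, 3.3, 3.7, 3.1, 3.5, 3.6],
})

# Slice the categorical and numeric columns, then draw one box per Major
ax = students[["Major", "FirstYearGPA"]].boxplot(by="Major", rot=45, figsize=(6, 4))
fig = plt.gcf()  # the figure that boxplot() just drew on
```

In a notebook the figure is displayed automatically; in a script you could call fig.savefig() with a file name of your choice.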

Correlation with categorical columns

As with categorical and numeric columns, the correlation of two categorical columns is defined by how the distribution of one varies with the other. The visualizations for this are the stacked bar chart and the 100% stacked bar chart. The former encodes the number of instances in each stratified class in the bars’ heights, whereas the latter scales every bar to 100% and shows proportions instead of frequencies.

To draw a stacked bar chart, we use the Pandas function crosstab() with the two sliced columns as inputs (the stratified one goes second), followed by plot.bar(stacked=True). As before, we can add options for rotation and figure size. Below is an example that draws the stacked bar chart of Major stratified by State. We can see that this plot is rather difficult to use because the states have very different frequencies.
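Here is a minimal sketch of that recipe on made-up rows (only the column names State and Major, and the state codes NY and WA, appear in this post). In crosstab(), the first input becomes the bars on the x-axis and the second becomes the stacked segments, so State goes first and Major second.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# Made-up rows for illustration
students = pd.DataFrame({
    "State": ["CA", "CA", "CA", "CA", "NY", "NY", "WA"],
    "Major": ["Math", "Math", "Biology", "History", "Math", "Biology", "History"],
})

# Rows = State (one bar each), columns = Major (segments within a bar)
counts = pd.crosstab(students["State"], students["Major"])
print(counts)

# stacked=True piles the Major counts on top of each other within each State bar
ax = counts.plot.bar(stacked=True, rot=0, figsize=(6, 4))
```

Because the bar heights are raw counts, states with few students produce short bars that are hard to compare with the tall ones.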

Therefore, it is probably easier to compare distributions using the proportions of classes with the 100% stacked bar chart. Drawing this one is a bit more complicated because we have to calculate the proportion of each subclass manually. The code below performs all the necessary calculations, and you can reuse it. You just need to change the dataframe name and the column names, and remember that the stratified column comes first.
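One possible way to do this calculation is sketched below. This is not necessarily the code from the notebook: it uses made-up rows, and it places the x-axis column (State) first in crosstab() and the segment column (Major) second, so check the argument order against your own code before reusing it.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# Made-up rows for illustration
students = pd.DataFrame({
    "State": ["CA", "CA", "CA", "CA", "NY", "NY", "WA"],
    "Major": ["Math", "Math", "Biology", "History", "Math", "Biology", "History"],
})

# Frequencies: one row per State, one column per Major
counts = pd.crosstab(students["State"], students["Major"])

# Divide each row by its total so that every State adds up to 100%
percent = counts.div(counts.sum(axis=1), axis=0) * 100
print(percent.round(1))

# Every bar now has the same height, so only the proportions differ
ax = percent.plot.bar(stacked=True, rot=0, figsize=(6, 4))
ax.set_ylabel("Percent of students")
```

As a shortcut, pd.crosstab() also accepts normalize="index", which returns the row proportions (as fractions between 0 and 1) directly.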

In the example, we can see some differences among the states, especially in NY and WA. However, these two have very few students (6) compared to the other four, so it is not conclusive whether they are really different. Overall, I would say these two columns are not correlated.

Wrapping up

In this post, we have discussed a few visualization techniques for analyzing correlations that involve categorical columns. With this, we have covered correlation analysis pretty well. So, in the next post, we will move on to common issues in exploratory analysis. See you then!