Previously, we learned to analyze the distributions of numeric and categorical columns. However, those techniques focus on one column at a time. Exploratory analysis has a second type of task, correlation analysis, which determines associations between pairs of features. That is what we will do today. We will go through tools that help us describe the correlation between two features, starting with numeric vs. numeric. For this type, there are two main tools: a correlation coefficient and a scatter plot. So, let us start!
Data for demonstration
I will use students1000.csv for demonstration in this post. Loading it works exactly as before. You can also download the complete notebook here.
import pandas as pd
students = pd.read_csv('students1000.csv')
students.head(n=3)
|   | StudentID | FirstName | LastName | Major | HighSchoolGPA | FamilyIncome | State | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202303595 | Baxter | Dengler | Computer Science | 2.82 | 45013 | WA | 2.01 | 14.0 | 1.93 |
| 1 | 202309162 | Christian | Wickey | Data Science | 3.07 | 128358 | GA | 5.41 | NaN | 2.76 |
| 2 | 202306337 | Lonnie | Wulff | Software Engineering | 2.68 | 112392 | GA | 9.57 | 13.0 | 3.09 |
students['StudentID'] = students['StudentID'].astype('object')
students['StudentID'].dtype
dtype('O')
Scatter plots
A scatter plot is a visualization tool for correlation analysis. It draws every instance in the data as a point on a coordinate system defined by two numeric columns. To make this more concrete, let us examine the scatter plot below. Here, the coordinate system has GPA on the vertical axis and Daily Study Time on the horizontal axis. Each dot in the plot represents one instance in the data. We can project any point onto the horizontal axis to get its Daily Study Time value, and onto the vertical axis to get its GPA value.
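A plot of this kind can be drawn directly from a dataframe. The sketch below is only an illustration: it uses a tiny made-up dataframe named demo (hypothetical values, not rows from students1000.csv) and the Pandas plot.scatter() method, with study time on the horizontal axis and GPA on the vertical axis.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the students data: five made-up rows
demo = pd.DataFrame({
    "AvgDailyStudyTime": [1.5, 3.0, 4.5, 6.0, 8.0],
    "FirstYearGPA": [1.9, 2.4, 2.8, 3.1, 3.6],
})

# Horizontal axis: study time; vertical axis: GPA; one dot per row
ax = demo.plot.scatter(x="AvgDailyStudyTime", y="FirstYearGPA")
plt.show()
```

With the real data, the same call on students should work as long as both column names exist in the dataframe.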
So how is this related to correlations? The pattern of the points in a scatter plot reflects the correlation between the two columns on the axes. Reading the patterns is quite easy:
– If you see a clear pattern in the points, like a line, a curve, a zigzag, etc., the two columns have some correlation. The more defined the pattern, the stronger the correlation.
– A pattern running from the bottom-left corner to the top-right corner means a positive correlation
– A pattern running from the top-left corner to the bottom-right corner means a negative correlation
– If the points are spread out and seem random, the correlation is very weak or nonexistent. This also includes patterns that run straight top-down or straight left-right.
In the three plots below, in order from left to right, we have 1) a strong positive correlation, 2) a weak negative correlation, and 3) no correlation between the two columns.
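To experiment with these three patterns, one option is to generate synthetic data. The following is a minimal sketch, not the code behind the figures in this post: the slopes, noise levels, and the seed are arbitrary assumptions chosen only to make the three cases visibly different.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed so the plots are reproducible
x = rng.uniform(0, 10, 200)

# Three made-up y columns with different relationships to x
patterns = {
    "strong positive": 2 * x + rng.normal(0, 1, 200),
    "weak negative": -0.5 * x + rng.normal(0, 4, 200),
    "no correlation": rng.normal(0, 4, 200),
}

# One panel per pattern, side by side
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (title, y) in zip(axes, patterns.items()):
    ax.scatter(x, y, s=10)
    ax.set_title(title)
plt.show()
```

Changing the noise level (the second argument of rng.normal) is a quick way to see how a pattern becomes more or less defined.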
Scatter plots in Pandas
As with correlation coefficients, in Pandas we can use a single function to obtain the scatter plots for all pairs of numeric columns in a dataframe. Specifically, we will use the Pandas function plotting.scatter_matrix(). The main input to this function is a dataframe. The output is like a correlation matrix, but each cell is a scatter plot of the two features, and the diagonal entries are their histograms. Furthermore, features on the rows are always the vertical axes, and features on the columns are always the horizontal axes.
Now let us draw a scatter matrix for the students data. Besides the input dataframe, I add a figsize option to control the size of the complete figure. In the plots, we can clearly see the following:
– HighSchoolGPA has a moderate correlation with AvgDailyStudyTime, TotalAbsence, and FirstYearGPA
– AvgDailyStudyTime correlates weakly with TotalAbsence and strongly with FirstYearGPA
– TotalAbsence has a weak correlation with FirstYearGPA
– FamilyIncome seems uncorrelated with all other columns
import matplotlib.pyplot as plt
# scatter_matrix creates its own figure, so no plt.figure() call is needed beforehand
pd.plotting.scatter_matrix(students, figsize=(10,10))
plt.show()
Linear and non-linear correlations
An important concept in correlation analysis is linearity. I will not go into the mathematical details at this moment, but rather describe it using scatter plots. In short, as long as you see a defined pattern in a scatter plot, the two columns are correlated. If the pattern strongly resembles a straight line, you have a linear correlation; if it looks more like a curve, the correlation is nonlinear. Below, from left to right, we have a linear correlation and a nonlinear correlation. Furthermore, the narrower the pattern, the stronger the correlation.
Why is this important? Because, as we will see later on, there is a whole family of predictive models that operate on linear correlations in your data. By the way, the stronger the linear correlation between two columns c1 and c2, the more accurately the equation x1 = a × x2 + b holds, with x1 and x2 being the values of c1 and c2 in the same row, and a and b being two real numbers.
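The link between correlation strength and the accuracy of x1 = a × x2 + b can be checked numerically. The sketch below is an illustration under stated assumptions: the two c1 columns are synthetic, the helper line_fit_error is a hypothetical name introduced here, and a and b are estimated with NumPy's least-squares polyfit, which is one reasonable way to pick them rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility
c2 = rng.uniform(0, 10, 300)

# Two made-up columns built from c2: one with little noise, one with a lot
c1_strong = 0.3 * c2 + 1.0 + rng.normal(0, 0.2, 300)
c1_weak = 0.3 * c2 + 1.0 + rng.normal(0, 2.0, 300)

def line_fit_error(x1, x2):
    """Fit x1 = a*x2 + b by least squares; return (a, b, mean absolute error)."""
    a, b = np.polyfit(x2, x1, deg=1)  # coefficients come back highest power first
    error = np.mean(np.abs(x1 - (a * x2 + b)))
    return a, b, error

a_s, b_s, err_strong = line_fit_error(c1_strong, c2)
a_w, b_w, err_weak = line_fit_error(c1_weak, c2)
print(f"strong: a={a_s:.2f}, b={b_s:.2f}, error={err_strong:.2f}")
print(f"weak:   a={a_w:.2f}, b={b_w:.2f}, error={err_weak:.2f}")
```

Both columns follow the same underlying line, but the equation predicts the low-noise column much more closely, which is what a stronger linear correlation means in practice.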
Correlation coefficient
The Pearson correlation coefficient measures the linear association between two numeric columns, i.e., how changes in the values of one column relate to changes in the other. We will set the linear part aside for now and return to it when discussing linear regression. The correlation coefficient is denoted by the Greek letter rho, ρ. I will not show the equation for ρ here since Pandas will calculate it for us. ρ always lies between -1 and 1, and
– The closer ρ is to -1, the stronger the negative linear correlation between the two columns. Roughly speaking, higher values in column 1 tend to go with lower values in column 2. For example, the more lectures a student misses, the more likely they are to have a lower GPA.
– The closer ρ is to 1, the stronger the positive linear correlation between the two columns. As before, higher values in column 1 tend to go with higher values in column 2. For example, the more time a student spends studying, the more likely they are to have a higher GPA.
– The closer ρ is to 0, the weaker the linear correlation between the two columns. Changes in one column tell us little about the other. For example, a student’s number of siblings likely has no relationship with their GPA at all.
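For readers curious about what Pandas computes, here is a minimal sketch of the standard Pearson formula (the covariance of the two columns divided by the product of their standard deviations), checked against the Pandas Series.corr() method. The helper name pearson and the two small series are made up for illustration and are not taken from students1000.csv.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson coefficient: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()  # deviations from each column's mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up values: more absences go with lower GPA
absences = pd.Series([2, 5, 8, 11, 14])
gpa = pd.Series([3.6, 3.4, 2.9, 2.5, 2.1])

print(pearson(absences, gpa))  # negative and close to -1
print(absences.corr(gpa))      # Pandas should agree up to floating-point error
```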
Correlation coefficients in Pandas
In Pandas, we call corr() on a dataframe to obtain a correlation matrix of all numeric columns. A demonstration with the students data is shown below. The table is easy to read: the number at the intersection of a row and a column is the correlation coefficient of the two features in the headers. For example, the correlation between HighSchoolGPA and FamilyIncome is 0.019, so these two do not seem to correlate. The correlation between FirstYearGPA and AvgDailyStudyTime is 0.89, meaning they correlate very strongly.
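# numeric_only=True skips text columns such as Major; Pandas 2.0 and later raise an error without it
students.corr(numeric_only=True)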
|   | HighSchoolGPA | FamilyIncome | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|
| HighSchoolGPA | 1.000 | 0.019 | 0.434 | -0.520 | 0.492 |
| FamilyIncome | 0.019 | 1.000 | 0.000 | -0.009 | 0.017 |
| AvgDailyStudyTime | 0.434 | 0.000 | 1.000 | -0.190 | 0.893 |
| TotalAbsence | -0.520 | -0.009 | -0.190 | 1.000 | -0.313 |
| FirstYearGPA | 0.492 | 0.016 | 0.893 | -0.313 | 1.000 |
You can compare the correlation matrix with the scatter matrix above to get a feel for how correlation coefficients relate to the scatter patterns.
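When only one pair or one target column matters, the full matrix can be narrowed down. The sketch below assumes a small made-up dataframe named demo (hypothetical values that only mimic the students columns) and shows two options: Series.corr() for a single pair, and sorting one column of the matrix by absolute value. The numeric_only argument needs Pandas 1.5 or later.

```python
import pandas as pd

# Made-up stand-in for the students data, including one text column
demo = pd.DataFrame({
    "Major": ["CS", "DS", "SE", "CS", "DS", "SE"],
    "AvgDailyStudyTime": [2.0, 5.4, 9.6, 3.1, 6.8, 4.2],
    "TotalAbsence": [14, 9, 3, 12, 6, 10],
    "FirstYearGPA": [1.9, 2.8, 3.7, 2.2, 3.1, 2.5],
})

# Coefficient for one pair of columns
pair = demo["AvgDailyStudyTime"].corr(demo["FirstYearGPA"])

# FirstYearGPA's coefficient with every other numeric column, strongest first
matrix = demo.corr(numeric_only=True)
ranked = (
    matrix["FirstYearGPA"]
    .drop("FirstYearGPA")                    # remove the trivial self-correlation
    .sort_values(key=abs, ascending=False)   # rank by strength, ignoring sign
)
print(pair)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the bottom of the list.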
Wrapping up
In this post, we have discussed two tools for correlation analysis between two numeric columns: the Pearson correlation coefficient and the scatter plot. While the coefficient is useful at times, we may not need it in a lot of analyses. Scatter plots, on the other hand, are always good to look at. You can find some really interesting relationships just by looking at them. Up next, we will discuss correlation analysis between numeric and categorical columns.