I do not fully agree with the saying “the end justifies the mean”. The sum also justifies the mean, not just the n…
Yes, it is a dad joke, the kind you get when the dad is a statistician. I am just bringing up the topic of this post: descriptive statistics. Now that we have a basic understanding of Python programming, it is time to work with some data. We will start with the measurements that describe your data.
Mean
Now we will officially talk about the mean, probably the most common descriptive statistic. The mean is a measurement of central tendency, meaning that it describes or measures the center of your data. Does that sound a bit too abstract? Stripped of the fancy terms, the mean is simply the average of something. Given a column of data, the mean is the sum of all values divided by the number of values.
For example, if you have a small data set like the one below, the mean of Age is (21+40+22+32+35)/5 = 30. By the way, we usually denote the number of values as n, hence the joke at the beginning.
Employee ID | Age | Department | Salary |
12000321 | 21 | Data Science | 85000 |
10003512 | 40 | Data Science | 450000 |
12162135 | 22 | Information Technology | 110000 |
11151323 | 32 | Information Technology | 121000 |
12311494 | 35 | Information Technology | 103000 |
We very commonly use the mean to see where the center of some data is. However, in certain cases, the mean alone is not enough and can be misleading. To illustrate, let us calculate the mean salary of the five employees: (85000+450000+110000+121000+103000)/5 = 173800. Do we really expect that an average employee in this company makes $173,800 per year? Not exactly: four of the five make far less than that. The average in this case is very high and misleading because one employee makes $450,000. We call such a value an extreme value, or an outlier. The mean is susceptible to outliers. To get a better measurement of central tendency in this situation, we use the median.
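The two calculations above can be reproduced in a few lines of Python. This is only a minimal sketch: the list names `ages` and `salaries` are my own, with values copied from the table, and `statistics.mean` comes from the Python standard library. The last line is an extra illustration, not part of the calculation above, showing how much the single $450,000 salary pulls the mean up.

```python
from statistics import mean

# Values copied from the Age and Salary columns of the employee table
ages = [21, 40, 22, 32, 35]
salaries = [85000, 450000, 110000, 121000, 103000]

# The mean by hand: sum of all values divided by n, the number of values
mean_age = sum(ages) / len(ages)

# The same calculation using the standard library
mean_salary = mean(salaries)

# Illustration only: the mean salary when the $450,000 outlier is left out
mean_without_outlier = mean([s for s in salaries if s != 450000])

print("Mean age:", mean_age)
print("Mean salary:", mean_salary)
print("Mean salary without the outlier:", mean_without_outlier)
```

Leaving out one value drops the mean salary from 173800 to 104750, which is what "susceptible to outliers" looks like in practice.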
Median
The median is also a measurement of central tendency, but it is calculated very differently from the mean. It is the middle value in the data. In other words, we first sort the values in a column (in either ascending or descending order). After sorting, the median lies exactly in the middle. When the number of values is even, there is no single middle value, so the median is taken as the average of the two middle values.
Returning to the small data set from earlier, to find the median of Age, we first sort the values. So, we have [21, 22, 32, 35, 40]. The middle point in this list is 32, so the median of Age is 32. Now onto Salary, where we do the same thing. The sorted salary is [85000, 103000, 110000, 121000, 450000], so the median salary in this company is 110000. We can now conclude that the typical employee in this company makes $110,000 per year, which makes a lot more sense than the figure we got from the mean.
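Unlike the mean, the median is barely affected by outliers: an extreme value only counts as one position in the sorted list, no matter how extreme it is. So for data with many outliers, we prefer the median as the measurement of central tendency. In roughly symmetric data without (or with very few) extreme values, we expect the mean and median to be similar.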
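The sort-then-take-the-middle procedure can be sketched in Python as follows. The helper `middle_value` and the list names are hypothetical names of mine; `statistics.median` is the standard-library function that does the same job. The branch for an even number of values averages the two middle values, which is the usual convention.

```python
from statistics import median

# Values copied from the Age and Salary columns of the employee table
ages = [21, 40, 22, 32, 35]
salaries = [85000, 450000, 110000, 121000, 103000]


def middle_value(values):
    """Median by hand: sort the values, then pick the middle one."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: there is exactly one middle value
        return ordered[mid]
    # Even number of values: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


median_age = middle_value(ages)
median_salary = middle_value(salaries)

# Illustration only: make the outlier ten times larger and recompute
exaggerated = [85000, 4500000, 110000, 121000, 103000]
median_exaggerated = median(exaggerated)

print("Median age:", median_age)
print("Median salary:", median_salary)
print("Median salary with a larger outlier:", median_exaggerated)
```

Making the outlier ten times larger leaves the median salary at 110000, whereas the mean would move a great deal.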
Mode
The mode is the third measurement of central tendency that you may see from time to time. It represents the most frequent value in your data. However, as you may have noticed, in the employee data all values in Age and Salary are unique: each appears exactly once. In such cases, the values are usually grouped into ranges, and the mode is taken from the range with the highest frequency.
I will not discuss modes for numeric data further since they are not that common. The reason I bring up the mode in this post is that it is the only measurement of central tendency for categorical data. Back to the employee data, we cannot calculate a mean or median for Department because its values are not numbers. The mode, however, is Information Technology because it has the highest count. Also note that Information Technology is not exactly the center of this column, since there is no meaningful way to compare it with other values like Data Science. It only represents the most frequent value in the column.
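Finding the mode of a categorical column amounts to counting how often each value appears. Here is a minimal sketch using `collections.Counter` from the Python standard library; the list names are my own, with values copied from the table.

```python
from collections import Counter

# Values copied from the Department and Age columns of the employee table
departments = [
    "Data Science",
    "Data Science",
    "Information Technology",
    "Information Technology",
    "Information Technology",
]
ages = [21, 40, 22, 32, 35]

# Count how many times each department appears
department_counts = Counter(departments)

# most_common(1) returns a list with the single most frequent (value, count) pair
mode_department, mode_count = department_counts.most_common(1)[0]

# For Age, every value appears once, so no single value stands out as the mode
highest_age_count = max(Counter(ages).values())

print("Mode of Department:", mode_department, "with count", mode_count)
print("Highest count of any single Age value:", highest_age_count)
```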
Standard Deviation and Variance
Now we move to a different concept: measurements of dispersion, which describe how spread out the data is around its center. Conceptually, the variance is the average squared deviation of the values in a column from their mean, and the standard deviation, which is the square root of the variance, can be read as their typical deviation from the mean. Mathematically, they are slightly different from the exact average, but we will not go too deep into that here. As categorical data does not have a numeric center, we cannot calculate a variance or standard deviation for it.
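To make this concrete, here is a small sketch that computes both quantities for the Age column from earlier. I am assuming that the "slightly different" remark refers to the common sample formulas, which divide by n - 1 instead of n; that reading is mine, not something stated above. In the Python standard library, `pvariance` and `pstdev` divide by n, while `variance` and `stdev` divide by n - 1.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance, stdev, variance

# Values copied from the Age column of the employee table
ages = [21, 40, 22, 32, 35]

center = mean(ages)

# Squared deviation of each value from the mean
squared_deviations = [(age - center) ** 2 for age in ages]

# Exact average of the squared deviations: divide by n
population_variance = sum(squared_deviations) / len(ages)
population_std = sqrt(population_variance)

# Sample version (assumed to be the "slightly different" formula): divide by n - 1
sample_variance = sum(squared_deviations) / (len(ages) - 1)
sample_std = sqrt(sample_variance)

print("Population variance and standard deviation:", population_variance, population_std)
print("Sample variance and standard deviation:", sample_variance, sample_std)

# The standard library gives the same numbers
print("statistics module:", pvariance(ages), pstdev(ages), variance(ages), stdev(ages))
```

The standard deviation is in the same unit as the data (years, for Age), which is why it is usually easier to interpret than the variance.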
Conclusion
This post introduced some descriptive statistics: measurements that help you better understand your data. However, as you may have noticed, these numbers alone are still not that intuitive. Usually, we use them in combination with visual representations of the data. So, we will discuss exploratory analysis and visualization in the next post.