Descriptive Statistics - Data Science from a Practical Perspective

I do not fully agree with the saying “the end justifies the mean”. The sum also justifies the mean, not just the n… Yes, it is a dad joke when the dad is a statistician… I am just bringing up the topic of this post: descriptive statistics. Now that we have a basic understanding of Python programming, it is time to work with some data. We will start by learning about the measurements that describe your data.

Mean

Now we will officially talk about the mean, probably the most common descriptive statistic. The mean is a measurement of central tendency, meaning that it describes or measures the center of your data. Does that sound a bit too abstract? Removing all those fancy terms, the mean is basically the average of something. Given a column of data, the mean is calculated as the sum of all values divided by the number of values.

For example, if you have a small data set like the one below, the mean of Age is (21+40+22+32+35)/5 = 30. By the way, we usually denote the number of values as n, hence the joke at the beginning.

Employee ID	Age	Department	Salary
12000321	21	Data Science	85000
10003512	40	Data Science	450000
12162135	22	Information Technology	110000
11151323	32	Information Technology	121000
12311494	35	Information Technology	103000
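
As a minimal sketch, the mean of Age from the table above can be computed in Python as follows. The variable names (ages, n, mean_age) are my own choices for illustration; the values come straight from the table.

```python
import statistics

# Age column from the employee table
ages = [21, 40, 22, 32, 35]

# Mean = sum of all values divided by the number of values (n)
n = len(ages)
mean_age = sum(ages) / n
print(mean_age)  # 30.0

# The standard library gives the same result
library_mean_age = statistics.mean(ages)
```

Both approaches agree; writing out the sum and the division just makes the definition visible.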

We very commonly use the mean to see where the center of some data is. However, in certain cases, the mean alone is not enough and can be misleading. To illustrate, let us calculate the mean salary of the five employees: (85000+450000+110000+121000+103000)/5 = 173800. Do we really expect that an average employee in this company makes $173,800 per year? Not exactly: four of the five make far less than that. The average in this case is very high and misleading because one employee makes $450,000. We call such a value an extreme value, or an outlier. The mean is susceptible to outliers. To get a better measurement of the center in this situation, we use the median.
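
To see how much a single outlier moves the mean, here is a small sketch that computes the mean salary with and without the $450,000 value. Dropping the outlier is done here only to illustrate its effect, not as a recommended way to clean data, and the variable names are assumptions of mine.

```python
# Salary column from the employee table
salaries = [85000, 450000, 110000, 121000, 103000]

# Mean of all five salaries
mean_all = sum(salaries) / len(salaries)
print(mean_all)  # 173800.0

# Mean of the four salaries that remain after leaving out the outlier
without_outlier = [s for s in salaries if s != 450000]
mean_without = sum(without_outlier) / len(without_outlier)
print(mean_without)  # 104750.0
```

One value out of five is enough to pull the mean up by almost $70,000.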

Median

The median is also a measurement of central tendency, but it is calculated very differently from the mean. It is the middle value in the data. In other words, we first sort the values in a column (in either ascending or descending order). After sorting, the median lies exactly in the middle.

Returning to the small data set from before, to find the median of Age, we first sort the values. So, we have [21, 22, 32, 35, 40]. The middle point in this list is 32, so the median of Age is 32. Now onto Salary, where we do the same thing. The sorted salaries are [85000, 103000, 110000, 121000, 450000], so the median salary in this company is 110000. We can now conclude that the typical employee in this company makes $110,000 per year, which makes a lot more sense than the figure we got from the mean.

Unlike the mean, the median is barely affected by outliers. So for data with many outliers, we prefer to use the median to measure the center. In data without (or with very few) extreme values, we expect the mean and median to be similar.
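
The sort-then-pick-the-middle procedure can be sketched as below. The function name median_by_sorting is hypothetical. The post only covers columns with an odd number of values; for an even number of values, this sketch assumes the common convention of averaging the two middle values.

```python
import statistics

def median_by_sorting(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle value
        return ordered[mid]
    # Even count (assumed convention): average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

ages = [21, 40, 22, 32, 35]
salaries = [85000, 450000, 110000, 121000, 103000]

median_age = median_by_sorting(ages)
median_salary = median_by_sorting(salaries)
print(median_age, median_salary)  # 32 110000

# statistics.median from the standard library should give the same answers
library_median_salary = statistics.median(salaries)
```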

Mode

The mode is the third measurement of central tendency that you may see from time to time. It represents the most frequent value in your data. However, as you may have noticed, in the employee data above all values in Age and Salary are unique: each appears exactly once. In this case, the values are grouped into ranges, and the mode is taken to be the range that contains the most values.

I will not discuss modes for numeric data further since they are not that common. The reason I bring up the mode in this post is that it is the only measurement of central tendency for categorical data. Back to the employee data, we cannot calculate a mean or median for Department because its values are not numbers. The mode, however, is Information Technology because it has the highest count. Also note that Information Technology is not exactly the center of this column, since there is no meaningful way to order it against other values like Data Science. It only represents the most frequent value in the column.
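
Finding the mode of a categorical column is just counting. A minimal sketch using the Department values from the table, with variable names of my own choosing:

```python
from collections import Counter
import statistics

# Department column from the employee table
departments = [
    "Data Science",
    "Data Science",
    "Information Technology",
    "Information Technology",
    "Information Technology",
]

# Count how often each department appears
counts = Counter(departments)

# The mode is the value with the highest count
mode_department, mode_count = counts.most_common(1)[0]
print(mode_department, mode_count)  # Information Technology 3

# statistics.mode works on non-numeric data as well
library_mode = statistics.mode(departments)
```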

Standard Deviation and Variance

Now we move to a different concept, measurements of dispersion, which describe how spread out the data is around the center. Conceptually, the variance is the average squared deviation of the values in a column from their mean, and the standard deviation is the square root of the variance, which you can read as the typical distance of a value from the mean. Mathematically, the usual formulas differ slightly from an exact average, but we will not go too deep into that here. Because categorical data does not have a numeric center, we cannot calculate a variance or standard deviation for it.
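
As a rough sketch, here is the variance of Age computed literally as the average squared deviation from the mean, next to the functions in Python's statistics module. My assumption is that the "slightly different" formula mentioned above is the sample version, which divides by n - 1 instead of n; the post itself does not spell this out. Variable names are illustrative.

```python
import math
import statistics

ages = [21, 40, 22, 32, 35]
n = len(ages)
mean_age = sum(ages) / n

# Squared deviation of each value from the mean
squared_deviations = [(x - mean_age) ** 2 for x in ages]

# Exact average of the squared deviations (divide by n)
variance_n = sum(squared_deviations) / n
print(variance_n)  # 54.8

# Assumed "slightly different" version: divide by n - 1
variance_n_minus_1 = sum(squared_deviations) / (n - 1)
print(variance_n_minus_1)  # 68.5

# The standard deviation is the square root of the variance
std_n = math.sqrt(variance_n)
std_n_minus_1 = math.sqrt(variance_n_minus_1)

# Standard library equivalents:
#   pvariance / pstdev divide by n, variance / stdev divide by n - 1
library_variance_n = statistics.pvariance(ages)
library_variance_n_minus_1 = statistics.variance(ages)
library_std_n_minus_1 = statistics.stdev(ages)
```

Because the standard deviation is in the same unit as the data (years, for Age), it is usually the easier of the two numbers to interpret.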

Conclusion

This post introduced some descriptive statistics: measurements that help you better understand your data. However, as you may have noticed, these numbers alone are still not that intuitive. Usually, we use them in combination with visual representations of the data. So, we will discuss exploratory analysis and visualization in the next post.
