Handle Skewed Data

An illustration of log and Box-Cox transformations to handle skewed data

A while ago, we discussed the distributions of numeric columns. Depending on the type of analysis, a symmetrical distribution is sometimes preferred over a skewed one. So, in this post, I will go through a few ways to handle skewed data and make it symmetric. Now, let us not wait any longer and start!

Symmetric and skewed distributions

To be exact, most of the time we do not just want a symmetric distribution, but rather a normal distribution, or Gaussian distribution. However, defining what a normal distribution is would go beyond the practical scope that I set for this blog, so let us just call them symmetric distributions. To roughly assess symmetry, we can look at data histograms. In a symmetric distribution, the majority of the data should lie in the middle with one half on each side, and the two tails should have similar lengths while not being too long. In contrast, a skewed distribution has the majority of the data leaning to one side, with one tail being longer. From left to right below, we have a symmetric and a skewed distribution. Of course, we cannot expect perfect symmetry in real-world data, but a few flaws here and there are acceptable.

And why do we want symmetry? In brief, a lot of analytical models rely on your data being normal, and the closer a distribution is to normality, the better they behave. But then again, these are usually statistical models. If we work more with machine learning models, these assumptions matter less. Regardless, models may find it easier to learn when data distributions are not too extreme, so a bit of transformation to handle skewed data will not hurt.

Data to demonstrate

In this post, I will use the students-skewed.csv data, which has four skewed columns. The complete Jupyter notebook is available here, and you can download the data below. Back in Python, as usual, we start by importing the necessary libraries and reading the data into a Pandas dataframe. Next, we examine the descriptive statistics and histograms to verify that all four columns are fairly to very skewed. So, next, we will discuss two ways to address this.
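Since the CSV file itself is not embedded in this post, here is a minimal sketch of the loading-and-inspection step. The synthetic dataframe below (with made-up values) stands in for students-skewed.csv; with the real file you would simply call pd.read_csv() instead:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for students-skewed.csv (hypothetical values).
# With the real file: df = pd.read_csv("students-skewed.csv")
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "FamilyIncome":  rng.lognormal(mean=10.5, sigma=0.8, size=n),            # right-skewed
    "TutorSessions": rng.poisson(lam=2, size=n).astype(float),               # right-skewed
    "AccumCredit":   rng.exponential(scale=30, size=n) + 10,                 # right-skewed
    "GPA":           4.0 - rng.exponential(scale=0.4, size=n).clip(0, 3.9),  # left-skewed
})

# Descriptive statistics and skewness; values far from 0 indicate skewness.
print(df.describe())
print(df.skew())
```

The histograms mentioned above can be drawn with df.hist() to confirm visually what the skewness numbers say.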

Log transformation

A log (natural logarithm) transformation is an easy way to handle skewed data, and it works quite well a lot of the time. What do we do here? Well, we apply the logarithm function to the data, and that is it! In Python, we simply use numpy.log() and store the new data anywhere we like. As you can see, it takes one line of code to perform the transformation; the other two lines are for plotting. In terms of results, the log transformation is able to fix the skewness in FamilyIncome and TutorSessions. However, the log versions of AccumCredit and GPA are still very skewed. For such difficult data, we can use a Box-Cox transformation.
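As a sketch (again with a synthetic, made-up stand-in for the FamilyIncome column, since the real file is not included here), the transformation really is a single numpy.log() call:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for the FamilyIncome column (made-up values)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=500), name="FamilyIncome")

# The transformation itself is one line.
# If a column can contain zeros, np.log1p() is the safer choice, since log(0) is undefined.
log_income = np.log(income)

print(f"skewness before: {income.skew():.2f}")
print(f"skewness after:  {log_income.skew():.2f}")
```

To see the before/after histograms, plot income.hist() and log_income.hist() side by side.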

Box-Cox transformation

The Box-Cox transformation is a generalized version of the log transformation, with the formula below:

new\_value = \dfrac{old\_value^\lambda - 1}{\lambda}

with λ (the Greek letter lambda) being a parameter that can be optimized to get as close to normality as possible. When λ is 0, the Box-Cox transformation is defined as the log transformation. Note that Box-Cox only works on strictly positive data. In Python, we use the boxcox() function from the SciPy library. Do not worry if you have not installed it, though: SciPy is included in most scientific Python distributions. While the function can automatically choose lambda for us, it is a bit unfortunate that boxcox() only works on one-dimensional arrays, so we need to write a loop to apply the transformation to each column. An example on the students data is below. As you can see, all the distributions are now more or less symmetric, except for AccumCredit, which has still improved quite a bit. So, if you absolutely want to get as close to normality as possible, you can try this method.
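Here is a rough sketch of that loop, once more using a synthetic, made-up stand-in for the four columns (with the real data you would loop over the dataframe read from students-skewed.csv):

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

# Synthetic stand-in for the four skewed columns (made-up values)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "FamilyIncome":  rng.lognormal(mean=10.5, sigma=0.8, size=500),
    "TutorSessions": rng.poisson(lam=2, size=500) + 1.0,  # shifted by 1: Box-Cox needs strictly positive values
    "AccumCredit":   rng.exponential(scale=30, size=500) + 10,
    "GPA":           4.0 - rng.exponential(scale=0.4, size=500).clip(0, 3.9),
})

# boxcox() only accepts 1-D arrays, so we transform one column at a time
df_boxcox = pd.DataFrame(index=df.index)
for col in df.columns:
    df_boxcox[col], lam = boxcox(df[col])  # lambda is optimized automatically
    print(f"{col}: lambda = {lam:.2f}, "
          f"skewness {df[col].skew():.2f} -> {df_boxcox[col].skew():.2f}")
```

The printed lambdas are informative in themselves: a lambda near 0 means Box-Cox has effectively chosen a log transformation for that column.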

Conclusion

In this post, we have learned two common tools for transforming a skewed distribution into a more symmetric one. Again, the necessity of this transformation depends on the type of analysis that you are doing. It matters more in statistical analyses, whereas machine learning models are more flexible about distributions. Now, I will end this post here. See you again soon!