Categories are a big part of tabular data. You will see them more often than not; they are simply inevitable. However, many analytical models cannot handle categorical data by themselves. In those cases, you need to find ways to transform classes into meaningful numbers, and that is the topic of this post. We will discuss two different methods to encode categorical data.
Types of categorical data
We discussed the different types of categorical data quite some time ago. Let me quickly summarize them so we can move on. There are three major types: binary, ordinal, and nominal.
Binary data means that there are exactly two unique values in the column. Examples of this type are agree or disagree, good or bad, low or high, etc. The two values in binary categories are usually transformed into 0 and 1; the order does not matter. There are different ways to do this, but a convenient method is to process binary data with an ordinal encoder, as sketched below.
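For instance, here is a minimal sketch using SKLearn's OrdinalEncoder on a hypothetical binary column (the values and names are only for illustration):

from sklearn.preprocessing import OrdinalEncoder

# hypothetical binary column with two unique values
answers = [['agree'], ['disagree'], ['agree'], ['agree']]

# without explicit categories, the classes are sorted alphabetically,
# so 'agree' becomes 0 and 'disagree' becomes 1
encoder = OrdinalEncoder()
print(encoder.fit_transform(answers).ravel())  # [0. 1. 0. 0.]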
Ordinal data means categories that have inherent orders. More specifically, you can sort them in some comparative way. For example, [very bad, bad, okay, good, very good], [low, medium, high], or [very disagree, disagree, neutral, agree, very agree]. Since these categories have orders, we can transform them into numbers like 1, 2, 3... However, you should proceed with care, as we will discuss later in this post.
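As a quick illustration, an explicit mapping in pandas keeps the order intact (a minimal sketch with hypothetical values):

import pandas as pd

# hypothetical ordinal column
levels = pd.Series(['low', 'high', 'medium', 'low'])

# explicit mapping that preserves low < medium < high
order = {'low': 1, 'medium': 2, 'high': 3}
print(levels.map(order).tolist())  # [1, 3, 2, 1]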
Nominal data means categories that do not have inherent orders. Therefore, you cannot compare the categories or sort them (alphabetical order does not count!). Some examples are states, cities, and postal codes, where you cannot compare, say, New York to New Hampshire, or Boston to Seattle. One way of transforming nominal data to numbers is a technique called One Hot Encoder or Dummy Variables.
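As a quick preview, pandas has get_dummies() for this; below is a minimal sketch on a hypothetical city column (we will use the SKLearn version later in this post):

import pandas as pd

# hypothetical nominal column
cities = pd.Series(['Boston', 'Seattle', 'Boston'], name='City')

# one binary dummy column per unique value
print(pd.get_dummies(cities, prefix='City'))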
Data to demonstrate
In this post, I will use the sample-categorical-data.csv file. This is a small synthetic data set with all columns being categorical. The complete Jupyter notebook for this post is available here. To begin, we load the data, print its info(), and draw a bar chart for each column so we can observe their distributions. Based on the classes of each column, we have a binary column, Area; ordinal columns, AgeGroup and Opinion; and nominal ones, State and EmploymentSector. Also note that State and EmploymentSector have some rare values.
import pandas as pd
import matplotlib.pyplot as plt

# load the sample data and check its columns
data = pd.read_csv('sample-categorical-data.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   AgeGroup          33 non-null     object
 1   State             33 non-null     object
 2   Area              33 non-null     object
 3   EmploymentSector  33 non-null     object
 4   Opinion           33 non-null     object
dtypes: object(5)
memory usage: 1.4+ KB
# plot the class distribution of each column
for col in data.columns:
    print(col)
    data[col].value_counts().plot.bar(rot=30, figsize=(4,4))
    plt.show()
[Bar charts of value counts for the AgeGroup, State, Area, EmploymentSector, and Opinion columns]
One hot encoder
We can apply this method to encode categorical data of any type, although with some caveats. First, the encoder will create a new binary column for each distinct value in the original categorical column. For example, if a class column has three unique values, say (low, medium, high), the one hot codes will have three new binary columns. Among these three, the one corresponding to the original class value is set to 1, and the others to 0.
Below is an illustrative example. The State column has three unique values: New York, Massachusetts, and New Hampshire. Its one hot codes would then include three new binary columns: is_NY corresponding to New York, is_MA for Massachusetts, and is_NH for New Hampshire. Now, for each row, the one hot field corresponding to the original class is set to 1, and the rest to 0. Specifically, the first row has State as New York, so its is_NY is 1, and both is_MA and is_NH are 0; the third row is Massachusetts, so its is_MA is 1 and the rest are 0; and so on. Also, the last column can be omitted because we know the state is New Hampshire if both is_NY and is_MA are 0.
| State | is_NY | is_MA | is_NH |
|---|---|---|---|
| New York | 1 | 0 | 0 |
| New York | 1 | 0 | 0 |
| Massachusetts | 0 | 1 | 0 |
| New Hampshire | 0 | 0 | 1 |
| Massachusetts | 0 | 1 | 0 |
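If you want the encoder to do this omission for you, SKLearn's OneHotEncoder accepts drop='first', which removes one redundant column per feature (note that it drops the first category in sorted order rather than the last, but the idea is the same). A minimal sketch on the example above:

from sklearn.preprocessing import OneHotEncoder

states = [['New York'], ['New York'], ['Massachusetts'],
          ['New Hampshire'], ['Massachusetts']]

# drop='first' removes one redundant column per feature
encoder = OneHotEncoder(drop='first')
print(encoder.fit_transform(states).toarray())
print(encoder.get_feature_names_out())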
The sparsity problem
One potential problem with one hot encoder is its susceptibility to a large number of classes. As you can see, this method creates one new column for each unique class in the data, so thousands of classes mean thousands of new columns. Furthermore, each row will only have a single 1 and the rest are 0. So, your data can get very big, yet the majority of the values are 0 and just a small proportion are 1. We call this very sparse data, and it is particularly bad because most of what you store carries no information.
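To get a sense of the scale, below is a minimal sketch with hypothetical sizes (100,000 rows and 5,000 classes) comparing dense storage against a sparse representation that only stores the nonzero entries:

import numpy as np
from scipy import sparse

# hypothetical one hot codes: 100,000 rows over 5,000 classes
n_rows, n_classes = 100_000, 5_000
rng = np.random.default_rng(0)
cols = rng.integers(0, n_classes, size=n_rows)

# each row has a single 1; a sparse matrix only stores those entries
codes = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), cols)),
    shape=(n_rows, n_classes))

dense_mb = n_rows * n_classes * 8 / 1e6   # bytes if stored as dense float64
sparse_mb = (codes.data.nbytes + codes.indices.nbytes + codes.indptr.nbytes) / 1e6
print(f'{dense_mb:.0f} MB dense vs {sparse_mb:.1f} MB sparse')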
So, I would recommend using one hot encoder only if you have at most a few hundred classes. What if you have more? There are a few ways around this, but the easiest is to keep only the frequent classes and collapse everything else into a single other class. For example, you can set a filter so that only classes appearing in more than 1%, 5%, or 10% of rows count as frequent.
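Here is a minimal sketch of doing this manually on the State column (the SKLearn encoder below will handle this for us via min_frequency, so I only work on a copy here):

# collapse classes under a 10% frequency threshold into 'other'
state = data['State'].copy()
freq = state.value_counts(normalize=True)   # class frequencies as proportions
rare = freq[freq < 0.1].index               # classes below the threshold
state = state.mask(state.isin(rare), 'other')
print(state.value_counts())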
In Python
We can use either Pandas or SKLearn for one hot encoder. However, I will use SKLearn since it is easier to incorporate into a pipeline, which we will discuss in the future. In this case, we will use the model class OneHotEncoder. However, there are a few things we have to set up first. We need to create a list of columns on which to perform one hot encoding. We could use it on ordinal data just fine, but I will only include State and EmploymentSector and leave the rest for later. Next, we set a frequency threshold to define rare classes, which I call infreq_threshold in the code; 0.1 means only keeping classes that appear in over 10% of the data.
Now, we can create a one hot encoder. We add two options here. First, min_frequency defines rare classes for the encoder; it is the actual row count corresponding to 10% of the data, calculated from infreq_threshold. Second, handle_unknown is set to infrequent_if_exist so that the rare classes form a new column. Finally, we call fit_transform() to train and apply the encoder. As you can see, the output is just binary data. Each row has exactly two values of 1 because we only transformed two columns.
from sklearn.preprocessing import OneHotEncoder

# nominal columns to encode and the frequency threshold for rare classes
nom_cols = ['State', 'EmploymentSector']
infreq_threshold = 0.1

# classes appearing in under 10% of rows are grouped as infrequent
oh_encoder = OneHotEncoder(min_frequency=int(len(data) * infreq_threshold),
                           handle_unknown='infrequent_if_exist')
oh_codes = oh_encoder.fit_transform(data[nom_cols])
oh_codes.todense()
matrix([[0., 0., 1., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 0., 1., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0.],
        ...
        [0., 0., 0., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1., 0., 1., 0., 0.]])
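If you want readable column names, the fitted encoder can report them via get_feature_names_out(); for example, we can wrap the codes in a DataFrame:

import pandas as pd

# label the one hot columns with their original classes
oh_df = pd.DataFrame(oh_codes.toarray(),
                     columns=oh_encoder.get_feature_names_out(nom_cols))
print(oh_df.head())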
Encoding ordinal data
As I mentioned above, we can totally use one hot encoder to encode categorical data of any type. However, in certain cases, we can instead transform the classes into integer numbers like 0, 1, 2… to keep the order information. In SKLearn, we use the OrdinalEncoder model class.
Like with one hot encoder, we first create a list of columns on which to perform ordinal encoding. We include the binary Area here because we want it encoded as 0 and 1. Next, we need to provide a list of ordered values for each column. In each ordered list, the first class becomes 0, the second class 1, and so on. Without this, OrdinalEncoder will sort your classes alphabetically, so beware! Finally, we create the encoder with the option categories set to the ordered lists, then train and apply it with fit_transform(). As you can see, we have exactly three new columns, and the values in them are numbers 0, 1, 2... corresponding to the values in ord_classes.
from sklearn.preprocessing import OrdinalEncoder

# ordinal columns and their class values in ascending order
ord_cols = ['AgeGroup', 'Area', 'Opinion']
ord_classes = [
    ['20 to 40', '40 to 60', 'over 60'],
    ['Rural', 'Urban'],
    ['Very Disagree', 'Disagree', 'Neutral', 'Agree', 'Very Agree']
]

# each class is encoded as its position in the ordered list
ord_encoder = OrdinalEncoder(categories=ord_classes)
ord_codes = ord_encoder.fit_transform(data[ord_cols])
ord_codes
array([[1., 0., 2.],
       [1., 0., 3.],
       [0., 0., 1.],
       [2., 0., 4.],
       ...
       [0., 1., 2.],
       [0., 1., 0.]])
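A nice property of the fitted encoder is that it can reverse the transformation with inverse_transform(), which is handy for sanity checks:

# recover the original class labels from the first few coded rows
print(ord_encoder.inverse_transform(ord_codes[:3]))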
Now, be very careful when you use this method on actual ordinal data. The thing is, while you can sort these classes, they may still not be actual numbers. For example, Disagree < Neutral < Agree, but does Neutral - Disagree equal Agree - Neutral? Because we imply that relationship when converting them to 1, 2, and 3. So, you need to carefully consider the meaning of the classes before attempting this method.
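If equal spacing does not reflect the meaning of your classes, one option is to assign custom scores by hand. Below is a minimal sketch with hypothetical scores that push the extreme opinions further from the middle:

# hypothetical custom scores with unequal spacing between classes
scores = {'Very Disagree': -3, 'Disagree': -1, 'Neutral': 0,
          'Agree': 1, 'Very Agree': 3}
opinion_scores = data['Opinion'].map(scores)
print(opinion_scores.head())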
Conclusion
In this post, we discussed how to encode categorical data. While it is straightforward most of the time, you still need to be careful, for example, when your data have too many classes, or when the ordinal classes are ambiguous as numbers. In general, one hot encoder with controls on infrequent classes is pretty safe. There are also more advanced ways to transform categorical data, but we will discuss them after getting to regression and classification. For now, happy encoding, and see you again!