Text Data in Pandas

So far, we have only been discussing operations with numbers, so you may start wondering if we would ever talk about text data, right? Sure, why don’t we do that now? The reason I have not been mentioning text data is because, in tabular data, they are usually not as complicated and mostly treated as categorical. Regardless, we should learn about text data in Pandas in case we ever need it. Plus, Pandas provides a big module just for text processing, so it will be a waste not to use. Now, what are we waiting for?

Pandas type for text data

Before we start, you can download the complete notebook here. By default, Pandas read columns having texts into dataframes as object. We can simply try importing any data sets that we have been using and check their types with info(), for example, the students_standing.csv data. Notice how FirstName, LastName, and Standing have their Dtype listed as object.

However, the official Pandas type for text is string, not object. The reason is that object is a general type that can be anything, and string is specifically for text data. While we can apply all string operations on object data, string is still preferable per the Pandas documentation. Plus, casting is easy enough with astype(): dataframe[column] = dataframe[column].astype('string'), so let us do that anyway. In the cell below, first, I create a list to_string_cols to contain all columns to be casted. Note that I include StudentID here because it is not an actual numeric column as we discussed earlier.

Now, we should be ready to start working with strings!

Simple operations with text data

One of the easiest things we can do with strings is to combine them with the + operator. The result of adding strings is a new one with their combined contents, for example, 'hello' + ' ' + 'world' is 'hello world'. For demonstration, I will create a FullName column as FirstName + ' ' + LastName:

Besides adding, you can actually multiply strings. However, this interaction is only between strings and integer numbers. It simply repeats the string for the number of times to which they multiplied. For example, 'hello' * 3 becomes 'hellohellohello'. One usage of this operator is to pad the strings with some number of characters, however, we will return to this in a bit since it requires the use of the str property.

The str property

Pandas provide a big toolset for text data through the str property that can be called from a string column. To begin, str allows text values to be indexed – we can access their characters using a list-like slicing method. For example, I can get the first four letters of the students’ FirstName:

My organization actually creates employees’ emails by combining the first letter in their first names, first five letters in their last names, and the last five of the IDs. So, let us try replicate that. By the way, we can split a long line of codes into multiple lines using “\” like in my example below.

str also allows accesses to numerous string functions. Some useful ones are
lower() and upper() that change all characters to lowercase or uppercase
removeprefix() and removesuffix() to remove prefixes or suffixes from texts if applicable
len() gives the number of characters in the strings
… and many more. You can access the full list of functions at the Pandas documentation.

Split strings into columns

It occurs quite often that you have data with multiple categories consolidated into one single column. For instances, a movie data set may have a genre column with values like comedy,romance, romance,action,thriller, and so on. In such situation, we most likely have to split the values into their own columns for further processing. Fortunately, Pandas know exactly this, and provides us with a split() function! split() takes the delimiter character as an input, and we also need to set expand=True to have new columns (otherwise the result ends up as a list). For demonstration, let me split FullName using space ' ', since we do not have a genre column for students.

Simple search in text data

Occasionally, we may need to perform pattern searching in a text column. For examples, find customers’ reviews that contain terms like “good”, “bad”, “average”, “excellent”, etc. While pattern searching by itself can be a good topic that last several posts, we can start with some simple search with the str.contains() function. contains() takes an input pattern that represents the term you want to search in your data. The result is a boolearn column with True for a match found, and False otherwise. Like any boolean arrays, we can use the result of contains to slice the dataframe. For example, I will find all students that have the pattern ‘er’ in their full names:

A search may sometimes come with a replace. Accordingly, the str property gives us a replace() function as well. Its syntax is replace(old_value, new_value), with old_value being the pattern to be replaced in the inputs, and new_value being the replacing pattern. So let me try changing all the 'er' in students full names to 'or':

Conclusion

Text data has a lot going for it. In this post, I have just introduced some basics on working with text with Pandas. Hopefully, it helps you somewhat to form an idea on the data type. Please note that this post is only on string fields in tabular data, not the type of analysis that focuses on text that we will also discuss in a bit. So, until next time!