So far, we have only been discussing operations with numbers, so you may start wondering if we would ever talk about text data, right? Sure, why don’t we do that now? The reason I have not been mentioning text data is because, in tabular data, they are usually not as complicated and mostly treated as categorical. Regardless, we should learn about text data in Pandas in case we ever need it. Plus, Pandas provides a big module just for text processing, so it will be a waste not to use. Now, what are we waiting for?

Pandas type for text data

Before we start, you can download the complete notebook here. By default, Pandas read columns having texts into dataframes as object. We can simply try importing any data sets that we have been using and check their types with info(), for example, the students_standing.csv data. Notice how FirstName, LastName, and Standing have their Dtype listed as object.

students_standing.csv Download

In [46]:

import pandas as pd

students = pd.read_csv('students_standing.csv')
students.head(n=3)

Out[46]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing
0	202005537	Eunice	Ehmann	2.47	2.42	average
1	202008560	Hobert	Schoenberger	2.27	2.05	average
2	202004948	Nicholas	Sizer	4.00	3.96	good

In [47]:

students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   StudentID  200 non-null    int64  
 1   FirstName  200 non-null    object 
 2   LastName   200 non-null    object 
 3   HSGPA      200 non-null    float64
 4   FYGPA      200 non-null    float64
 5   Standing   200 non-null    object 
dtypes: float64(2), int64(1), object(3)
memory usage: 9.5+ KB

However, the official Pandas type for text is string, not object. The reason is that object is a general type that can be anything, and string is specifically for text data. While we can apply all string operations on object data, string is still preferable per the Pandas documentation. Plus, casting is easy enough with astype(): dataframe[column] = dataframe[column].astype('string'), so let us do that anyway. In the cell below, first, I create a list to_string_cols to contain all columns to be casted. Note that I include StudentID here because it is not an actual numeric column as we discussed earlier.

In [55]:

to_string_cols = ['StudentID','FirstName','LastName','Standing']
students[to_string_cols] = students[to_string_cols].astype('string')
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   StudentID  200 non-null    string 
 1   FirstName  200 non-null    string 
 2   LastName   200 non-null    string 
 3   HSGPA      200 non-null    float64
 4   FYGPA      200 non-null    float64
 5   Standing   200 non-null    string 
dtypes: float64(2), string(4)
memory usage: 9.5 KB

Now, we should be ready to start working with strings!

Simple operations with text data

One of the easiest things we can do with strings is to combine them with the + operator. The result of adding strings is a new one with their combined contents, for example, 'hello' + ' ' + 'world' is 'hello world'. For demonstration, I will create a FullName column as FirstName + ' ' + LastName:

In [21]:

students['FullName'] = students['FirstName'] + ' ' + students['LastName']
students.head(n=3)

Out[21]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing	FullName
0	202005537	Eunice	Ehmann	2.47	2.42	average	Eunice Ehmann
1	202008560	Hobert	Schoenberger	2.27	2.05	average	Hobert Schoenberger
2	202004948	Nicholas	Sizer	4.00	3.96	good	Nicholas Sizer

Besides adding, you can actually multiply strings. However, this interaction is only between strings and integer numbers. It simply repeats the string for the number of times to which they multiplied. For example, 'hello' * 3 becomes 'hellohellohello'. One usage of this operator is to pad the strings with some number of characters, however, we will return to this in a bit since it requires the use of the str property.

The `str` property

Pandas provide a big toolset for text data through the str property that can be called from a string column. To begin, str allows text values to be indexed – we can access their characters using a list-like slicing method. For example, I can get the first four letters of the students’ FirstName:

In [25]:

students['FirstName'].str[:4]

Out[25]:

0      Euni
1      Hobe
2      Nich
3      Elvi
       ... 
197    Thad
198    Wrig
199    Sidn
Name: FirstName, Length: 200, dtype: object

My organization actually creates employees’ emails by combining the first letter in their first names, first five letters in their last names, and the last five of the IDs. So, let us try replicate that. By the way, we can split a long line of codes into multiple lines using “\” like in my example below.

In [62]:

students['Email'] = students['FirstName'].str[0] + students['LastName'].str[:5] \
    + students['StudentID'].str[-5:] + '@work.com'
students.head(3)

Out[62]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing	FullName	Email
0	202005537	Eunice	Ehmann	2.47	2.42	average	Eunice Ehmann	[email protected]
1	202008560	Hobert	Schoenberger	2.27	2.05	average	Hobert Schoenberger	[email protected]
2	202004948	Nicholas	Sizer	4.00	3.96	good	Nicholas Sizer	[email protected]

str also allows accesses to numerous string functions. Some useful ones are
– lower() and upper() that change all characters to lowercase or uppercase
– removeprefix() and removesuffix() to remove prefixes or suffixes from texts if applicable
– len() gives the number of characters in the strings
… and many more. You can access the full list of functions at the Pandas documentation.

In [66]:

students['StudentID'].str.removeprefix('2020')

Out[66]:

0      05537
1      08560
2      04948
       ...  
198    01120
199    09418
Name: StudentID, Length: 200, dtype: string

In [67]:

students['Email'].str.removesuffix('.com')

Out[67]:

0      EEhman05537@work
1      HSchoe08560@work
2      NSizer04948@work
             ...       
198    WMarbu01120@work
199    SSienk09418@work
Name: Email, Length: 200, dtype: string

In [68]:

students['FullName'].str.len()

Out[68]:

0      13
1      19
2      14
       ..
198    16
199    18
Name: FullName, Length: 200, dtype: Int64

Split strings into columns

It occurs quite often that you have data with multiple categories consolidated into one single column. For instances, a movie data set may have a genre column with values like comedy,romance, romance,action,thriller, and so on. In such situation, we most likely have to split the values into their own columns for further processing. Fortunately, Pandas know exactly this, and provides us with a split() function! split() takes the delimiter character as an input, and we also need to set expand=True to have new columns (otherwise the result ends up as a list). For demonstration, let me split FullName using space ' ', since we do not have a genre column for students.

In [71]:

students['FullName'].str.split(' ', expand=True)

Out[71]:

	0	1
0	Eunice	Ehmann
1	Hobert	Schoenberger
…	…	…
198	Wright	Marburger
199	Sidney	Sienkiewicz

200 rows × 2 columns

Simple search in text data

Occasionally, we may need to perform pattern searching in a text column. For examples, find customers’ reviews that contain terms like “good”, “bad”, “average”, “excellent”, etc. While pattern searching by itself can be a good topic that last several posts, we can start with some simple search with the str.contains() function. contains() takes an input pattern that represents the term you want to search in your data. The result is a boolearn column with True for a match found, and False otherwise. Like any boolean arrays, we can use the result of contains to slice the dataframe. For example, I will find all students that have the pattern ‘er’ in their full names:

In [69]:

students['FullName'].str.contains('er')

Out[69]:

0      False
1       True
2       True
       ...  
198     True
199    False
Name: FullName, Length: 200, dtype: boolean

In [70]:

students.loc[students['FullName'].str.contains('er'), :]

Out[70]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing	FullName	Email
1	202008560	Hobert	Schoenberger	2.27	2.05	average	Hobert Schoenberger	[email protected]
2	202004948	Nicholas	Sizer	4.00	3.96	good	Nicholas Sizer	[email protected]
7	202005622	Leroy	Pasha	2.22	2.20	average	Leroy Pasha	[email protected]
…	…	…	…	…	…	…	…	…
188	202000076	Jefferson	Shellman	3.59	2.76	average	Jefferson Shellman	[email protected]
189	202008437	Anna	Berkley	2.90	2.78	average	Anna Berkley	[email protected]
198	202001120	Wright	Marburger	2.41	2.11	average	Wright Marburger	[email protected]

A search may sometimes come with a replace. Accordingly, the str property gives us a replace() function as well. Its syntax is replace(old_value, new_value), with old_value being the pattern to be replaced in the inputs, and new_value being the replacing pattern. So let me try changing all the 'er' in students full names to 'or':

In [72]:

students['FullName'].str.replace('er','or')

Out[72]:

0            Eunice Ehmann
1      Hobort Schoenborgor
2           Nicholas Sizor
              ...         
198       Wright Marburgor
199     Sidney Sienkiewicz
Name: FullName, Length: 200, dtype: string

Conclusion

Text data has a lot going for it. In this post, I have just introduced some basics on working with text with Pandas. Hopefully, it helps you somewhat to form an idea on the data type. Please note that this post is only on string fields in tabular data, not the type of analysis that focuses on text that we will also discuss in a bit. So, until next time!