
So far, we have only been discussing operations with numbers, so you may start wondering if we would ever talk about text data, right? Sure, why don’t we do that now? The reason I have not been mentioning text data is because, in tabular data, they are usually not as complicated and mostly treated as categorical. Regardless, we should learn about text data in Pandas in case we ever need it. Plus, Pandas provides a big module just for text processing, so it will be a waste not to use. Now, what are we waiting for?
Pandas type for text data
Before we start, you can download the complete notebook here. By default, Pandas read columns having texts into dataframes as object. We can simply try importing any data sets that we have been using and check their types with info(), for example, the students_standing.csv data. Notice how FirstName, LastName, and Standing have their Dtype listed as object.
import pandas as pd
students = pd.read_csv('students_standing.csv')
students.head(n=3)
| StudentID | FirstName | LastName | HSGPA | FYGPA | Standing | |
|---|---|---|---|---|---|---|
| 0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 | average |
| 1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 | average |
| 2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 | good |
students.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 200 entries, 0 to 199 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 StudentID 200 non-null int64 1 FirstName 200 non-null object 2 LastName 200 non-null object 3 HSGPA 200 non-null float64 4 FYGPA 200 non-null float64 5 Standing 200 non-null object dtypes: float64(2), int64(1), object(3) memory usage: 9.5+ KB
However, the official Pandas type for text is string, not object. The reason is that object is a general type that can be anything, and string is specifically for text data. While we can apply all string operations on object data, string is still preferable per the Pandas documentation. Plus, casting is easy enough with astype(): dataframe[column] = dataframe[column].astype('string'), so let us do that anyway. In the cell below, first, I create a list to_string_cols to contain all columns to be casted. Note that I include StudentID here because it is not an actual numeric column as we discussed earlier.
to_string_cols = ['StudentID','FirstName','LastName','Standing']
students[to_string_cols] = students[to_string_cols].astype('string')
students.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 200 entries, 0 to 199 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 StudentID 200 non-null string 1 FirstName 200 non-null string 2 LastName 200 non-null string 3 HSGPA 200 non-null float64 4 FYGPA 200 non-null float64 5 Standing 200 non-null string dtypes: float64(2), string(4) memory usage: 9.5 KB
Now, we should be ready to start working with strings!
Simple operations with text data
One of the easiest things we can do with strings is to combine them with the + operator. The result of adding strings is a new one with their combined contents, for example, 'hello' + ' ' + 'world' is 'hello world'. For demonstration, I will create a FullName column as FirstName + ' ' + LastName:
students['FullName'] = students['FirstName'] + ' ' + students['LastName']
students.head(n=3)
| StudentID | FirstName | LastName | HSGPA | FYGPA | Standing | FullName | |
|---|---|---|---|---|---|---|---|
| 0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 | average | Eunice Ehmann |
| 1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 | average | Hobert Schoenberger |
| 2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 | good | Nicholas Sizer |
Besides adding, you can actually multiply strings. However, this interaction is only between strings and integer numbers. It simply repeats the string for the number of times to which they multiplied. For example, 'hello' * 3 becomes 'hellohellohello'. One usage of this operator is to pad the strings with some number of characters, however, we will return to this in a bit since it requires the use of the str property.
The str property
Pandas provide a big toolset for text data through the str property that can be called from a string column. To begin, str allows text values to be indexed – we can access their characters using a list-like slicing method. For example, I can get the first four letters of the students’ FirstName:
students['FirstName'].str[:4]
0 Euni
1 Hobe
2 Nich
3 Elvi
...
197 Thad
198 Wrig
199 Sidn
Name: FirstName, Length: 200, dtype: object
My organization actually creates employees’ emails by combining the first letter in their first names, first five letters in their last names, and the last five of the IDs. So, let us try replicate that. By the way, we can split a long line of codes into multiple lines using “\” like in my example below.
students['Email'] = students['FirstName'].str[0] + students['LastName'].str[:5] \
+ students['StudentID'].str[-5:] + '@work.com'
students.head(3)
| StudentID | FirstName | LastName | HSGPA | FYGPA | Standing | FullName | ||
|---|---|---|---|---|---|---|---|---|
| 0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 | average | Eunice Ehmann | [email protected] |
| 1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 | average | Hobert Schoenberger | [email protected] |
| 2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 | good | Nicholas Sizer | [email protected] |
str also allows accesses to numerous string functions. Some useful ones are
– lower() and upper() that change all characters to lowercase or uppercase
– removeprefix() and removesuffix() to remove prefixes or suffixes from texts if applicable
– len() gives the number of characters in the strings
… and many more. You can access the full list of functions at the Pandas documentation.
students['StudentID'].str.removeprefix('2020')
0 05537
1 08560
2 04948
...
198 01120
199 09418
Name: StudentID, Length: 200, dtype: string
students['Email'].str.removesuffix('.com')
0 EEhman05537@work
1 HSchoe08560@work
2 NSizer04948@work
...
198 WMarbu01120@work
199 SSienk09418@work
Name: Email, Length: 200, dtype: string
students['FullName'].str.len()
0 13
1 19
2 14
..
198 16
199 18
Name: FullName, Length: 200, dtype: Int64
Split strings into columns
It occurs quite often that you have data with multiple categories consolidated into one single column. For instances, a movie data set may have a genre column with values like comedy,romance, romance,action,thriller, and so on. In such situation, we most likely have to split the values into their own columns for further processing. Fortunately, Pandas know exactly this, and provides us with a split() function! split() takes the delimiter character as an input, and we also need to set expand=True to have new columns (otherwise the result ends up as a list). For demonstration, let me split FullName using space ' ', since we do not have a genre column for students.
students['FullName'].str.split(' ', expand=True)
| 0 | 1 | |
|---|---|---|
| 0 | Eunice | Ehmann |
| 1 | Hobert | Schoenberger |
| … | … | … |
| 198 | Wright | Marburger |
| 199 | Sidney | Sienkiewicz |
200 rows × 2 columns
Simple search in text data
Occasionally, we may need to perform pattern searching in a text column. For examples, find customers’ reviews that contain terms like “good”, “bad”, “average”, “excellent”, etc. While pattern searching by itself can be a good topic that last several posts, we can start with some simple search with the str.contains() function. contains() takes an input pattern that represents the term you want to search in your data. The result is a boolearn column with True for a match found, and False otherwise. Like any boolean arrays, we can use the result of contains to slice the dataframe. For example, I will find all students that have the pattern ‘er’ in their full names:
students['FullName'].str.contains('er')
0 False
1 True
2 True
...
198 True
199 False
Name: FullName, Length: 200, dtype: boolean
students.loc[students['FullName'].str.contains('er'), :]
| StudentID | FirstName | LastName | HSGPA | FYGPA | Standing | FullName | ||
|---|---|---|---|---|---|---|---|---|
| 1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 | average | Hobert Schoenberger | [email protected] |
| 2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 | good | Nicholas Sizer | [email protected] |
| 7 | 202005622 | Leroy | Pasha | 2.22 | 2.20 | average | Leroy Pasha | [email protected] |
| … | … | … | … | … | … | … | … | … |
| 188 | 202000076 | Jefferson | Shellman | 3.59 | 2.76 | average | Jefferson Shellman | [email protected] |
| 189 | 202008437 | Anna | Berkley | 2.90 | 2.78 | average | Anna Berkley | [email protected] |
| 198 | 202001120 | Wright | Marburger | 2.41 | 2.11 | average | Wright Marburger | [email protected] |
A search may sometimes come with a replace. Accordingly, the str property gives us a replace() function as well. Its syntax is replace(old_value, new_value), with old_value being the pattern to be replaced in the inputs, and new_value being the replacing pattern. So let me try changing all the 'er' in students full names to 'or':
students['FullName'].str.replace('er','or')
0 Eunice Ehmann
1 Hobort Schoenborgor
2 Nicholas Sizor
...
198 Wright Marburgor
199 Sidney Sienkiewicz
Name: FullName, Length: 200, dtype: string
Conclusion
Text data has a lot going for it. In this post, I have just introduced some basics on working with text with Pandas. Hopefully, it helps you somewhat to form an idea on the data type. Please note that this post is only on string fields in tabular data, not the type of analysis that focuses on text that we will also discuss in a bit. So, until next time!