So far, we have only been discussing operations with numbers, so you may start wondering if we would ever talk about text data, right? Sure, why don’t we do that now? The reason I have not been mentioning text data is because, in tabular data, they are usually not as complicated and mostly treated as categorical. Regardless, we should learn about text data in Pandas in case we ever need it. Plus, Pandas provides a big module just for text processing, so it will be a waste not to use. Now, what are we waiting for?
Pandas type for text data
Before we start, you can download the complete notebook here. By default, Pandas read columns having texts into dataframes as object
. We can simply try importing any data sets that we have been using and check their types with info()
, for example, the students_standing.csv
data. Notice how FirstName
, LastName
, and Standing
have their Dtype
listed as object
.
import pandas as pd
students = pd.read_csv('students_standing.csv')
students.head(n=3)
StudentID | FirstName | LastName | HSGPA | FYGPA | Standing | |
---|---|---|---|---|---|---|
0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 | average |
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 | average |
2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 | good |
students.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 200 entries, 0 to 199 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 StudentID 200 non-null int64 1 FirstName 200 non-null object 2 LastName 200 non-null object 3 HSGPA 200 non-null float64 4 FYGPA 200 non-null float64 5 Standing 200 non-null object dtypes: float64(2), int64(1), object(3) memory usage: 9.5+ KB
However, the official Pandas type for text is string
, not object
. The reason is that object
is a general type that can be anything, and string
is specifically for text data. While we can apply all string operations on object data, string is still preferable per the Pandas documentation. Plus, casting is easy enough with astype()
: dataframe[column] = dataframe[column].astype('string')
, so let us do that anyway. In the cell below, first, I create a list to_string_cols
to contain all columns to be casted. Note that I include StudentID here because it is not an actual numeric column as we discussed earlier.
to_string_cols = ['StudentID','FirstName','LastName','Standing']
students[to_string_cols] = students[to_string_cols].astype('string')
students.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 200 entries, 0 to 199 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 StudentID 200 non-null string 1 FirstName 200 non-null string 2 LastName 200 non-null string 3 HSGPA 200 non-null float64 4 FYGPA 200 non-null float64 5 Standing 200 non-null string dtypes: float64(2), string(4) memory usage: 9.5 KB
Now, we should be ready to start working with strings!
Simple operations with text data
One of the easiest things we can do with strings is to combine them with the +
operator. The result of adding strings is a new one with their combined contents, for example, 'hello' + ' ' + 'world'
is 'hello world'
. For demonstration, I will create a FullName
column as FirstName + ' ' + LastName
:
students['FullName'] = students['FirstName'] + ' ' + students['LastName']
students.head(n=3)
StudentID | FirstName | LastName | HSGPA | FYGPA | Standing | FullName | |
---|---|---|---|---|---|---|---|
0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 | average | Eunice Ehmann |
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 | average | Hobert Schoenberger |
2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 | good | Nicholas Sizer |
Besides adding, you can actually multiply strings. However, this interaction is only between strings and integer numbers. It simply repeats the string for the number of times to which they multiplied. For example, 'hello' * 3
becomes 'hellohellohello'
. One usage of this operator is to pad the strings with some number of characters, however, we will return to this in a bit since it requires the use of the str
property.
The str
property
Pandas provide a big toolset for text data through the str
property that can be called from a string column. To begin, str
allows text values to be indexed – we can access their characters using a list-like slicing method. For example, I can get the first four letters of the students’ FirstName
:
students['FirstName'].str[:4]
0 Euni 1 Hobe 2 Nich 3 Elvi ... 197 Thad 198 Wrig 199 Sidn Name: FirstName, Length: 200, dtype: object
My organization actually creates employees’ emails by combining the first letter in their first names, first five letters in their last names, and the last five of the IDs. So, let us try replicate that. By the way, we can split a long line of codes into multiple lines using “\” like in my example below.
students['Email'] = students['FirstName'].str[0] + students['LastName'].str[:5] \
+ students['StudentID'].str[-5:] + '@work.com'
students.head(3)
StudentID | FirstName | LastName | HSGPA | FYGPA | Standing | FullName | ||
---|---|---|---|---|---|---|---|---|
0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 | average | Eunice Ehmann | [email protected] |
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 | average | Hobert Schoenberger | [email protected] |
2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 | good | Nicholas Sizer | [email protected] |
str
also allows accesses to numerous string functions. Some useful ones are
– lower()
and upper()
that change all characters to lowercase or uppercase
– removeprefix()
and removesuffix()
to remove prefixes or suffixes from texts if applicable
– len()
gives the number of characters in the strings
… and many more. You can access the full list of functions at the Pandas documentation.
students['StudentID'].str.removeprefix('2020')
0 05537 1 08560 2 04948 ... 198 01120 199 09418 Name: StudentID, Length: 200, dtype: string
students['Email'].str.removesuffix('.com')
0 EEhman05537@work 1 HSchoe08560@work 2 NSizer04948@work ... 198 WMarbu01120@work 199 SSienk09418@work Name: Email, Length: 200, dtype: string
students['FullName'].str.len()
0 13 1 19 2 14 .. 198 16 199 18 Name: FullName, Length: 200, dtype: Int64
Split strings into columns
It occurs quite often that you have data with multiple categories consolidated into one single column. For instances, a movie data set may have a genre
column with values like comedy,romance
, romance,action,thriller
, and so on. In such situation, we most likely have to split the values into their own columns for further processing. Fortunately, Pandas know exactly this, and provides us with a split()
function! split()
takes the delimiter character as an input, and we also need to set expand=True
to have new columns (otherwise the result ends up as a list). For demonstration, let me split FullName
using space ' '
, since we do not have a genre column for students.
students['FullName'].str.split(' ', expand=True)
0 | 1 | |
---|---|---|
0 | Eunice | Ehmann |
1 | Hobert | Schoenberger |
… | … | … |
198 | Wright | Marburger |
199 | Sidney | Sienkiewicz |
200 rows × 2 columns
Simple search in text data
Occasionally, we may need to perform pattern searching in a text column. For examples, find customers’ reviews that contain terms like “good”, “bad”, “average”, “excellent”, etc. While pattern searching by itself can be a good topic that last several posts, we can start with some simple search with the str.contains()
function. contains()
takes an input pattern
that represents the term you want to search in your data. The result is a boolearn column with True
for a match found, and False
otherwise. Like any boolean
arrays, we can use the result of contains to slice the dataframe. For example, I will find all students that have the pattern ‘er’ in their full names:
students['FullName'].str.contains('er')
0 False 1 True 2 True ... 198 True 199 False Name: FullName, Length: 200, dtype: boolean
students.loc[students['FullName'].str.contains('er'), :]
StudentID | FirstName | LastName | HSGPA | FYGPA | Standing | FullName | ||
---|---|---|---|---|---|---|---|---|
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 | average | Hobert Schoenberger | [email protected] |
2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 | good | Nicholas Sizer | [email protected] |
7 | 202005622 | Leroy | Pasha | 2.22 | 2.20 | average | Leroy Pasha | [email protected] |
… | … | … | … | … | … | … | … | … |
188 | 202000076 | Jefferson | Shellman | 3.59 | 2.76 | average | Jefferson Shellman | [email protected] |
189 | 202008437 | Anna | Berkley | 2.90 | 2.78 | average | Anna Berkley | [email protected] |
198 | 202001120 | Wright | Marburger | 2.41 | 2.11 | average | Wright Marburger | [email protected] |
A search may sometimes come with a replace. Accordingly, the str property gives us a replace()
function as well. Its syntax is replace(old_value, new_value)
, with old_value
being the pattern to be replaced in the inputs, and new_value being the replacing pattern. So let me try changing all the 'er'
in students full names to 'or'
:
students['FullName'].str.replace('er','or')
0 Eunice Ehmann 1 Hobort Schoenborgor 2 Nicholas Sizor ... 198 Wright Marburgor 199 Sidney Sienkiewicz Name: FullName, Length: 200, dtype: string
Conclusion
Text data has a lot going for it. In this post, I have just introduced some basics on working with text with Pandas. Hopefully, it helps you somewhat to form an idea on the data type. Please note that this post is only on string fields in tabular data, not the type of analysis that focuses on text that we will also discuss in a bit. So, until next time!