Concatenating Dataframes

an illustration on concatenating dataframes. pandas can perform concatenation on those with mismatched shapes

Previously, we have discussed basic data concatenation with NumPy arrays. In Pandas, concatenating dataframes is also a thing, however with a few differences. The operation no longer requires equal shapes in the concatenate dimensions. Nevertheless, you could get unexpected results, so we will examine its behaviors carefully in this post. I have prepared several data sets to demonstrate several situations that may occur when you concatenate dataframes. Now, let us examine one by one each of the cases. You can download the notebook here.

Horizontally concatenating dataframes

First, we will discuss concatenating dataframes horizontally, which combines rows from the inputs. To do this, we use the function pandas.concat(). Actually, this function works quite differently from numpy.concatenate(). However, for the purpose of merging by rows’ positions, the end results are similar. Specifically, rows at the same positions across the inputs are still combined. There are a few caveats though, so we will go through each one of them next.

The easy case

horizontal concatenation with matched row numbers

As usual, we start by importing and aliasing Pandas. Then, we import the data files into dataframes, in this case, patient_info.csv into patient_info and patient_env.csv into patient_env. Using shape, we can see that patient_info has 20 rows 3 columns, and patient_env 20 rows 2 columns. Next, we use head() and observe that the two dataframes have totally different column names. If rows from the two dataframes belong to the same patients, this is the best case for concatenation.

Horizontally concatenating these two is easy enough. We can use the function pandas.concat(axis=1). The result is a dataframe with all columns from the inputs. Again, this result only makes sense when rows at the same positions from both inputs are from the same patients, so you should always be careful when performing a horizontal concatenation. Upcoming, we have two not-as-nice cases.

The first complicated case

horizontal concatenating dataframes with mismatched row numbers

Now, let us see what happens if the two dataframes have mismatched number of rows. For example, I have patient_env2 with 23 rows that I want to concatenate with patient_info. Drag the ouput down to the end, you can see the last three rows having their patient_id, urgent_care, and hospitalized values as NaN. NaN means “not a number“, and is used by Pandas to indicate missing values. So, values in rows that do not get their match from the other dataframe in a horizontal concatenation become missing. This is almost a guarantee that two data sets should not be concatenated this way, so do think twice before attempting it!

The second complicated case

an illustration of concatenating dataframes that have some similar columns

Another situation to look at is when we try concatenating dataframes with columns having the same names. In this example, we use patient_info2 with 20 rows and 3 columns, however, the column urgent_care is also in patient_info. Interestingly, concatenating them creates two columns urgent_care in the result dataframe. Personally, I do not like having columns of the same names as it can cause confusions later on. This can be solved by changing the columns names before or after merging, but is inconvenient regardless. So, I will get back to this case in the next post about merge() which has more controls over columns’ names.

This case can surely combine with the first one on mismatched number of rows. In that scenario, you should really reconsider concatenation.

Vertically concatenating dataframes

an illustration of vertical concatenation

In vertical concatenting dataframes, we join their columns. Here, the shapes of the inputs matter less because pandas.concat() focuses on columns’ names. In the end result, columns that exist in both inputs will be merged, and values in columns that do not have their counterpart become NaN.

For demonstration, let us use patient_info3.csv. As you can see, the dataframe has 9 rows and 4 columns. patient_id, urgent_care, and hospitalized are in patient_info, but in_state is not. Now, to concatenate these two vertically, we still use pandas.concat(), however with axis=0. After concatenation, in_state of all patients from patient_info are NaN since the dataframe does not have that column to begin with. The other threes merged without any issues.

Concatenating dataframes vertically is safer and is quite common. You may get data consisting different sets of objects in many analysis. Furthermore, dealing with missing values is also easy enough.

What’s next?

I planned to use this post to discuss both concatenation and merging. However as it turns out, concatenation is fairly complicated on its own. Since this is an important operation, I decided to spent all this post on it. So, in the next one, we will explore merging dataframes. See you again!

1 Comment

Comments are closed