an illustration on concatenating dataframes. pandas can perform concatenation on those with mismatched shapes

Previously, we discussed basic data concatenation with NumPy arrays. Pandas can concatenate dataframes too, but with a few differences. The operation no longer requires equal shapes along the concatenation dimension. Nevertheless, you can get unexpected results, so we will examine its behavior carefully in this post. I have prepared several data sets to demonstrate situations that may occur when you concatenate dataframes. Now, let us examine the cases one by one. You can download the notebook here.

Horizontally concatenating dataframes

First, we will discuss concatenating dataframes horizontally, which places the inputs side by side and combines their rows. To do this, we use the function pandas.concat(). Actually, this function works quite differently from numpy.concatenate(): it matches rows by their index labels rather than by their raw positions. However, when the dataframes keep the default index created by read_csv(), labels and positions coincide, so the end results are similar. Specifically, rows at the same positions across the inputs are still combined. There are a few caveats though, so we will go through each of them next.
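To make the difference from numpy.concatenate() concrete, here is a minimal sketch with two toy dataframes. The names left and right and their values are made up for illustration and are not part of the patient data used in this post. It shows that pandas.concat() pairs rows by index label, and that resetting the index is one way to force pairing by position.

```python
import pandas as pd

# Hypothetical toy data, not the patient files used in this post.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": ["x", "y", "z"]}, index=[2, 1, 0])

# concat matches rows by index label, so label 0 on the left is paired
# with label 0 on the right (value "z"), not with the first row "x".
by_label = pd.concat([left, right], axis=1)

# Dropping the existing labels first gives a purely positional pairing.
by_position = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)

print(by_label.loc[0, "b"])     # z
print(by_position.loc[0, "b"])  # x
```

With files loaded by read_csv() and left untouched, both approaches give the same result, which is the situation in the rest of this post.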

The easy case

horizontal concatenation with matched row numbers

As usual, we start by importing and aliasing Pandas. Then, we load the data files into dataframes: patient_info.csv into patient_info and patient_env.csv into patient_env. Using shape, we can see that patient_info has 20 rows and 3 columns, and patient_env has 20 rows and 2 columns. Next, we use head() and observe that the two dataframes have totally different column names. If rows at the same positions in the two dataframes belong to the same patients, this is the best case for concatenation.

patient_info.csv Download

patient_env.csv Download

In [36]:

import pandas as pd

patient_info = pd.read_csv('patient_info.csv')
patient_env = pd.read_csv('patient_env.csv')

patient_info.shape, patient_env.shape

Out[36]:

((20, 3), (20, 2))

In [37]:

patient_info.head(n=2)

Out[37]:

	patient_id	urgent_care	hospitalized
0	10001	No	Yes
1	10002	No	No

In [38]:

patient_env.head(n=2)

Out[38]:

	work_type	residence_type
0	Private	Urban
1	Private	Rural

Horizontally concatenating these two is easy enough. We call pandas.concat() with a list of the dataframes and axis=1. The result is a dataframe with all columns from the inputs. Again, this result only makes sense when rows at the same positions in both inputs are from the same patients, so you should always be careful when performing a horizontal concatenation. Next, we have two not-as-nice cases.

In [40]:

pd.concat([patient_info, patient_env], axis=1)

Out[40]:

	patient_id	urgent_care	hospitalized	work_type	residence_type
0	10001	No	Yes	Private	Urban
1	10002	No	No	Private	Rural
2	10003	Yes	Yes	Private	Urban
…	…	…	…	…	…
17	10018	Yes	Yes	Self-employed	Urban
18	10019	Yes	Yes	Self-employed	Urban
19	10020	No	No	Private	Rural
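Since the result is only meaningful when the rows line up, it can help to check that before concatenating. Below is a rough sketch of such a guard. The helper name concat_columns_checked and the two small dataframes are my own stand-ins, not something Pandas provides. Note that it can only verify row counts and index labels; it cannot know whether the rows really describe the same patients.

```python
import pandas as pd


def concat_columns_checked(left, right):
    """Concatenate side by side only if both frames share the same index.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    if len(left) != len(right):
        raise ValueError(f"row counts differ: {len(left)} vs {len(right)}")
    if not left.index.equals(right.index):
        raise ValueError("indexes differ; rows would not line up")
    return pd.concat([left, right], axis=1)


# Toy stand-ins for patient_info and patient_env.
info = pd.DataFrame(
    {"patient_id": [10001, 10002], "urgent_care": ["No", "No"]}
)
env = pd.DataFrame(
    {"work_type": ["Private", "Private"], "residence_type": ["Urban", "Rural"]}
)

combined = concat_columns_checked(info, env)
print(combined.shape)  # (2, 4)
```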

The first complicated case

horizontal concatenating dataframes with mismatched row numbers

Now, let us see what happens if the two dataframes have different numbers of rows. For example, I have patient_env2 with 23 rows that I want to concatenate with patient_info. Scroll to the end of the output and you can see that the last three rows have NaN for patient_id, urgent_care, and hospitalized. NaN means "not a number", and is used by Pandas to indicate missing values. So, values in rows that do not get a match from the other dataframe in a horizontal concatenation become missing. Notice also that patient_id is now displayed as 10001.0: an integer column cannot hold NaN, so Pandas converts it to float. A result like this is a strong sign that the two data sets should not be concatenated this way, so do think twice before attempting it!

patient_env2.csv Download

In [41]:

patient_env2 = pd.read_csv('patient_env2.csv')
patient_env2.shape

Out[41]:

(23, 2)

In [43]:

pd.concat([patient_info, patient_env2], axis=1)

Out[43]:

	patient_id	urgent_care	hospitalized	work_type	residence_type
0	10001.0	No	Yes	Private	Urban
1	10002.0	No	No	Private	Rural
2	10003.0	Yes	Yes	Private	Urban
…	…	…	…	…	…
18	10019.0	Yes	Yes	Self-employed	Urban
19	10020.0	No	No	Private	Rural
20	NaN	NaN	NaN	Govt_job	Urban
21	NaN	NaN	NaN	Private	Rural
22	NaN	NaN	NaN	Govt_job	Urban
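The behavior above can be reproduced on a small scale. The sketch below uses two made-up dataframes (short_df and long_df are illustrative names) and also shows the join parameter of pandas.concat(): the default "outer" keeps every row label and fills the gaps with NaN, while "inner" keeps only the labels present in all inputs. Whether dropping the unmatched rows is acceptable depends on your data, so treat this as an option to be aware of rather than a fix.

```python
import pandas as pd

# Toy data with mismatched row counts (2 rows vs 3 rows).
short_df = pd.DataFrame({"patient_id": [10001, 10002]})
long_df = pd.DataFrame({"work_type": ["Private", "Govt_job", "Private"]})

# Default join="outer": all row labels are kept, gaps become NaN.
outer = pd.concat([short_df, long_df], axis=1)

# join="inner": only row labels found in every input are kept.
inner = pd.concat([short_df, long_df], axis=1, join="inner")

print(outer.shape)                # (3, 2)
print(outer["patient_id"].dtype)  # float64, because of the NaN
print(outer.isna().sum())         # count of missing values per column
print(inner.shape)                # (2, 2)
```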

The second complicated case

an illustration of concatenating dataframes that have some similar columns

Another situation to look at is when we try concatenating dataframes that have columns with the same names. In this example, we use patient_info2 with 20 rows and 3 columns; however, its column urgent_care is also in patient_info. Interestingly, concatenating them creates two urgent_care columns in the resulting dataframe. Personally, I do not like having columns with the same name, as it can cause confusion later on. This can be solved by changing the column names before or after concatenating, but it is inconvenient regardless. So, I will get back to this case in the next post about merge(), which offers more control over column names.

patient_info2.csv Download

In [46]:

patient_info2 = pd.read_csv('patient_info2.csv')
patient_info2.shape

Out[46]:

(20, 3)

In [45]:

patient_info2.head(n=2)

Out[45]:

	urgent_care	work_type	residence_type
0	No	Private	Urban
1	No	Private	Rural

In [52]:

pd.concat([patient_info, patient_info2], axis=1)

Out[52]:

	patient_id	urgent_care	hospitalized	urgent_care	work_type	residence_type
0	10001	No	Yes	No	Private	Urban
1	10002	No	No	No	Private	Rural
2	10003	Yes	Yes	Yes	Private	Urban
…	…	…	…	…	…	…
18	10019	Yes	Yes	Yes	Self-employed	Urban
19	10020	No	No	No	Private	Rural

This case can of course occur together with the first one (mismatched numbers of rows). In that scenario, you should really reconsider concatenation.
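To see why duplicated column names are awkward, here is a small sketch on made-up data (info and info2 are toy stand-ins for patient_info and patient_info2). It lists the repeated names, shows that selecting a repeated name returns a dataframe instead of a single column, and tries one possible workaround: the keys parameter of pandas.concat(), which labels each input and produces two-level column names. The labels "info" and "info2" are arbitrary choices of mine.

```python
import pandas as pd

# Toy data sharing the column name "urgent_care".
info = pd.DataFrame(
    {"patient_id": [10001, 10002], "urgent_care": ["No", "Yes"]}
)
info2 = pd.DataFrame(
    {"urgent_care": ["No", "Yes"], "work_type": ["Private", "Private"]}
)

combined = pd.concat([info, info2], axis=1)

# Column names that occur more than once in the result.
dupes = combined.columns[combined.columns.duplicated()].tolist()
print(dupes)  # ['urgent_care']

# Selecting a duplicated name gives back both columns as a DataFrame.
print(type(combined["urgent_care"]))

# One workaround: tag each input with keys= to get two-level column names.
labeled = pd.concat([info, info2], axis=1, keys=["info", "info2"])
print(labeled[("info2", "urgent_care")].tolist())  # ['No', 'Yes']
```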

Vertically concatenating dataframes

an illustration of vertical concatenation

When concatenating dataframes vertically, we stack their rows and match their columns by name. Here, the shapes of the inputs matter less because pandas.concat() focuses on column names. In the end result, columns that exist in both inputs are merged, and values in columns that do not have a counterpart in the other input become NaN.

For demonstration, let us use patient_info3.csv. As you can see, the dataframe has 9 rows and 4 columns. patient_id, urgent_care, and hospitalized are in patient_info, but in_state is not. Now, to concatenate these two vertically, we still use pandas.concat(), but with axis=0. After concatenation, in_state is NaN for all patients from patient_info since that dataframe does not have the column to begin with. The other three columns merged without any issues.

patient_info3.csv Download

In [55]:

patient_info3 = pd.read_csv('patient_info3.csv')
patient_info3.shape

Out[55]:

(9, 4)

In [57]:

patient_info3.head(n=2)

Out[57]:

	patient_id	urgent_care	hospitalized	in_state
0	10021	No	Yes	Yes
1	10022	No	Yes	Yes

In [59]:

pd.concat([patient_info, patient_info3], axis=0)

Out[59]:

	patient_id	urgent_care	hospitalized	in_state
0	10001	No	Yes	NaN
1	10002	No	No	NaN
2	10003	Yes	Yes	NaN
…	…	…	…	…
18	10019	Yes	Yes	NaN
19	10020	No	No	NaN
0	10021	No	Yes	Yes
1	10022	No	Yes	Yes
2	10023	Yes	Yes	Yes
…	…	…	…	…
7	10028	No	No	Yes
8	10029	No	No	No

Concatenating dataframes vertically is safer and is quite common. In many analyses, you get data on different sets of objects in separate files. Furthermore, dealing with the resulting missing values is also easy enough.
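Two follow-up steps are often useful after a vertical concatenation, sketched below on made-up data (first and second are toy stand-ins for patient_info and patient_info3). By default the original row labels are kept, so labels repeat in the result, as in the output above where 0, 1, 2 appear twice. Passing ignore_index=True renumbers the rows. The missing values can then be filled with fillna(); the placeholder "Unknown" is only an assumption for this example, and the right choice depends on your analysis.

```python
import pandas as pd

# Toy data: the second frame has an extra column, in_state.
first = pd.DataFrame(
    {"patient_id": [10001, 10002], "urgent_care": ["No", "No"]}
)
second = pd.DataFrame(
    {"patient_id": [10021], "urgent_care": ["No"], "in_state": ["Yes"]}
)

# Default: original row labels are kept, so they repeat.
stacked = pd.concat([first, second], axis=0)
print(stacked.index.tolist())  # [0, 1, 0]

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([first, second], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# Fill the gaps in in_state with a placeholder (an assumption, see above).
filled = renumbered.fillna({"in_state": "Unknown"})
print(filled["in_state"].tolist())  # ['Unknown', 'Unknown', 'Yes']
```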

What’s next?

I planned to use this post to discuss both concatenation and merging. However, as it turns out, concatenation is fairly complicated on its own. Since this is an important operation, I decided to spend this whole post on it. So, in the next one, we will explore merging dataframes. See you again!

Concatenating Dataframes