an illustration of merging dataframes using matching keys and the four types of treating data without matches

While certainly useful in some cases, concatenating dataframes is fairly problematic because of its strict requirement on row orders. You may end up with wrong and meaningless results even with just one extra row in one of the inputs. To ensure that joined objects indeed belong to each other, merging dataframes should be done with a matching key, which is the topic of this post. So, let us get started!

Merging dataframes with matching keys

an illustration of matching rows that have the same keys

Matching key means a column that can identify objects in the data and presents in both of the dataframes. Examples of these are ID columns such as student ID, employee ID, patient ID, product ID, transaction ID, etc. For demonstration, let us examine Patient ID in the image above. Patient ID defines the patients in the data, and presents in both side of the merge. So, now we can merge the two dataframes by matching Patient ID of the rows. Specifically, rows that have the same Patient ID are combined as below.

As you can see, this method of merging provides much more security than concatenation. Objects in the dataframes can come in any orders, the result is always correct because only rows with matching keys combine. In Pandas, we use the merge() function to join dataframes. We call merge() from one dataframe with the other being input. If the key columns in both side have the same name, we use the option on='<key_name>‘. Now, let us replicate the above example in Pandas. The complete notebook is available here. The two data sets are patients_name.csv and patients_status.csv.

patients_name.csv Download

patients_status.csv Download

In [81]:

import pandas as pd

patients_name = pd.read_csv('patients_name.csv')
patients_status = pd.read_csv('patients_status.csv')

In [85]:

patients_name.head(n=2)

Out[85]:

	PatientID	FirstName	LastName
0	p0002162	Alice	Smith
1	p0010503	Bob	Williams

In [86]:

patients_status.head(n=2)

Out[86]:

	PatientID	UrgentCare	Hospitalized
0	p0010650	Yes	No
1	p0060251	Yes	Yes

And to merge the two:

In [84]:

patients_name.merge(patients_status, on='PatientID')

Out[84]:

	PatientID	FirstName	LastName	UrgentCare	Hospitalized
0	p0002162	Alice	Smith	Yes	No
1	p0010503	Bob	Williams	No	Yes
2	p0060251	Carol	Vinay	Yes	Yes
3	p0000515	Daniel	Traver	No	No
4	p0010650	Emma	Greene	Yes	No

Type of merging dataframes

an illustration of merging dataframes with some rows that have no matching keys

So, how about rows without any matches? There are four different ways of treating them, resulting in four different types of merging (or joining) that we will now discuss. Throughout this section, I will demonstrate Pandas codes with the two data sets above, patients_name_2.csv and patients_status_2.csv.

patients_name_2.csv Download

patients_status_2.csv Download

In [87]:

patients_name = pd.read_csv('patients_name_2.csv')
patients_status = pd.read_csv('patients_status_2.csv')

In [88]:

patients_name

Out[88]:

	PatientID	FirstName	LastName
0	p0002162	Alice	Smith
1	p0010503	Bob	Williams
2	p0060251	Carol	Vinay
3	p0000515	Daniel	Traver

In [89]:

patients_status

Out[89]:

	PatientID	UrgentCare	Hospitalized
0	p0060251	Yes	Yes
1	p0000515	No	No
2	p0010503	No	Yes
3	p0010650	Yes	No

Inner join

In an inner join, only rows that have matching IDs present in the result. Rows without matches from either dataframes are discarded. To perform an inner join with Pandas, we add the option how='inner' to merge().

In [90]:

patients_name.merge(patients_status, how='inner', on='PatientID')

Out[90]:

	PatientID	FirstName	LastName	UrgentCare	Hospitalized
0	p0010503	Bob	Williams	No	Yes
1	p0060251	Carol	Vinay	Yes	Yes
2	p0000515	Daniel	Traver	No	No

Outer join

In an outer join, all rows stay in the results regardless of having matching keys or not. Rows without matches from either dataframes have their missing data filled with NaN. We use the option how='outer' to perform an outer merge.

In [91]:

patients_name.merge(patients_status, how='outer', on='PatientID')

Out[91]:

	PatientID	FirstName	LastName	UrgentCare	Hospitalized
0	p0002162	Alice	Smith	NaN	NaN
1	p0010503	Bob	Williams	No	Yes
2	p0060251	Carol	Vinay	Yes	Yes
3	p0000515	Daniel	Traver	No	No
4	p0010650	NaN	NaN	Yes	No

Left join and Right join

First, let us define the left and right direction. The dataframe where we call merge() is the left one, and the dataframe that is the input of merge() is the right one. In other words, the syntax is always left_dataframe.merge(right_dataframe, <other options>). So, a left or right join keeps all rows with matched keys and rows without matches from the left or right dataframe, respectively. Like outer joins, rows with missing parts are filled with NaN. To use either types, we set the option how='left' or how='right' accordingly.

Merging dataframes with different key names

Lastly, how about merging dataframes of which the key columns have different names? It is actually very easy. Simply replace on= with left='<left_key>' and right='<right_key>' with left and right dataframes defined as previously. For examples, in the patients_status_3.csv data, the key is CaseID instead of PatientID:

patients_status_3.csv Download

In [95]:

patients_status = pd.read_csv('patients_status_3.csv')
patients_status.head(n=2)

Out[95]:

	CaseID	UrgentCare	Hospitalized
0	p0060251	Yes	Yes
1	p0000515	No	No

In [96]:

patients_name.merge(patients_status, left_on='PatientID', right_on='CaseID')

Out[96]:

	PatientID	FirstName	LastName	CaseID	UrgentCare	Hospitalized
0	p0010503	Bob	Williams	p0010503	No	Yes
1	p0060251	Carol	Vinay	p0060251	Yes	Yes
2	p0000515	Daniel	Traver	p0000515	No	No

Of course, in this case, you really need to verify that the two keys indeed refer to the same things. Otherwise, your merged data is pretty much meaningless!

Conclusion

In this post, I discussed the concept of merging dataframes with matching keys, as well as the different types of treating data without matches. Actually, there is a cross-join type as well. However, it is not for the exact same purpose with the other four types, so I will leave that for later. After this post, we should obtain enough knowledge on handling Pandas dataframes and NumPy arrays, and are now ready to start some analysis. See you later then!