Merging Dataframes

an illustration of merging dataframes using matching keys and the four types of treating data without matches

While certainly useful in some cases, concatenating dataframes is fairly problematic because of its strict requirement on row orders. You may end up with wrong and meaningless results even with just one extra row in one of the inputs. To ensure that joined objects indeed belong to each other, merging dataframes should be done with a matching key, which is the topic of this post. So, let us get started!

Merging dataframes with matching keys

an illustration of matching rows that have the same keys

Matching key means a column that can identify objects in the data and presents in both of the dataframes. Examples of these are ID columns such as student ID, employee ID, patient ID, product ID, transaction ID, etc. For demonstration, let us examine Patient ID in the image above. Patient ID defines the patients in the data, and presents in both side of the merge. So, now we can merge the two dataframes by matching Patient ID of the rows. Specifically, rows that have the same Patient ID are combined as below.

an illustration of a merged dataframe

As you can see, this method of merging provides much more security than concatenation. Objects in the dataframes can come in any orders, the result is always correct because only rows with matching keys combine. In Pandas, we use the merge() function to join dataframes. We call merge() from one dataframe with the other being input. If the key columns in both side have the same name, we use the option on='<key_name>‘. Now, let us replicate the above example in Pandas. The complete notebook is available here. The two data sets are patients_name.csv and patients_status.csv.

And to merge the two:

Type of merging dataframes

an illustration of merging dataframes with some rows that have no matching keys

So, how about rows without any matches? There are four different ways of treating them, resulting in four different types of merging (or joining) that we will now discuss. Throughout this section, I will demonstrate Pandas codes with the two data sets above, patients_name_2.csv and patients_status_2.csv.

Inner join

an illustration of inner join

In an inner join, only rows that have matching IDs present in the result. Rows without matches from either dataframes are discarded. To perform an inner join with Pandas, we add the option how='inner' to merge().

Outer join

an illustration of outer join

In an outer join, all rows stay in the results regardless of having matching keys or not. Rows without matches from either dataframes have their missing data filled with NaN. We use the option how='outer' to perform an outer merge.

Left join and Right join

an illustration of left join
an illustration of right join

First, let us define the left and right direction. The dataframe where we call merge() is the left one, and the dataframe that is the input of merge() is the right one. In other words, the syntax is always left_dataframe.merge(right_dataframe, <other options>). So, a left or right join keeps all rows with matched keys and rows without matches from the left or right dataframe, respectively. Like outer joins, rows with missing parts are filled with NaN. To use either types, we set the option how='left' or how='right' accordingly.

Merging dataframes with different key names

Lastly, how about merging dataframes of which the key columns have different names? It is actually very easy. Simply replace on= with left='<left_key>' and right='<right_key>' with left and right dataframes defined as previously. For examples, in the patients_status_3.csv data, the key is CaseID instead of PatientID:

Of course, in this case, you really need to verify that the two keys indeed refer to the same things. Otherwise, your merged data is pretty much meaningless!

Conclusion

In this post, I discussed the concept of merging dataframes with matching keys, as well as the different types of treating data without matches. Actually, there is a cross-join type as well. However, it is not for the exact same purpose with the other four types, so I will leave that for later. After this post, we should obtain enough knowledge on handling Pandas dataframes and NumPy arrays, and are now ready to start some analysis. See you later then!