While certainly useful in some cases, concatenating dataframes is fairly problematic because of its strict requirement on row orders. You may end up with wrong and meaningless results even with just one extra row in one of the inputs. To ensure that joined objects indeed belong to each other, merging dataframes should be done with a matching key, which is the topic of this post. So, let us get started!
Merging dataframes with matching keys
Matching key means a column that can identify objects in the data and presents in both of the dataframes. Examples of these are ID columns such as student ID, employee ID, patient ID, product ID, transaction ID, etc. For demonstration, let us examine Patient ID
in the image above. Patient ID
defines the patients in the data, and presents in both side of the merge. So, now we can merge the two dataframes by matching Patient ID
of the rows. Specifically, rows that have the same Patient ID are combined as below.
As you can see, this method of merging provides much more security than concatenation. Objects in the dataframes can come in any orders, the result is always correct because only rows with matching keys combine. In Pandas, we use the merge()
function to join dataframes. We call merge()
from one dataframe with the other being input. If the key columns in both side have the same name, we use the option on='<key_name>
‘. Now, let us replicate the above example in Pandas. The complete notebook is available here. The two data sets are patients_name.csv
and patients_status.csv
.
import pandas as pd
patients_name = pd.read_csv('patients_name.csv')
patients_status = pd.read_csv('patients_status.csv')
patients_name.head(n=2)
PatientID | FirstName | LastName | |
---|---|---|---|
0 | p0002162 | Alice | Smith |
1 | p0010503 | Bob | Williams |
patients_status.head(n=2)
PatientID | UrgentCare | Hospitalized | |
---|---|---|---|
0 | p0010650 | Yes | No |
1 | p0060251 | Yes | Yes |
And to merge the two:
patients_name.merge(patients_status, on='PatientID')
PatientID | FirstName | LastName | UrgentCare | Hospitalized | |
---|---|---|---|---|---|
0 | p0002162 | Alice | Smith | Yes | No |
1 | p0010503 | Bob | Williams | No | Yes |
2 | p0060251 | Carol | Vinay | Yes | Yes |
3 | p0000515 | Daniel | Traver | No | No |
4 | p0010650 | Emma | Greene | Yes | No |
Type of merging dataframes
So, how about rows without any matches? There are four different ways of treating them, resulting in four different types of merging (or joining) that we will now discuss. Throughout this section, I will demonstrate Pandas codes with the two data sets above, patients_name_2.csv
and patients_status_2.csv
.
patients_name = pd.read_csv('patients_name_2.csv')
patients_status = pd.read_csv('patients_status_2.csv')
patients_name
PatientID | FirstName | LastName | |
---|---|---|---|
0 | p0002162 | Alice | Smith |
1 | p0010503 | Bob | Williams |
2 | p0060251 | Carol | Vinay |
3 | p0000515 | Daniel | Traver |
patients_status
PatientID | UrgentCare | Hospitalized | |
---|---|---|---|
0 | p0060251 | Yes | Yes |
1 | p0000515 | No | No |
2 | p0010503 | No | Yes |
3 | p0010650 | Yes | No |
Inner join
In an inner join, only rows that have matching IDs present in the result. Rows without matches from either dataframes are discarded. To perform an inner join with Pandas, we add the option how='inner'
to merge()
.
patients_name.merge(patients_status, how='inner', on='PatientID')
PatientID | FirstName | LastName | UrgentCare | Hospitalized | |
---|---|---|---|---|---|
0 | p0010503 | Bob | Williams | No | Yes |
1 | p0060251 | Carol | Vinay | Yes | Yes |
2 | p0000515 | Daniel | Traver | No | No |
Outer join
In an outer join, all rows stay in the results regardless of having matching keys or not. Rows without matches from either dataframes have their missing data filled with NaN
. We use the option how='outer'
to perform an outer merge.
patients_name.merge(patients_status, how='outer', on='PatientID')
PatientID | FirstName | LastName | UrgentCare | Hospitalized | |
---|---|---|---|---|---|
0 | p0002162 | Alice | Smith | NaN | NaN |
1 | p0010503 | Bob | Williams | No | Yes |
2 | p0060251 | Carol | Vinay | Yes | Yes |
3 | p0000515 | Daniel | Traver | No | No |
4 | p0010650 | NaN | NaN | Yes | No |
Left join and Right join
First, let us define the left and right direction. The dataframe where we call merge() is the left one, and the dataframe that is the input of merge() is the right one. In other words, the syntax is always left_dataframe.merge(right_dataframe, <other options>)
. So, a left or right join keeps all rows with matched keys and rows without matches from the left or right dataframe, respectively. Like outer joins, rows with missing parts are filled with NaN. To use either types, we set the option how='left'
or how='right'
accordingly.
Merging dataframes with different key names
Lastly, how about merging dataframes of which the key columns have different names? It is actually very easy. Simply replace on=
with left='<left_key>'
and right='<right_key>'
with left and right dataframes defined as previously. For examples, in the patients_status_3.csv
data, the key is CaseID
instead of PatientID
:
patients_status = pd.read_csv('patients_status_3.csv')
patients_status.head(n=2)
CaseID | UrgentCare | Hospitalized | |
---|---|---|---|
0 | p0060251 | Yes | Yes |
1 | p0000515 | No | No |
patients_name.merge(patients_status, left_on='PatientID', right_on='CaseID')
PatientID | FirstName | LastName | CaseID | UrgentCare | Hospitalized | |
---|---|---|---|---|---|---|
0 | p0010503 | Bob | Williams | p0010503 | No | Yes |
1 | p0060251 | Carol | Vinay | p0060251 | Yes | Yes |
2 | p0000515 | Daniel | Traver | p0000515 | No | No |
Of course, in this case, you really need to verify that the two keys indeed refer to the same things. Otherwise, your merged data is pretty much meaningless!
Conclusion
In this post, I discussed the concept of merging dataframes with matching keys, as well as the different types of treating data without matches. Actually, there is a cross-join type as well. However, it is not for the exact same purpose with the other four types, so I will leave that for later. After this post, we should obtain enough knowledge on handling Pandas dataframes and NumPy arrays, and are now ready to start some analysis. See you later then!