Joining datasets is a crucial skill when working with health-related
data as it allows you to combine information from multiple sources,
leading to more comprehensive and insightful analyses. In this lesson,
you’ll learn how to use different joining techniques using R’s
dplyr
package. Let’s get started!
You understand how each of the different dplyr
joins
work: left, right, inner and full.
You’re able to choose the appropriate join for your data
You can join simple datasets together using functions from
dplyr
‣ Please load the packages needed for this lesson
‣ To illustrate the utility of joins, let’s start with a toy example.
‣ Consider the following two datasets: demographic
and
test_info
‣ demographic
: Contains names and ages of three
patients
‣ test_info
: Contains tuberculosis test dates and
results for those patients
test_info <-
tribble(~name, ~test_date, ~result,
"Alice", "2023-06-05", "Negative",
"Bob", "2023-08-10", "Positive",
"Charlie", "2023-07-15", "Negative")
test_info
‣ We’d like to analyze these data together, thus we
need to combine them. ‣ One option: cbind()
function from
base R
‣ Issue: Name column appears twice
‣ What if the rows in the two datasets are not already aligned?
‣ For example: test_info_disordered
test_info_disordered <-
tribble(~name, ~test_date, ~result,
"Bob", "2023-08-10", "Positive", # Bob in first row
"Alice", "2023-06-05", "Negative",
"Charlie", "2023-07-15", "Negative")
test_info_disordered
‣ The data is combined incorrectly. Very naive
‣ A third issue: One-to-many relationship
‣ Example: Alice did multiple TB tests
test_info_multiple <-
tribble(~name, ~test_date, ~result,
"Alice", "2023-06-05", "Negative",
"Alice", "2023-06-06", "Negative",
"Bob", "2023-08-10", "Positive",
"Charlie", "2023-07-15", "Negative")
test_info_multiple
‣ cbind()
would not work here due to a mismatch in row
counts:
Vocab Time
‣ One-to-many relationship: One entity in one dataset, multiple matched entities in another dataset. Will consider in future lesson.
‣ We need a smarter way to combine datasets
left_join()
‣ Solves the problems encountered with cbind()
‣ Simple Case: Works when datasets are perfectly matched.
## Joining with `by = join_by(name)`
‣ Does not duplicate the name column.
‣ Disordered Data: Works even if the datasets are not in the same order
## Joining with `by = join_by(name)`
‣ One to Many: Can handle multiple entries for a single entity
## Joining with `by = join_by(name)`
‣ Efficiency and Clarity: Simple yet powerful
Piping with left_join()
‣ Using the pipe operator %>%
with
left_join()
## Joining with `by = join_by(name)`
## Joining with `by = join_by(name)`
‣ Joins operate on two dataframes: x
(the left
dataframe) and y
(the right dataframe).
‣ You can input these dataframes either as named or unnamed arguments:
# Let's try both ways:
left_join(x= demographic, y= test_info) # named
left_join(demographic, test_info) # unnamed
‣ The by
argument indicates the key for
connecting tables. Sometimes it is not needed:
# these are equivalent
left_join(x = demographic, y = test_info)
left_join(x = demographic, y = test_info, by = "name")
‣ Sometimes, it’s not necessary to supply by
; it can be
inferred from common columns.
# Here, "name" is the common column:
left_join(x = demographic, y = test_info)
left_join(x = demographic, y = test_info, by = "name")
‣ by
is sometimes required: what if keys are
named differently in the two datasets?
test_info_different_name <-
tribble(~test_recipient, ~test_date, ~result, # replace `name` with different word
"Alice", "2023-06-05", "Negative",
"Bob", "2023-08-10", "Positive",
"Charlie", "2023-07-15", "Negative")
test_info_different_name
‣ Attempting to join test_info_different_name
with
demographic
will lead to an error:
‣ Why? No obvious key
‣ Either rename the column or specify columns to
join on using by = c()
.
‣ c("name" = "test_recipient")
tells R to connect
name
from data frame x with test_recipient
from data frame y.
(NOTE: Answers are at the bottom of the page. Try to answer the questions yourself before checking.)
Left Join Patients and Checkups
Consider the two datasets below, one with patient details and the other with medical check-up dates for these patients.
patients <- tribble(
~patient_id, ~name, ~age,
1, "John", 32,
2, "Joy", 28,
3, "Khan", 40
)
checkups <- tribble(
~patient_id, ~checkup_date,
1, "2023-01-20",
2, "2023-02-20",
3, "2023-05-15"
)
Join the patients
dataset with the checkups
dataset using left_join()
## Joining with `by = join_by(patient_id)`
Left Join with by Argument
Two datasets are defined below, one with patient details and the other with vaccination records for those patients.
# Patient Details
patient_details <- tribble(
~id_number, ~full_name, ~address,
"A001", "Alice", "123 Elm St",
"B002", "Bob", "456 Maple Dr",
"C003", "Charlie", "789 Oak Blvd"
)
# Vaccination Records
vaccination_records <- tribble(
~patient_code, ~vaccine_type, ~vaccination_date,
"A001", "COVID-19", "2022-05-10",
"B002", "Flu", "2023-09-01",
"C003", "Hepatitis B", "2021-12-15"
)
Join the patient_details
and
vaccination_records
datasets. You will need to use the
by
argument because the patient identifier columns have
different names.
‣ In real-world, datasets do not match perfectly. Won’t always have Alice, Bob and Charlie in both datasets.
‣ Not every row in one dataset has a corresponding row in the other
‣ Different types of joins handle these imperfect matches:
left_join()
, right_join()
,
inner_join()
, full_join()
left_join()
‣ left_join()
retains all records from the left
dataframe
‣ Even if there is no match in the right dataframe
‣ Let’s revisit demographic
dataset and a modified
test_info
dataset
‣ Removing Charlie
, adding a new patient
Xavier
to test_info
dataset
# Create and display modified test_info dataset
test_info_xavier <- tribble(
~name, ~test_date, ~result,
"Alice", "2023-06-05", "Negative",
"Bob", "2023-08-10", "Positive",
"Xavier", "2023-05-02", "Negative")
test_info_xavier
‣ Now, perform a left_join()
:
# Perform a left join with `demographic` as left dataframe
left_join(x = demographic, y = test_info_xavier)
## Joining with `by = join_by(name)`
‣ Charlie
is retained with NA
values for
test info
‣ Xavier
is discarded
‣ In left_join(x = demographic, y = test_info_xavier)
,
all records from the left dataframe
(demographic
) are retained.
‣ Visual representation of how left_join()
works
‣ What happens when we switch the left and right datasets?
# Perform a left join with test_info_xavier as left dataframe
left_join(test_info_xavier, demographic)
## Joining with `by = join_by(name)`
‣ Now, Xavier’s data is included and Charlie’s data is excluded.
left_join()
retains all rows from the left dataset,
test_info_xavier
Primary Dataset: This is the “main” or “prioritized” dataset in a join. In a left join, the left dataset is the primary dataset.
Left Join Diagnoses and Demographics
Try out the following. Below are two datasets - one with disease
diagnoses (disease_dx
) and another with patient
demographics (patient_demographics
).
disease_dx <- tribble(
~patient_id, ~disease, ~date_of_diagnosis,
1, "Influenza", "2023-01-15",
4, "COVID-19", "2023-03-05",
8, "Influenza", "2023-02-20",
)
patient_demographics <- tribble(
~patient_id, ~name, ~age, ~gender,
1, "Fred", 28, "Female",
2, "Genevieve", 45, "Female",
3, "Henry", 32, "Male",
5, "Irene", 55, "Female",
8, "Jules", 40, "Male"
)
Use left_join()
to merge these datasets, keeping only
patients for whom we have demographic information. Think carefully about
which dataset to put on the left.
## Joining with `by = join_by(patient_id)`
‣ In this example, we’ll analyze TB incidence and government health expenditure in 47 African countries
‣ Data on TB incidence rate per 100,000 people from the World Health Organization (WHO)
## Rows: 47 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, conf_int_95
## dbl (1): cases
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
‣ Data on health expenditure per capita from countries around the world, also from the WHO
## Rows: 185 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (1): expend_usd
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
‣ Imagine you wanted to see how TB incidence varies with health expenditure in African countries? Which dataset should be on the left?
‣ Use tb_2019_africa
as the left
dataframe in the join to ensure all African countries are
included in the final dataset
## Joining with `by = join_by(country)`
‣ 47 rows retained for the 47 African countries.
‣ Next, check for any countries in tb_2019_africa
that
did not have a match in health_exp_2019
‣ Mauritius, South Sudan, and Comoros did not have expenditure data, but are still in the joined dataset
‣ Confirm these countries are absent from the expenditure dataset
‣ Will have to leave these countries out of the analysis.
Left Join TB Cases and Continents
Copy the code below to define two datasets.
The first, tb_cases_children
contains the number of TB
cases in under 15s in 2012, by country:
tb_cases_children <- tidyr::who %>%
filter(year == 2012) %>%
transmute(country, tb_cases_smear_0_14 = new_sp_m014 + new_sp_f014)
tb_cases_children
And country_continents
, from the {countrycode} package,
lists all countries and their corresponding region and continent:
country_continents <-
countrycode::codelist %>%
select(country.name.en, continent, region)
country_continents
Your goal is to add the continent and region data to the TB cases dataset.
Which dataset should be the left dataframe, x
? And which
should be the right, y
? Once you’ve decided, join the
datasets appropriately using left_join()
.
right_join()
‣ A right_join()
is like a mirror image of a
left_join()
‣ Retains all rows from the RIGHT dataset
‣ Example using the demographic
and
test_info_xavier
datasets
‣ Now, let’s use right_join()
with
demographic
as the right dataframe
## Joining with `by = join_by(name)`
‣ All rows from demographic
are kept (Alice, Bob and
Charlie)
‣ Only matching records in test_info_xavier
are kept
‣ Right join prioritizes the dataset on the right,
demographic
‣ The image below illustrates the right_join()
process
‣ The same final dataframe can be created with either
left_join()
or right_join()
‣ It depends on the order of the data frames
## Joining with `by = join_by(name)`
## Joining with `by = join_by(name)`
‣ Column order may vary between left_join()
and
right_join()
‣ Columns can be rearranged, so no need to worry about their order
‣ Data scientists typically favor left_join()
over
right_join()
‣ Clearer logic and less error-prone. Primary dataset
(x
) comes FIRST in the function.
‣ No practice question here. Just ignore right_join()
.
By the time you need it, you’ll know what to do.
‣ Moving on from left_join()
and
right_join()
, let’s explore inner_join()
and
full_join()
.
inner_join()
‣ inner_join
keeps rows that are common to
both dataframes.
‣ Let’s revisit our example with patients and their COVID test results.
‣ Think about what the result would be using
inner_join()
.
‣ Only Alice
and Bob
are in both
datasets.
## Joining with `by = join_by(name)`
‣ Charlie
was only in demographic
,
Xavier
was only in test_info
, so they are
removed.
‣ The order of datasets in inner_join()
does
not affect the result.
## Joining with `by = join_by(name)`
Inner Join Pathogens
The following data is on foodborne-outbreaks in the US in 2019, from the CDC. Copy the code below to create two new dataframes:
total_inf <- tribble(
~pathogen, ~total_infections,
"Campylobacter", 9751,
"Listeria", 136,
"Salmonella", 8285,
"Shigella", 2478,
)
outcomes <- tribble(
~pathogen, ~n_hosp, ~n_deaths,
"Listeria", 128, 30,
"STEC", 582, 11,
"Campylobacter", 1938, 42,
"Yersinia", 200, 5,
)
Which pathogens are common between both datasets? Use an
inner_join()
to join the dataframes, in order to keep only
the pathogens that feature in both datasets.
## Joining with `by = join_by(pathogen)`
‣ Now, let’s return to our health expenditure and TB incidence rates data.
‣ Create a new dataframe, inner_exp_tb
, to retain only
countries present in both datasets.
## Joining with `by = join_by(country)`
‣ inner_join()
is a commonly used join, but remember it
can exclude a lot of data.
‣ Next, we will explore full_join()
, the most inclusive
join.
Inner Join One Row
The code chunk below filters the health_exp_2019
dataset
to the 70 countries with the highest spending:
Use an inner_join()
to join this
highest_exp
dataset with the African TB incidence dataset,
tb_2019_africa
.
If you do this correctly, there will be just one row returned. Why?
## Joining with `by = join_by(country)`
full_join()
‣ full_join()
retains all records from both
datasets.
‣ If there are missing matches between the datasets, the function
fills in with NA
.
‣ Let’s demonstrate this with the demographic
and
test_info_xavier
datasets.
## Joining with `by = join_by(name)`
‣ All rows are kept, preventing information loss.
‣ The order of datasets affects the order of columns, but not the retained information.
## Joining with `by = join_by(name)`
‣ Again, all data is retained with missing values filled as
NA
.
PRACTICE TIME !
Full Join Malaria Data
The following dataframes contain global malaria incidence rates per 100’000 people and global death rates per 100’000 people from malaria, from Our World in Data. Copy the code to create two small dataframes:
malaria_inc <- tribble(
~year, ~inc_100k,
2010, 69.485344,
2011, 66.507935,
2014, 59.831020,
2016, 58.704540,
2017, 59.151703,
)
malaria_deaths <- tribble(
~year, ~deaths_100k,
2011, 12.92,
2013, 11.00,
2015, 10.11,
2016, 9.40,
2019, 8.95
)
Then, join the above tables using a full_join()
in order
to retain all information from the two datasets.
## Joining with `by = join_by(year)`
‣ Now, let’s revisit the TB dataset and health expenditure dataset.
‣ Create a new dataframe, full_tb_health
, using a
full_join
.
## Joining with `by = join_by(country)`
‣ All rows are kept, with NA
for missing values.
‣ Venn diagrams of Left, Right, Inner and Full join:
You understand how each of the different dplyr
joins
work: left, right, inner and full.
You’re able to choose the appropriate join for your data
dplyr
## Joining with `by = join_by(patient_id)`
## Joining with `by = join_by(patient_id)`
## Joining with `by = join_by(country)`
There is only one country in common between the two datasets.
The following team members contributed to this lesson: