Now that we have a solid grasp on the different types of joins and how they work, we can look at how to manage messier and more complex datasets. Joining real-world data from different sources often requires a bit of thought and cleaning ahead of time.
You know how to check for mismatched values between dataframes
You understand how to join using a one-to-many match
You know how to join on multiple key columns
‣ Load the packages needed for this lesson using the code provided below:
‣ Often, data from different sources require pre-cleaning before joining.
‣ Reasons include: - Spelling errors - Differences in capitalization - Extra spaces
‣ To join values successfully, they must match perfectly in R.
‣ Let’s use our mock patient data from the first lesson.
‣ Notice the different name formats in the demographic
and test_info
datasets.
## # A tibble: 3 × 2
## name age
## <chr> <dbl>
## 1 Alice 25
## 2 Bob 32
## 3 Charlie 45
test_info <- tribble(
~name, ~test_date, ~result,
"alice", "2023-06-05", "Negative",
"Bob", "2023-08-10", "Positive",
"charlie","2023-05-02", "Negative",
)
test_info
## # A tibble: 3 × 3
## name test_date result
## <chr> <chr> <chr>
## 1 alice 2023-06-05 Negative
## 2 Bob 2023-08-10 Positive
## 3 charlie 2023-05-02 Negative
‣ Now let’s join the two datasets.
## # A tibble: 3 × 4
## name age test_date result
## <chr> <dbl> <chr> <chr>
## 1 Alice 25 <NA> <NA>
## 2 Bob 32 2023-08-10 Positive
## 3 Charlie 45 <NA> <NA>
## # A tibble: 1 × 4
## name age test_date result
## <chr> <dbl> <chr> <chr>
## 1 Bob 32 2023-08-10 Positive
‣ The joins are not perfect due to the case differences in names.
‣ Solution: Convert all names to title case using
str_to_title()
.
test_info_title <- test_info %>%
mutate(name = str_to_title(name)) #converts to title case
test_info_title
## # A tibble: 3 × 3
## name test_date result
## <chr> <chr> <chr>
## 1 Alice 2023-06-05 Negative
## 2 Bob 2023-08-10 Positive
## 3 Charlie 2023-05-02 Negative
## # A tibble: 3 × 4
## name age test_date result
## <chr> <dbl> <chr> <chr>
## 1 Alice 25 2023-06-05 Negative
## 2 Bob 32 2023-08-10 Positive
## 3 Charlie 45 2023-05-02 Negative
## # A tibble: 3 × 4
## name age test_date result
## <chr> <dbl> <chr> <chr>
## 1 Alice 25 2023-06-05 Negative
## 2 Bob 32 2023-08-10 Positive
## 3 Charlie 45 2023-05-02 Negative
PRACTICE TIME!
(NOTE: Answers are at the bottom of the page. Try to answer the questions yourself before checking.)
The following two datasets contain data for India, Indonesia, and the
Philippines. However an inner_join()
of these datasets
produces no output. What are the differences between the values in the
key columns that would have to be changed before joining the
datasets?
df1 <- tribble(
~Country, ~Capital,
"India", "New Delhi",
"Indonesia", "Jakarta",
"Philippines", "Manila"
)
df2 <- tribble(
~Country, ~Population, ~Life_Expectancy,
"India ", 1393000000, 69.7,
"indonesia", 273500000, 71.7,
"Philipines", 113000000, 72.7
)
df2 <- df2 %>%
mutate(Country = str_trim(Country))
df2 <- df2 %>%
mutate(Country = str_trim(Country) %>%
str_to_title() %>%
str_replace("Philipines", "Philippines"))
inner_join(df1, df2, by = "Country")
## # A tibble: 3 × 4
## Country Capital Population Life_Expectancy
## <chr> <chr> <dbl> <dbl>
## 1 India New Delhi 1393000000 69.7
## 2 Indonesia Jakarta 273500000 71.7
## 3 Philippines Manila 113000000 72.7
‣ Working with small datasets makes it easy to spot key discrepancies
‣ But, how about dealing with larger datasets?
‣ Let’s explore this with two real-world datasets on TB in India
‣ The first dataset: TB notifications in 2022 for all 36 Indian states and Union Territories
‣ Source: Government of India Tuberculosis Report
## Rows: 72 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): state, hc_type
## dbl (1): tb_notif_count
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tb_notifs_public <- tb_notifs %>%
filter(hc_type == "public") %>% #we want only public systems for now
select(-hc_type)
tb_notifs_public
## # A tibble: 36 × 2
## state tb_notif_count
## <chr> <dbl>
## 1 Andaman & Nicobar Islands 510
## 2 Andhra Pradesh 62075
## 3 Arunachal Pradesh 2722
## 4 Assam 36801
## 5 Bihar 79008
## 6 Chandigarh 5664
## 7 Chhattisgarh 26801
## 8 Dadra and Nagar Haveli and Daman and Diu 1294
## 9 Delhi 76966
## 10 Goa 1614
## # ℹ 26 more rows
‣ The second dataset: COVID screening among TB cases for 36 Indian states
‣ Also taken from the same TB Report
## Rows: 72 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): state, hc_type
## dbl (1): tb_covid_pos
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
covid_screening_public <- covid_screening %>%
filter(hc_type == "public") %>% #we want only public systems for now
select(-hc_type)
covid_screening_public
## # A tibble: 36 × 2
## state tb_covid_pos
## <chr> <dbl>
## 1 Andaman & Nicobar Islands 0
## 2 Andhra Pradesh 97
## 3 ArunachalPradesh 0
## 4 Assam 31
## 5 Bihar 78
## 6 Chandigarh 8
## 7 Chhattisgarh 57
## 8 Dadra & Nagar Haveli and Daman & Diu 1
## 9 Delhi 44
## 10 Goa 12
## # ℹ 26 more rows
‣ Objective: Join these datasets to calculate the percentage of TB patients in each state who tested positive for COVID-19
‣ Let’s attempt an inner_join()
:
## Joining with `by = join_by(state)`
## # A tibble: 32 × 3
## state tb_notif_count tb_covid_pos
## <chr> <dbl> <dbl>
## 1 Andaman & Nicobar Islands 510 0
## 2 Andhra Pradesh 62075 97
## 3 Assam 36801 31
## 4 Bihar 79008 78
## 5 Chandigarh 5664 8
## 6 Chhattisgarh 26801 57
## 7 Delhi 76966 44
## 8 Goa 1614 12
## 9 Gujarat 100949 105
## 10 Haryana 51231 44
## # ℹ 22 more rows
‣ Next, perform the percentage calculation:
## # A tibble: 32 × 4
## state tb_notif_count tb_covid_pos pct_covid_pos
## <chr> <dbl> <dbl> <dbl>
## 1 Andaman & Nicobar Islands 510 0 0
## 2 Andhra Pradesh 62075 97 0.156
## 3 Assam 36801 31 0.0842
## 4 Bihar 79008 78 0.0987
## 5 Chandigarh 5664 8 0.141
## 6 Chhattisgarh 26801 57 0.213
## 7 Delhi 76966 44 0.0572
## 8 Goa 1614 12 0.743
## 9 Gujarat 100949 105 0.104
## 10 Haryana 51231 44 0.0859
## # ℹ 22 more rows
‣ Observation: We now only have 32 rows instead of 36. Why?
‣ There are “key typos” causing mismatches during the join
‣ Key Typos: Spelling/formatting inconsistencies in key columns across datasets
‣ Example: One dataset lists “New Delhi” while the other lists “Delhi”
‣ These inconsistencies prevent proper matching and result in data loss
VOCAB TIME !
‣ “Key”: Column(s) used to match rows across datasets in a join
‣ “Key Typos”: Spelling or formatting inconsistencies in key columns across datasets
setdiff()
‣ We want to identify key typos in our data
‣ For this, we can use the setdiff()
function
‣ Let’s start by comparing the state
values from two
dataframes: tb_notifs_public
and
covid_screening_public
## [1] "Arunachal Pradesh"
## [2] "Dadra and Nagar Haveli and Daman and Diu"
## [3] "Tamil Nadu"
## [4] "Tripura"
‣ By putting the tb_notifs_public
dataset first, we
ask:
‣ “Which values are in tb_notifs_public
but not
in covid_screening_public
?”
‣ We should also check the reverse order:
‣ “Which values are in covid_screening_public
but
not in tb_notifs_public
?”
## [1] "ArunachalPradesh"
## [2] "Dadra & Nagar Haveli and Daman & Diu"
## [3] "tamil nadu"
## [4] "Tri pura"
‣ We found values in covid_screening_public
that have
spelling errors or are written differently than in
tb_notifs_public
‣ Let’s clean up covid_screening_public
using
case_when()
covid_screening_public_clean <- covid_screening_public %>%
mutate(state =
case_when(state == "ArunachalPradesh" ~ "Arunachal Pradesh", state == "tamil nadu" ~ "Tamil Nadu",
state == "Tri pura" ~ "Tripura",
state == "Dadra & Nagar Haveli and Daman & Diu" ~ "Dadra and Nagar Haveli and Daman and Diu",
TRUE ~ state))
setdiff(tb_notifs_public$state, covid_screening_public_clean$state)
## character(0)
## character(0)
‣ Now, we have no differences in the region’s names
‣ We can join our datasets:
## Joining with `by = join_by(state)`
antijoin()
‣ The anti_join()
function in {dplyr} is another way to
identify discrepancies
‣ It returns rows from the first dataframe where the key values don’t match the second dataframe
‣ Let’s find unmatched state
values in
tb_notifs_public
compared to
covid_screening_public
## Joining with `by = join_by(state)`
‣ And vice versa, for values in covid_screening_public
but not in tb_notifs_public
:
## Joining with `by = join_by(state)`
‣ This method provides more context for discrepancies
‣ After identifying, fix the errors with mutate()
and
proceed with the join
PRACTICE TIME !
The following dataframe, also taken from the TB Report, contains information on the number of pediatric TB cases and the number of pediatric patients initiated on treatment.
child <- read_csv(here("data/child_TB_india_modified.csv"))
child_public <- child %>%
filter(hc_type == "public") %>%
select(-hc_type)
child_public
set_diff()
or anti_join()
compare
the key values from the child_public
dataframe with those
from the tb_notifs_public
dataframe, which was defined
previouslychild_public
dataframe to ensure that the values match.## [1] "Arunachal Pradesh" "Jammu & Kashmir" "Kerala"
## [4] "Puducherry"
## [1] "ArunachalPradesh" "Jammu and Kashmir" "kerala"
## [4] "Pondicherry"
child_public_clean <- child_public %>%
mutate(state =
case_when(state == "ArunachalPradesh" ~ "Arunachal Pradesh",
state == "Jammu and Kashmir" ~ "Jammu & Kashmir",
state == "kerala" ~ "Kerala",
state == "Pondicherry" ~ "Puducherry",
TRUE ~ state))
setdiff(tb_notifs_public$state, child_public_clean$state)
## character(0)
## character(0)
## Joining with `by = join_by(state)`
tb_child_public %>%
mutate(prop_tb = 100 * tb_child_notifs/tb_notif_count) %>%
arrange(-prop_tb) %>%
head(2)
## # A tibble: 2 × 4
## state tb_notif_count tb_child_notifs prop_tb
## <chr> <dbl> <dbl> <dbl>
## 1 Delhi 76966 7867 10.2
## 2 Arunachal Pradesh 2722 256 9.40
‣ Key typos and formatting inconsistencies can hinder successful joins between datasets.
‣ Let’s explore a more complex scenario involving the
covid_screening_public
dataset.
## # A tibble: 36 × 2
## state tb_covid_pos
## <chr> <dbl>
## 1 Andaman & Nicobar Islands 0
## 2 Andhra Pradesh 97
## 3 ArunachalPradesh 0
## 4 Assam 31
## 5 Bihar 78
## 6 Chandigarh 8
## 7 Chhattisgarh 57
## 8 Dadra & Nagar Haveli and Daman & Diu 1
## 9 Delhi 44
## 10 Goa 12
## # ℹ 26 more rows
‣ Our goal is to enrich this dataset with zoning
information from the regions
dataset.
## Rows: 32 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): zonal_council, subdivision_category, state_UT
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 32 × 3
## zonal_council subdivision_category state_UT
## <chr> <chr> <chr>
## 1 No Zonal Council Union Territory Andaman & Nicobar Islands
## 2 North Eastern Council State Arunachal Pradesh
## 3 North Eastern Council State Assam
## 4 Eastern Zonal Council State Bihar
## 5 Northern Zonal Council Union Territory Chandigarh
## 6 Western Zonal Council Union Territory Dadra and Nagar Haveli and Daman…
## 7 Northern Zonal Council Union Territory Delhi
## 8 Western Zonal Council State Goa
## 9 Western Zonal Council State Gujarat
## 10 Northern Zonal Council State Haryana
## # ℹ 22 more rows
‣ Columns in regions
include zonal_council
,
subdivision_category
, and state_UT
.
‣ We’ll use a left join to merge without losing rows
from covid_screening_public
.
covid_regions <- left_join(covid_screening_public,
regions,
by = c("state" = "state_UT"))
covid_regions
## # A tibble: 36 × 4
## state tb_covid_pos zonal_council subdivision_category
## <chr> <dbl> <chr> <chr>
## 1 Andaman & Nicobar Islands 0 No Zonal Cou… Union Territory
## 2 Andhra Pradesh 97 <NA> <NA>
## 3 ArunachalPradesh 0 <NA> <NA>
## 4 Assam 31 North Easter… State
## 5 Bihar 78 Eastern Zona… State
## 6 Chandigarh 8 Northern Zon… Union Territory
## 7 Chhattisgarh 57 <NA> <NA>
## 8 Dadra & Nagar Haveli and Dam… 1 <NA> <NA>
## 9 Delhi 44 Northern Zon… Union Territory
## 10 Goa 12 Western Zona… State
## # ℹ 26 more rows
‣ After the join, some entries are missing zoning information.
## # A tibble: 7 × 4
## state tb_covid_pos zonal_council subdivision_category
## <chr> <dbl> <chr> <chr>
## 1 Andhra Pradesh 97 <NA> <NA>
## 2 ArunachalPradesh 0 <NA> <NA>
## 3 Chhattisgarh 57 <NA> <NA>
## 4 Dadra & Nagar Haveli and Dama… 1 <NA> <NA>
## 5 Ladakh 0 <NA> <NA>
## 6 tamil nadu 178 <NA> <NA>
## 7 Tri pura 2 <NA> <NA>
‣ To understand why, we’ll investigate using
anti_join()
.
## # A tibble: 3 × 3
## zonal_council subdivision_category state_UT
## <chr> <chr> <chr>
## 1 North Eastern Council State Arunachal Pradesh
## 2 Western Zonal Council Union Territory Dadra and Nagar Haveli and Daman a…
## 3 North Eastern Council State Tripura
‣ 3 states are present in regions
but absent in
covid_screening_public
.
‣ Now, let’s reverse the check.
## # A tibble: 7 × 2
## state tb_covid_pos
## <chr> <dbl>
## 1 Andhra Pradesh 97
## 2 ArunachalPradesh 0
## 3 Chhattisgarh 57
## 4 Dadra & Nagar Haveli and Daman & Diu 1
## 5 Ladakh 0
## 6 tamil nadu 178
## 7 Tri pura 2
‣ Some mismatches are due to key typos, while others
are absent from the regions
dataset.
‣ To correct typos, we’ll apply similar fixes as in a previous example.
# Correct state typos:
covid_screening_public_fixed <- covid_screening_public %>%
mutate(state =
case_when(state == "ArunachalPradesh" ~ "Arunachal Pradesh",
state == "Tri pura" ~ "Tripura",
state == "Dadra & Nagar Haveli and Daman & Diu" ~ "Dadra and Nagar Haveli and Daman and Diu",
TRUE ~ state))
‣ After applying the fixes, we perform another left join.
covid_regions_joined_fixed <- left_join(covid_screening_public_fixed,
regions,
by = c("state" = "state_UT"))
covid_regions_joined_fixed
## # A tibble: 36 × 4
## state tb_covid_pos zonal_council subdivision_category
## <chr> <dbl> <chr> <chr>
## 1 Andaman & Nicobar Islands 0 No Zonal Cou… Union Territory
## 2 Andhra Pradesh 97 <NA> <NA>
## 3 Arunachal Pradesh 0 North Easter… State
## 4 Assam 31 North Easter… State
## 5 Bihar 78 Eastern Zona… State
## 6 Chandigarh 8 Northern Zon… Union Territory
## 7 Chhattisgarh 57 <NA> <NA>
## 8 Dadra and Nagar Haveli and D… 1 Western Zona… Union Territory
## 9 Delhi 44 Northern Zon… Union Territory
## 10 Goa 12 Western Zona… State
## # ℹ 26 more rows
‣ Check for entries still missing zoning information.
# Check for missing zonal council information again:
covid_regions_joined_fixed %>%
filter(is.na(zonal_council))
## # A tibble: 4 × 4
## state tb_covid_pos zonal_council subdivision_category
## <chr> <dbl> <chr> <chr>
## 1 Andhra Pradesh 97 <NA> <NA>
## 2 Chhattisgarh 57 <NA> <NA>
## 3 Ladakh 0 <NA> <NA>
## 4 tamil nadu 178 <NA> <NA>
‣ Some regions were not included in the regions
dataset.
‣ This example highlights the challenges of ensuring no data loss during joins.
REMEMBER!
‣ Correcting typographical errors for successful joins is a complex task.
‣ Fuzzy matching may be necessary for imperfect string comparisons.
‣ Explore the {fuzzyjoin}
package in R for
solutions.
Run the code bellow to define two datasets.
The first, top_tb_cases_kids
records the top 20
countries with the highest incidence of tuberculosis (TB) in children
for the year 2012:
top_tb_cases_kids <- tidyr::who %>%
filter(year == 2012) %>%
transmute(country, iso3, tb_cases_smear_0_14 = new_sp_m014 + new_sp_f014) %>%
arrange(desc(tb_cases_smear_0_14)) %>%
head(20)
top_tb_cases_kids
And country_regions
lists countries along with their
respective regions and continents:
country_regions <- countrycode::codelist %>%
select(country_name = iso.name.en, iso3c, region) %>%
filter(complete.cases(country_name, region))
country_regions
Your task is to augment the TB cases data with the region and continent information without losing any relevant data.
left_join
of top_tb_cases_kids
with country_regions
with the country names as the key.
Identify which five countries fail to match correctly.top_tb_country_regions <- left_join(top_tb_cases_kids, country_regions, by = c("country" = "country_name"))
top_tb_country_regions %>%
filter(is.na(region))
top_tb_cases_kids
using case_when
to rectify
mismatches:top_tb_cases_kids_fixed <- top_tb_cases_kids %>%
mutate(country = case_when(
country == "Democratic Republic of the Congo" ~ "Congo, Democratic Republic of the",
country == "Philippines" ~ "Philippines (the)",
country == "Democratic People's Republic of Korea" ~ "Korea, Democratic People's Republic of",
country == "United Republic of Tanzania" ~ "Tanzania, United Republic of",
country == "Cote d'Ivoire" ~ "Côte d'Ivoire",
TRUE ~ country
))
top_tb_cases_kids_fixed
## # A tibble: 20 × 3
## country iso3 tb_cases_smear_0_14
## <chr> <chr> <dbl>
## 1 India IND 12957
## 2 Pakistan PAK 3947
## 3 Congo, Democratic Republic of the COD 3138
## 4 South Africa ZAF 2677
## 5 Indonesia IDN 1703
## 6 Nigeria NGA 1187
## 7 China CHN 1091
## 8 Philippines (the) PHL 1049
## 9 Kenya KEN 996
## 10 Angola AGO 982
## 11 Bangladesh BGD 966
## 12 Uganda UGA 636
## 13 Afghanistan AFG 588
## 14 Brazil BRA 580
## 15 Korea, Democratic People's Republic of PRK 520
## 16 Tanzania, United Republic of TZA 493
## 17 Nepal NPL 460
## 18 Madagascar MDG 419
## 19 Côte d'Ivoire CIV 367
## 20 Myanmar MMR 338
Now attempt the join again using the revised dataset.
left_join
, but this time use the
three-letter ISO code as the key. Do those initial five countries now
align properly?ISO codes improve accuracy, interoperability, and efficiency when working with country information, making them a best practice for managing international data.
‣ We’ve mainly looked at one-to-one joins.
‣ But what about one-to-many joins?
‣ Here, an observation in one dataframe corresponds to multiple observations in the other.
‣ To illustrate, let’s return to our patients and their COVID test data.
‣ Imagine Alice
and Xavier
got tested
multiple times for COVID.
test_info_many <- tribble(
~name, ~test_date, ~result,
"Alice", "2023-06-05", "Negative",
"Alice", "2023-06-10", "Positive",
"Bob", "2023-08-10", "Positive",
"Xavier", "2023-05-02", "Negative",
"Xavier", "2023-05-12", "Negative",
)
‣ Let’s see what happens when we use a left_join()
with
demographic
as the dataset to the left of the call:
## Joining with `by = join_by(name)`
## # A tibble: 4 × 4
## name age test_date result
## <chr> <dbl> <chr> <chr>
## 1 Alice 25 2023-06-05 Negative
## 2 Alice 25 2023-06-10 Positive
## 3 Bob 32 2023-08-10 Positive
## 4 Charlie 45 <NA> <NA>
‣ Here’s what happened:
‣ Alice
was retained.
‣ But she featured twice in the right dataset, so her demographic information was duplicated in the final dataset.
‣ Xavier
was dropped entirely.
‣ When performing a one-to-many join, the data from the “one” side is duplicated for each matching row of the “many” side.
Copy the code below to create two small dataframes:
patient_info <- tribble(
~patient_id, ~name, ~age,
1, "Liam", 32,
2, "Manny", 28,
3, "Nico", 40
)
conditions <- tribble(
~patient_id, ~disease,
1, "Diabetes",
1, "Hypertension",
2, "Asthma",
3, "High Cholesterol",
3, "Arthritis"
)
If you use a left_join()
to join these datasets, how
many rows will be in the final dataframe? Try to figure it out and then
perform the join to see if you were right!
## Joining with `by = join_by(patient_id)`
## # A tibble: 5 × 4
## patient_id name age disease
## <dbl> <chr> <dbl> <chr>
## 1 1 Liam 32 Diabetes
## 2 1 Liam 32 Hypertension
## 3 2 Manny 28 Asthma
## 4 3 Nico 40 High Cholesterol
## 5 3 Nico 40 Arthritis
‣ Explore the tb_notifs
dataset
## # A tibble: 72 × 3
## state hc_type tb_notif_count
## <chr> <chr> <dbl>
## 1 Andaman & Nicobar Islands public 510
## 2 Andaman & Nicobar Islands private 24
## 3 Andhra Pradesh public 62075
## 4 Andhra Pradesh private 30112
## 5 Arunachal Pradesh public 2722
## 6 Arunachal Pradesh private 141
## 7 Assam public 36801
## 8 Assam private 11021
## 9 Bihar public 79008
## 10 Bihar private 82157
## # ℹ 62 more rows
‣ Note: Two rows per state, for public and private health facilities
‣ Second dataset: regions
dataset, containing Indian
state and Union Territories
## Rows: 36 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): zonal_council, subdivision_category, state_UT
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 36 × 3
## zonal_council subdivision_category state_UT
## <chr> <chr> <chr>
## 1 No Zonal Council Union Territory Andaman & Nicobar Islands
## 2 North Eastern Council State Arunachal Pradesh
## 3 North Eastern Council State Assam
## 4 Eastern Zonal Council State Bihar
## 5 Northern Zonal Council Union Territory Chandigarh
## 6 Western Zonal Council Union Territory Dadra and Nagar Haveli and Daman…
## 7 Northern Zonal Council Union Territory Delhi
## 8 Western Zonal Council State Goa
## 9 Western Zonal Council State Gujarat
## 10 Northern Zonal Council State Haryana
## # ℹ 26 more rows
‣ Let’s try joining the datasets:
## # A tibble: 72 × 5
## state hc_type tb_notif_count zonal_council subdivision_category
## <chr> <chr> <dbl> <chr> <chr>
## 1 Andaman & Nicobar … public 510 No Zonal Cou… Union Territory
## 2 Andaman & Nicobar … private 24 No Zonal Cou… Union Territory
## 3 Andhra Pradesh public 62075 <NA> <NA>
## 4 Andhra Pradesh private 30112 <NA> <NA>
## 5 Arunachal Pradesh public 2722 North Easter… State
## 6 Arunachal Pradesh private 141 North Easter… State
## 7 Assam public 36801 North Easter… State
## 8 Assam private 11021 North Easter… State
## 9 Bihar public 79008 Eastern Zona… State
## 10 Bihar private 82157 Eastern Zona… State
## # ℹ 62 more rows
‣ Data from the regions
dataframe was duplicated
‣ Sometimes we have more than one column that uniquely identifies the observations that we want to match on.
‣ Consider systolic blood pressure measures for patients before and after a drug trial
blood_pressure <- tribble(
~name, ~time_point, ~systolic,
"Alice", "pre", 139,
"Alice", "post", 121,
"Bob", "pre", 137,
"Bob", "post", 128,
"Charlie", "pre", 137,
"Charlie", "post", 130 )
‣ Another dataset contains serum creatinine levels for the same patients and time points
kidney <- tribble(
~name, ~time_point, ~creatinine,
"Alice", "pre", 0.84,
"Alice", "post", 1.03,
"Bob", "pre", 0.87,
"Bob", "post", 1.21,
"Charlie", "pre", 0.88,
"Charlie", "post", 1.25 )
‣ Our goal: join two datasets so each patient has two rows, one for levels before the drug and one for levels after.
‣ First instinct: join on the patient’s name.
‣ Let’s try this out:
## Warning in left_join(., kidney, by = "name"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 12 × 5
## name time_point.x systolic time_point.y creatinine
## <chr> <chr> <dbl> <chr> <dbl>
## 1 Alice pre 139 pre 0.84
## 2 Alice pre 139 post 1.03
## 3 Alice post 121 pre 0.84
## 4 Alice post 121 post 1.03
## 5 Bob pre 137 pre 0.87
## 6 Bob pre 137 post 1.21
## 7 Bob post 128 pre 0.87
## 8 Bob post 128 post 1.21
## 9 Charlie pre 137 pre 0.88
## 10 Charlie pre 137 post 1.25
## 11 Charlie post 130 pre 0.88
## 12 Charlie post 130 post 1.25
‣ Result: Duplicated rows, leading to four rows per person.
‣ “Many-to-many” relationship: A scenario we generally want to avoid!
‣ We also see two time_point
columns differentiated by
.x
and .y
.
‣ Instead, we should match on BOTH name
and
time_point
.
‣ Use c()
function to specify both column names.
## # A tibble: 6 × 4
## name time_point systolic creatinine
## <chr> <chr> <dbl> <dbl>
## 1 Alice pre 139 0.84
## 2 Alice post 121 1.03
## 3 Bob pre 137 0.87
## 4 Bob post 128 1.21
## 5 Charlie pre 137 0.88
## 6 Charlie post 130 1.25
‣ This gives us the desired outcome!
‣ Now, let’s apply this to our tb_notifs
and
covid_screening
datasets.
## # A tibble: 72 × 3
## state hc_type tb_notif_count
## <chr> <chr> <dbl>
## 1 Andaman & Nicobar Islands public 510
## 2 Andaman & Nicobar Islands private 24
## 3 Andhra Pradesh public 62075
## 4 Andhra Pradesh private 30112
## 5 Arunachal Pradesh public 2722
## 6 Arunachal Pradesh private 141
## 7 Assam public 36801
## 8 Assam private 11021
## 9 Bihar public 79008
## 10 Bihar private 82157
## # ℹ 62 more rows
## # A tibble: 72 × 3
## state hc_type tb_covid_pos
## <chr> <chr> <dbl>
## 1 Andaman & Nicobar Islands public 0
## 2 Andaman & Nicobar Islands private 0
## 3 Andhra Pradesh public 97
## 4 Andhra Pradesh private 17
## 5 ArunachalPradesh public 0
## 6 ArunachalPradesh private 0
## 7 Assam public 31
## 8 Assam private 16
## 9 Bihar public 78
## 10 Bihar private 53
## # ℹ 62 more rows
‣ Final dataframe goal: Two rows for each state, one for public and one for private sector data.
‣ Match on both state
and hc_type
using
c()
in the by=
statement.
## # A tibble: 72 × 4
## state hc_type tb_notif_count tb_covid_pos
## <chr> <chr> <dbl> <dbl>
## 1 Andaman & Nicobar Islands public 510 0
## 2 Andaman & Nicobar Islands private 24 0
## 3 Andhra Pradesh public 62075 97
## 4 Andhra Pradesh private 30112 17
## 5 Arunachal Pradesh public 2722 NA
## 6 Arunachal Pradesh private 141 NA
## 7 Assam public 36801 31
## 8 Assam private 11021 16
## 9 Bihar public 79008 78
## 10 Bihar private 82157 53
## # ℹ 62 more rows
‣ Success! We got the exact structure we wanted.
PRACTICE TIME !
In this practice, you will join three datasets:
notif_covid
, child
, and regions
.
Follow the steps below to ensure that no data is lost.
Follow these steps and fill in the blanked code fragments to complete the joining process.
notif_covid
and
child
using anti_join()
.child
by creating
child_fixed
. Use case_when()
to update state
names.# Clean mismatches
child_fixed <- child %>%
mutate(state = case_when(
state == "ArunachalPradesh" ~ "Arunachal Pradesh",
state == "Jammu and Kashmir" ~ "Jammu & Kashmir",
state == "kerala" ~ "Kerala",
state == "Pondicherry" ~ "Puducherry",
TRUE ~ state
))
anti_join()
. These should return empty dataframes.# Verify mismatches resolved
anti_join(child_fixed, notif_covid)
anti_join(notif_covid, child_fixed)
notif_covid
and child_fixed
using
left_join()
on state
and hc_type
.
Assign the result to join_1
.# Join notif_covid and child_fixed
join_1 <- notif_covid %>%
left_join(child_fixed, by = c("state", "hc_type"))
join_1
and
regions
using anti_join()
.# Check mismatches
anti_join(join_1, regions, by = c("state" = "state_UT"))
anti_join(regions, join_1, by = c("state_UT" = "state"))
You may notice that some states, e.g. Ladakh, are missing from the
regions
dataset. There is nothing we can do about this for
the moment, so we will proceed with the join.
left_join()
on
state
from join_1
and state_UT
from regions
. Assign the result to
final_join
.The question said to avoid losing information, so you may be
wondering why we used a left_join()
of a
full_join()
? The outputs are actually the same in this
case. Remember that left_join()
keeps all rows from the
left dataset. The missing states we commented in step 5 are present in
the left, join_1
dataset, so they are not lost in the final
join. And when you can choose between left_join()
and
full_join()
, it is better to use left_join()
as it is easier for your audience to reason about what your code is
doing.
final_join
dataset.In this lesson, we delved into the intricacies of data cleaning before a join, focusing on how to detect and correct mismatches or inconsistencies in key columns. We also highlighted the impact of one-to-many relationships in joining dataframes, showing how data from the “one” side is duplicated for each matching row of the “many” side. Finally, we demonstrated how to join dataframes using multiple key columns.
As we conclude this lesson, we hope that you have gained a deeper understanding of the importance and utility of joining dataframes in R.
df2_fixed <- df2 %>%
mutate(Country =
case_match(Country,
"India " ~ "India", # Remove blank space at end
"indonesia" ~ "Indonesia", # Capitalize
"Philipines" ~ "Philippines", # Fix spelling
.default=Country))
inner_join(df1, df2_fixed)
## Joining with `by = join_by(Country)`
## # A tibble: 3 × 4
## Country Capital Population Life_Expectancy
## <chr> <chr> <dbl> <dbl>
## 1 India New Delhi 1393000000 69.7
## 2 Indonesia Jakarta 273500000 71.7
## 3 Philippines Manila 113000000 72.7
## [1] "ArunachalPradesh" "Jammu and Kashmir" "kerala"
## [4] "Pondicherry"
## [1] "Arunachal Pradesh" "Jammu & Kashmir" "Kerala"
## [4] "Puducherry"
## Joining with `by = join_by(state)`
## # A tibble: 4 × 2
## state tb_child_notifs
## <chr> <dbl>
## 1 ArunachalPradesh 256
## 2 Jammu and Kashmir 511
## 3 kerala 480
## 4 Pondicherry 101
## Joining with `by = join_by(state)`
## # A tibble: 4 × 2
## state tb_notif_count
## <chr> <dbl>
## 1 Arunachal Pradesh 2722
## 2 Jammu & Kashmir 10022
## 3 Kerala 16766
## 4 Puducherry 3732
child_public_fixed <- child_public %>%
mutate(state =
case_when(state == "ArunachalPradesh" ~ "Arunachal Pradesh",
state == "Jammu and Kashmir" ~ "Jammu & Kashmir",
state == "kerala" ~ "Kerala",
state == "Pondicherry" ~ "Puducherry",
TRUE ~ state))
## Joining with `by = join_by(state)`
## [1] "Democratic Republic of the Congo"
## [2] "Philippines"
## [3] "Democratic People's Republic of Korea"
## [4] "United Republic of Tanzania"
## [5] "Cote d'Ivoire"
ISO codes are standardized - there in only one way of writing them. This makes it useful for joining.
## Joining with `by = join_by(patient_id)`
notif_covid
and
child
using anti_join()
.## Joining with `by = join_by(state, hc_type)`
## # A tibble: 8 × 3
## state hc_type tb_child_notifs
## <chr> <chr> <dbl>
## 1 ArunachalPradesh public 256
## 2 ArunachalPradesh private 19
## 3 Jammu and Kashmir public 511
## 4 Jammu and Kashmir private 89
## 5 kerala public 480
## 6 kerala private 409
## 7 Pondicherry public 101
## 8 Pondicherry private 6
## Joining with `by = join_by(state, hc_type)`
## # A tibble: 8 × 4
## state hc_type tb_notif_count tb_covid_pos
## <chr> <chr> <dbl> <dbl>
## 1 Arunachal Pradesh public 2722 NA
## 2 Arunachal Pradesh private 141 NA
## 3 Jammu & Kashmir public 10022 9
## 4 Jammu & Kashmir private 1782 1
## 5 Kerala public 16766 322
## 6 Kerala private 6622 37
## 7 Puducherry public 3732 16
## 8 Puducherry private 103 0
child
by creating
child_fixed
. Use case_when()
to update state
names.# Clean mismatches
child_fixed <- child %>%
mutate(state = case_when(
state == "ArunachalPradesh" ~ "Arunachal Pradesh",
state == "Jammu and Kashmir" ~ "Jammu & Kashmir",
state == "kerala" ~ "Kerala",
state == "Pondicherry" ~ "Puducherry",
TRUE ~ state
))
anti_join()
.## Joining with `by = join_by(state, hc_type)`
## # A tibble: 0 × 3
## # ℹ 3 variables: state <chr>, hc_type <chr>, tb_child_notifs <dbl>
## Joining with `by = join_by(state, hc_type)`
## # A tibble: 0 × 4
## # ℹ 4 variables: state <chr>, hc_type <chr>, tb_notif_count <dbl>,
## # tb_covid_pos <dbl>
notif_covid
and child_fixed
using
left_join()
on state
and hc_type
.
Assign the result to join_1
.# Join notif_covid and child_fixed
join_1 <- notif_covid %>%
left_join(child_fixed, by = c("state", "hc_type"))
join_1
and
regions
using anti_join()
.## # A tibble: 8 × 5
## state hc_type tb_notif_count tb_covid_pos tb_child_notifs
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Andhra Pradesh public 62075 97 1347
## 2 Andhra Pradesh private 30112 17 1333
## 3 Chhattisgarh public 26801 57 935
## 4 Chhattisgarh private 11720 8 974
## 5 Ladakh public 311 0 8
## 6 Ladakh private 9 0 0
## 7 Tamil Nadu public 71896 NA 1765
## 8 Tamil Nadu private 21983 NA 1651
## # A tibble: 0 × 3
## # ℹ 3 variables: zonal_council <chr>, subdivision_category <chr>,
## # state_UT <chr>
You may notice that some states, e.g. Ladakh, which are present in
join_1
are missing from the regions
dataset.
There is nothing we can do about this for the moment, so we will proceed
with the join.
left_join()
on
state
from join_1
and state_UT
from regions
. Assign the result to
final_join
.The question said to avoid losing information, so you may be
wondering why we used a left_join()
of a
full_join()
? The outputs are actually the same in this
case. Remember that left_join()
keeps all rows from the
left dataset. The missing states we commented in step 5 are present in
the left, join_1
dataset, so they are not lost in the final
join. And when you can choose between left_join()
and
full_join()
, it is better to use left_join()
as it is easier for your audience to reason about what your code is
doing.
final_join
dataset.## # A tibble: 72 × 7
## state hc_type tb_notif_count tb_covid_pos tb_child_notifs zonal_council
## <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 Andaman & … public 510 0 18 No Zonal Cou…
## 2 Andaman & … private 24 0 1 No Zonal Cou…
## 3 Andhra Pra… public 62075 97 1347 <NA>
## 4 Andhra Pra… private 30112 17 1333 <NA>
## 5 Arunachal … public 2722 NA 256 North Easter…
## 6 Arunachal … private 141 NA 19 North Easter…
## 7 Assam public 36801 31 992 North Easter…
## 8 Assam private 11021 16 433 North Easter…
## 9 Bihar public 79008 78 4434 Eastern Zona…
## 10 Bihar private 82157 53 10778 Eastern Zona…
## # ℹ 62 more rows
## # ℹ 1 more variable: subdivision_category <chr>
The following team members contributed to this lesson: