In the previous lesson, we learned a range of functions for diagnosing data issues. Now, let’s focus on some common techniques and functions for fixing those issues. Let’s get started!
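By the end of this lesson, you will be able to:
‣ Clean column names, both automatically and manually.
‣ Eliminate duplicate entries.
‣ Correct and fix string values in your data.
‣ Convert data types as required.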
Load the following packages for this lesson:
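The package-loading code itself is not shown here. A minimal sketch, assuming the lesson relies on {janitor} plus the tidyverse packages {dplyr}, {stringr} and {readr} (this list is inferred from the functions used below, not taken from the original):

```r
# Assumed package list, inferred from the functions used in this lesson
library(dplyr)    # rename_with(), distinct(), mutate(), across(), case_match()
library(stringr)  # str_replace_all(), str_to_title()
library(readr)    # read_csv()
library(janitor)  # clean_names(), get_dupes()
```

Note that case_match() requires dplyr 1.1.0 or later.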
‣ We will work with a modified version of the dataset from the first Data Cleaning lesson.
‣ More errors have been added for cleaning purposes.
## Rows: 1420 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Sex, Age_35, EDUCATION_OF_PATIENT, OCCUPATION_OF_PATIENT, Civil...s...
## dbl (9): patient_id, District, Health unit, Age at ART initiation, WHO statu...
## lgl (1): NA
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1,420 × 15
## patient_id District `Health unit` Sex Age_35 `Age at ART initiation`
## <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 10037 1 1 Male over 35 36
## 2 10537 1 1 F over 35 40
## 3 5489 2 3 F Under 35 34.1
## 4 5523 2 3 Male Under 35 28.1
## 5 4942 2 3 F over 35 46.9
## 6 4742 2 3 Male over 35 37.5
## 7 10879 1 1 Male over 35 49.2
## 8 2885 2 3 Male over 35 43.2
## 9 4861 2 3 F over 35 50.9
## 10 5180 2 3 Male over 35 36.1
## # ℹ 1,410 more rows
## # ℹ 9 more variables: EDUCATION_OF_PATIENT <chr>, OCCUPATION_OF_PATIENT <chr>,
## # Civil...status <chr>, `WHO status at ART initiaiton` <dbl>,
## # BMI_Initiation_Art <dbl>, CD4_Initiation_Art <dbl>, regimen.1 <dbl>,
## # Nr_of_pills_day <dbl>, `NA` <lgl>
‣ Column names should be clean and standardized for ease of use and readability.
‣ Ideal column names should be short, have no spaces or periods, no unusual characters, and similar style.
‣ Use the names() function from base R to check the column names of our non_adherence dataset.
## [1] "patient_id" "District"
## [3] "Health unit" "Sex"
## [5] "Age_35" "Age at ART initiation"
## [7] "EDUCATION_OF_PATIENT" "OCCUPATION_OF_PATIENT"
## [9] "Civil...status" "WHO status at ART initiaiton"
## [11] "BMI_Initiation_Art" "CD4_Initiation_Art"
## [13] "regimen.1" "Nr_of_pills_day"
## [15] "NA"
‣ Some names have spaces, special characters, or are not uniformly cased.
janitor::clean_names()
‣ Use janitor::clean_names() to standardize column names.
## [1] "patient_id" "district"
## [3] "health_unit" "sex"
## [5] "age_35" "age_at_art_initiation"
## [7] "education_of_patient" "occupation_of_patient"
## [9] "civil_status" "who_status_at_art_initiaiton"
## [11] "bmi_initiation_art" "cd4_initiation_art"
## [13] "regimen_1" "nr_of_pills_day"
## [15] "na"
‣ Observe changes like upper case to lower case, spaces to underscores, and periods replaced.
‣ Let’s save this cleaned dataset as non_adherence_clean.
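As a small, self-contained illustration of the same step, the sketch below applies clean_names() to a toy data frame. The data frame `messy` and its values are hypothetical; only the column names are borrowed from the output above.

```r
library(janitor)

# Toy data frame (hypothetical values) that reuses a few of the messy names above
messy <- data.frame(
  `Health unit` = c(1, 3),
  `Civil...status` = c("Widowed", "Stable Union"),
  EDUCATION_OF_PATIENT = c("Secondary", NA),
  `regimen.1` = c(3, 6),
  check.names = FALSE  # keep the messy names exactly as typed
)

# Standardize the names: lower case, underscores, no periods or spaces
tidy <- clean_names(messy)
names(tidy)
```

Assigning the result to a new object, as with non_adherence_clean, keeps the original names available for comparison.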
(NOTE: Answers are at the bottom of the page. Try to answer the questions yourself before checking.)
The following dataset has been adapted from a study that used retrospective data to characterize the temporal and spatial dynamics of typhoid fever epidemics in Kasese, Uganda.
Use the clean_names() function from janitor to clean the variable names in the typhoid dataset.
## Rows: 215 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): Householdmembers, Positioninthehousehold, Watersourcedwithinhouseh...
## dbl (11): UniqueKey, CaseorControl, Age, Sex, Levelofeducation, Below10years...
## lgl (2): NA, NAN
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] "unique_key" "caseor_control"
## [3] "age" "sex"
## [5] "levelofeducation" "householdmembers"
## [7] "below10years" "n1119years"
## [9] "n2035years" "n3644years"
## [11] "n4565years" "above65years"
## [13] "positioninthehousehold" "watersourcedwithinhousehold"
## [15] "borehole" "river"
## [17] "tap" "rainwatertank"
## [19] "unprotectedspring" "protectedspring"
## [21] "pond" "shallowwell"
## [23] "stream" "jerrycan"
## [25] "bucket" "county"
## [27] "subcounty" "parish"
## [29] "village" "na"
## [31] "nan"
dplyr::rename_with() for Renaming Columns
‣ rename_with() from dplyr applies a function to all column names, or to a selection of them. It is sometimes easier to use than rename().
‣ Example: Convert all column names to upper case with rename_with(.fn = toupper), or simply rename_with(toupper) inside a pipe.
## # A tibble: 1,420 × 15
## PATIENT_ID DISTRICT `HEALTH UNIT` SEX AGE_35 `AGE AT ART INITIATION`
## <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 10037 1 1 Male over 35 36
## 2 10537 1 1 F over 35 40
## 3 5489 2 3 F Under 35 34.1
## 4 5523 2 3 Male Under 35 28.1
## 5 4942 2 3 F over 35 46.9
## 6 4742 2 3 Male over 35 37.5
## 7 10879 1 1 Male over 35 49.2
## 8 2885 2 3 Male over 35 43.2
## 9 4861 2 3 F over 35 50.9
## 10 5180 2 3 Male over 35 36.1
## # ℹ 1,410 more rows
## # ℹ 9 more variables: EDUCATION_OF_PATIENT <chr>, OCCUPATION_OF_PATIENT <chr>,
## # CIVIL...STATUS <chr>, `WHO STATUS AT ART INITIAITON` <dbl>,
## # BMI_INITIATION_ART <dbl>, CD4_INITIATION_ART <dbl>, REGIMEN.1 <dbl>,
## # NR_OF_PILLS_DAY <dbl>, `NA` <lgl>
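The same call can be tried on a toy data frame. This is a minimal sketch: `df` and its values are hypothetical, and the second call uses the `.cols` argument of rename_with() to restrict the renaming to one column.

```r
library(dplyr)

# Toy data (hypothetical values)
df <- data.frame(patient_id = c(1, 2), sex = c("F", "Male"))

# Apply toupper() to every column name
df_upper <- df %>% rename_with(toupper)
names(df_upper)

# Restrict the function to selected columns with .cols
df_partial <- df %>% rename_with(toupper, .cols = sex)
names(df_partial)
```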
‣ Another task: In the non_adherence dataset, remove _of_patient from column names for simplicity.
‣ Use stringr::str_replace_all() within rename_with() for this task.
‣ str_replace_all() syntax: str_replace_all(string, pattern, replacement).
test_string <- "this is a test test string" # replace test with new
str_replace_all(string = test_string, pattern = "test", replacement = "new")
## [1] "this is a new new string"
‣ Apply str_replace_all() to remove _of_patient in column names of non_adherence_clean.
non_adherence_clean_2 <- non_adherence_clean %>%
  # Remove "_of_patient" from the two columns that contain it
  rename_with(
    .cols = c(occupation_of_patient, education_of_patient),
    .fn = ~ str_replace_all(.x, "_of_patient", "")
  )
Remember, creating many intermediate objects like non_adherence_clean and non_adherence_clean_2 is for tutorial clarity. In practice, combine multiple cleaning steps in a single pipe chain:
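A minimal sketch of such a chain, using a toy data frame in place of the lesson's dataset (the object `raw` and its values are hypothetical):

```r
library(dplyr)
library(stringr)
library(janitor)

# Toy data (hypothetical values) with messy column names
raw <- data.frame(
  `Patient ID` = c(1, 2),
  OCCUPATION_OF_PATIENT = c("professor", "Worker"),
  check.names = FALSE
)

# Both cleaning steps in one chain, with no intermediate objects
cleaned <- raw %>%
  clean_names() %>%
  rename_with(~ str_replace_all(.x, "_of_patient", ""))

names(cleaned)
```

Each step receives the result of the previous one, so the order matters: clean_names() must run first for the lower-case pattern "_of_patient" to match.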
Standardize the column names in the typhoid dataset with clean_names(), then:
‣ replace or_ with _
‣ replace of with _
typhoid %>%
  clean_names() %>%
  # The regex "or_|of" matches either pattern in any selected column name
  rename_with(.cols = c(caseor_control, levelofeducation), .fn = ~ str_replace_all(.x, "or_|of", "_"))
## # A tibble: 215 × 31
## unique_key case_control age sex level_education householdmembers
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 0 29 0 2 01-May
## 2 2 0 31 1 1 9
## 3 3 1 21 0 1 12
## 4 4 0 47 1 0 7
## 5 5 0 39 1 1 7
## 6 6 1 46 1 0 9
## 7 7 0 58 0 1 01-May
## 8 8 0 48 0 1 7
## 9 9 1 21 1 3 10
## 10 10 0 38 1 0 7
## # ℹ 205 more rows
## # ℹ 25 more variables: below10years <dbl>, n1119years <dbl>, n2035years <dbl>,
## # n3644years <dbl>, n4565years <dbl>, above65years <dbl>,
## # positioninthehousehold <chr>, watersourcedwithinhousehold <chr>,
## # borehole <chr>, river <chr>, tap <chr>, rainwatertank <chr>,
## # unprotectedspring <chr>, protectedspring <chr>, pond <chr>,
## # shallowwell <chr>, stream <chr>, jerrycan <chr>, bucket <chr>, …
‣ Duplicated rows in datasets can be due to multiple data sources or survey responses.
‣ It’s essential to identify and remove these duplicates for accurate analysis.
‣ Use janitor::get_dupes() to identify duplicate rows. This allows for visual inspection before removal.
## No variable names specified - using all columns.
## # A tibble: 11 × 16
## patient_id district health_unit sex age_35 age_at_art_initiation education
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 NA NA NA <NA> <NA> NA <NA>
## 2 NA NA NA <NA> <NA> NA <NA>
## 3 NA NA NA <NA> <NA> NA <NA>
## 4 2412 1 1 F Under … 27.1 <NA>
## 5 2412 1 1 F Under … 27.1 <NA>
## 6 3576 2 3 Male Under … 28.4 <NA>
## 7 3576 2 3 Male Under … 28.4 <NA>
## 8 4208 1 1 F Under … 31.7 Primary
## 9 4208 1 1 F Under … 31.7 Primary
## 10 4692 2 3 F over 35 54.2 <NA>
## 11 4692 2 3 F over 35 54.2 <NA>
## # ℹ 9 more variables: occupation <chr>, civil_status <chr>,
## # who_status_at_art_initiaiton <dbl>, bmi_initiation_art <dbl>,
## # cd4_initiation_art <dbl>, regimen_1 <dbl>, nr_of_pills_day <dbl>, na <lgl>,
## # dupe_count <int>
‣ After identifying them, use dplyr::distinct() to remove duplicates, keeping only the unique rows.
## [1] 1420
# Removing duplicates
non_adherence_distinct <-
non_adherence_clean_2 %>%
distinct()
# After removal
nrow(non_adherence_distinct)
## [1] 1414
‣ Re-check for duplicates with get_dupes() to ensure all have been removed.
## No variable names specified - using all columns.
## No duplicate combinations found of: patient_id, district, health_unit, sex, age_35, age_at_art_initiation, education, occupation, civil_status, ... and 6 other variables
## # A tibble: 0 × 16
## # ℹ 16 variables: patient_id <dbl>, district <dbl>, health_unit <dbl>,
## # sex <chr>, age_35 <chr>, age_at_art_initiation <dbl>, education <chr>,
## # occupation <chr>, civil_status <chr>, who_status_at_art_initiaiton <dbl>,
## # bmi_initiation_art <dbl>, cd4_initiation_art <dbl>, regimen_1 <dbl>,
## # nr_of_pills_day <dbl>, na <lgl>, dupe_count <int>
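The identify, remove, re-check sequence above can be reproduced on a toy data frame. This is a sketch under assumed data: `visits` and its values are hypothetical.

```r
library(dplyr)
library(janitor)

# Toy data: patient 2412 is entered twice (hypothetical values)
visits <- data.frame(
  patient_id = c(2412, 2412, 3576),
  sex = c("F", "F", "Male")
)

# Inspect the duplicated rows first; dupe_count reports how often each row occurs
dupes <- get_dupes(visits)

# Keep one copy of each row
visits_distinct <- distinct(visits)
nrow(visits_distinct)

# Re-check: an empty result means no duplicates remain
remaining <- get_dupes(visits_distinct)
```

Called without column names, get_dupes() compares whole rows. Passing column names, for example get_dupes(visits, patient_id), would instead flag rows that share only those values.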
Identify the duplicates in the typhoid dataset using get_dupes(), then remove them using distinct().
## [1] 215
## No variable names specified - using all columns.
## # A tibble: 18 × 32
## UniqueKey CaseorControl Age Sex Levelofeducation Householdmembers
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 23 0 23 1 1 01-May
## 2 23 0 23 1 1 01-May
## 3 23 0 23 1 1 01-May
## 4 23 0 23 1 1 01-May
## 5 56 0 24 1 0 01-May
## 6 56 0 24 1 0 01-May
## 7 56 0 24 1 0 01-May
## 8 56 0 24 1 0 01-May
## 9 78 1 36 1 1 7
## 10 78 1 36 1 1 7
## 11 78 1 36 1 1 7
## 12 78 1 36 1 1 7
## 13 100 0 50 0 1 7
## 14 100 0 50 0 1 7
## 15 100 0 50 0 1 7
## 16 100 0 50 0 1 7
## 17 NA NA NA NA NA <NA>
## 18 NA NA NA NA NA <NA>
## # ℹ 26 more variables: Below10years <dbl>, N1119years <dbl>, N2035years <dbl>,
## # N3644years <dbl>, N4565years <dbl>, Above65years <dbl>,
## # Positioninthehousehold <chr>, Watersourcedwithinhousehold <chr>,
## # Borehole <chr>, River <chr>, Tap <chr>, Rainwatertank <chr>,
## # Unprotectedspring <chr>, Protectedspring <chr>, Pond <chr>,
## # Shallowwell <chr>, Stream <chr>, Jerrycan <chr>, Bucket <chr>,
## # County <chr>, Subcounty <chr>, Parish <chr>, Village <chr>, `NA` <lgl>, …
# Remove duplicated rows
typhoid_distinct <- typhoid %>%
distinct()
# Number of rows after removing duplicates
nrow(typhoid_distinct)
## [1] 202
‣ We observed inconsistent capitalization in string characters, like Professor and professor, in the occupation variable.
‣ To address this, we can transform character columns to a specific case. Here, we’ll use title case, which is preferable for graphics and reports.
non_adherence_case_corrected <-
  non_adherence_distinct %>%
  # Convert the listed character columns to title case
  mutate(across(.cols = c(sex, age_35, education, occupation, civil_status), .fns = str_to_title))
# check the values of age_35 and occupation
non_adherence_distinct %>%
count(age_35)
## # A tibble: 3 × 2
## age_35 n
## <chr> <int>
## 1 Under 35 976
## 2 over 35 437
## 3 <NA> 1
## # A tibble: 51 × 2
## occupation n
## <chr> <int>
## 1 Professor 35
## 2 professor 11
## 3 Accountant 1
## 4 Administrator 1
## 5 Agriculture technician 3
## 6 Artist 1
## 7 Basic service agent 2
## 8 Boat captain 1
## 9 Business 3
## 10 Commercial 18
## # ℹ 41 more rows
## # A tibble: 3 × 2
## age_35 n
## <chr> <int>
## 1 Over 35 437
## 2 Under 35 976
## 3 <NA> 1
non_adherence_case_corrected %>%
count(occupation) %>%
arrange(-(str_detect(occupation, "rofessor")))
## # A tibble: 49 × 2
## occupation n
## <chr> <int>
## 1 Professor 46
## 2 Accountant 1
## 3 Administrator 1
## 4 Agriculture Technician 3
## 5 Artist 1
## 6 Bartender 1
## 7 Basic Service Agent 2
## 8 Boat Captain 1
## 9 Business 3
## 10 Commercial 18
## # ℹ 39 more rows
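The effect of str_to_title() inside across() is easy to see on a toy data frame. A minimal sketch with hypothetical values (the object `jobs` is not part of the lesson's data):

```r
library(dplyr)
library(stringr)

# Toy data with inconsistent capitalization (hypothetical values)
jobs <- data.frame(
  occupation = c("Professor", "professor", "boat captain", NA),
  age_35 = c("over 35", "Under 35", "over 35", NA),
  stringsAsFactors = FALSE
)

# Apply str_to_title() to both character columns
jobs_title <- jobs %>%
  mutate(across(.cols = c(occupation, age_35), .fns = str_to_title))

# "Professor" and "professor" now collapse into a single category
count(jobs_title, occupation)
```

Missing values stay missing, so the conversion does not create new categories.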
Transform all the strings in the typhoid dataset to lowercase.
## # A tibble: 202 × 31
## UniqueKey CaseorControl Age Sex Levelofeducation Householdmembers
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 0 29 0 2 01-May
## 2 2 0 31 1 1 9
## 3 3 1 21 0 1 12
## 4 4 0 47 1 0 7
## 5 5 0 39 1 1 7
## 6 6 1 46 1 0 9
## 7 7 0 58 0 1 01-May
## 8 8 0 48 0 1 7
## 9 9 1 21 1 3 10
## 10 10 0 38 1 0 7
## # ℹ 192 more rows
## # ℹ 25 more variables: Below10years <dbl>, N1119years <dbl>, N2035years <dbl>,
## # N3644years <dbl>, N4565years <dbl>, Above65years <dbl>,
## # Positioninthehousehold <chr>, Watersourcedwithinhousehold <chr>,
## # Borehole <chr>, River <chr>, Tap <chr>, Rainwatertank <chr>,
## # Unprotectedspring <chr>, Protectedspring <chr>, Pond <chr>,
## # Shallowwell <chr>, Stream <chr>, Jerrycan <chr>, Bucket <chr>, …
dplyr::case_match() for String Cleaning
‣ We will explore the case_match() function from the {dplyr} package for string cleaning.
‣ case_match() allows for specifying conditions and values to be applied to a vector.
‣ Here is an example using case_match():
test_vector <- c("+", "-", "NA", "missing")
case_match(test_vector,
"+" ~ "positive",
"-" ~ "negative",
.default = "unknown") # + to positive, - to negative, default as unknown
‣ The function takes a vector and a series of conditions. The optional .default argument sets the value used for inputs that match no condition.
‣ Let’s apply case_match() to the sex column in the non_adherence_distinct dataset.
‣ First, observe the levels in this variable:
## # A tibble: 3 × 2
## sex n
## <chr> <int>
## 1 F 1084
## 2 Male 329
## 3 <NA> 1
‣ Inconsistencies in the sex column coding can be fixed using case_match(). Let’s change F to Female:
# case match F to Female, with default as is
non_adherence_distinct %>%
mutate(sex = case_match(sex, "F" ~ "Female", .default = sex))
## # A tibble: 1,414 × 15
## patient_id district health_unit sex age_35 age_at_art_initiation education
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 10037 1 1 Male over … 36 <NA>
## 2 10537 1 1 Female over … 40 Secondary
## 3 5489 2 3 Female Under… 34.1 <NA>
## 4 5523 2 3 Male Under… 28.1 <NA>
## 5 4942 2 3 Female over … 46.9 Universi…
## 6 4742 2 3 Male over … 37.5 Technical
## 7 10879 1 1 Male over … 49.2 Technical
## 8 2885 2 3 Male over … 43.2 Technical
## 9 4861 2 3 Female over … 50.9 Technical
## 10 5180 2 3 Male over … 36.1 Technical
## # ℹ 1,404 more rows
## # ℹ 8 more variables: occupation <chr>, civil_status <chr>,
## # who_status_at_art_initiaiton <dbl>, bmi_initiation_art <dbl>,
## # cd4_initiation_art <dbl>, regimen_1 <dbl>, nr_of_pills_day <dbl>, na <lgl>
‣ This function is useful for multiple value changes, like in the occupation column.
‣ Modifications to be made:
  - “Worker” to “Laborer”
  - “Housewife” to “Homemaker”
  - “Truck Driver” and “Taxi Driver” to “Driver”
non_adherence_recoded <-
  non_adherence_case_corrected %>%
  mutate(sex = case_match(sex, "F" ~ "Female", .default = sex)) %>%
  # Worker to Laborer, Housewife to Homemaker, Truck Driver and Taxi Driver to Driver
  mutate(occupation = case_match(occupation,
    "Worker" ~ "Laborer",
    "Housewife" ~ "Homemaker",
    "Truck Driver" ~ "Driver",
    "Taxi Driver" ~ "Driver",
    .default = occupation))
non_adherence_recoded
## # A tibble: 1,414 × 15
## patient_id district health_unit sex age_35 age_at_art_initiation education
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 10037 1 1 Male Over … 36 <NA>
## 2 10537 1 1 Female Over … 40 Secondary
## 3 5489 2 3 Female Under… 34.1 <NA>
## 4 5523 2 3 Male Under… 28.1 <NA>
## 5 4942 2 3 Female Over … 46.9 Universi…
## 6 4742 2 3 Male Over … 37.5 Technical
## 7 10879 1 1 Male Over … 49.2 Technical
## 8 2885 2 3 Male Over … 43.2 Technical
## 9 4861 2 3 Female Over … 50.9 Technical
## 10 5180 2 3 Male Over … 36.1 Technical
## # ℹ 1,404 more rows
## # ℹ 8 more variables: occupation <chr>, civil_status <chr>,
## # who_status_at_art_initiaiton <dbl>, bmi_initiation_art <dbl>,
## # cd4_initiation_art <dbl>, regimen_1 <dbl>, nr_of_pills_day <dbl>, na <lgl>
Remember to use .default = column_name in case_match(). Without it, unmatched values become NA.
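The difference is easiest to see side by side. A minimal sketch on toy vectors (the values are hypothetical); the last call also shows that several old values can be listed on the left-hand side of one condition:

```r
library(dplyr)

sex <- c("F", "Male", "F", NA)

# Without .default, every value that matches no condition becomes NA
without_default <- case_match(sex, "F" ~ "Female")

# With .default = sex, unmatched values keep their original value
with_default <- case_match(sex, "F" ~ "Female", .default = sex)

# Several old values can share one replacement
occupation <- c("Truck Driver", "Taxi Driver", "Professor")
recoded <- case_match(
  occupation,
  c("Truck Driver", "Taxi Driver") ~ "Driver",
  .default = occupation
)
```

In `without_default`, "Male" is lost and becomes NA, which is why the .default argument matters when only some values need recoding.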
The variable householdmembers from the typhoid dataset should represent the number of individuals in a household. There is a value 01-May in this variable. Recode this value to 1-5.
typhoid %>%
mutate(Householdmembers = case_match(Householdmembers,
"01-May" ~ "1-5", .default = Householdmembers))
## # A tibble: 215 × 31
## UniqueKey CaseorControl Age Sex Levelofeducation Householdmembers
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 0 29 0 2 1-5
## 2 2 0 31 1 1 9
## 3 3 1 21 0 1 12
## 4 4 0 47 1 0 7
## 5 5 0 39 1 1 7
## 6 6 1 46 1 0 9
## 7 7 0 58 0 1 1-5
## 8 8 0 48 0 1 7
## 9 9 1 21 1 3 10
## 10 10 0 38 1 0 7
## # ℹ 205 more rows
## # ℹ 25 more variables: Below10years <dbl>, N1119years <dbl>, N2035years <dbl>,
## # N3644years <dbl>, N4565years <dbl>, Above65years <dbl>,
## # Positioninthehousehold <chr>, Watersourcedwithinhousehold <chr>,
## # Borehole <chr>, River <chr>, Tap <chr>, Rainwatertank <chr>,
## # Unprotectedspring <chr>, Protectedspring <chr>, Pond <chr>,
## # Shallowwell <chr>, Stream <chr>, Jerrycan <chr>, Bucket <chr>, …
‣ Understanding and correctly classifying data types is crucial for data to behave as expected.
‣ R’s 6 basic data types/classes:
  - character: strings or characters, always quoted.
  - numeric: real numbers, including decimals.
  - integer: whole numbers.
  - logical: TRUE or FALSE values.
  - factor: categorical variables.
  - Date/POSIXct: dates and times.
‣ Recall our dataset: 5 character variables, 9 numeric variables, and 1 logical variable.
## tibble [1,414 × 15] (S3: tbl_df/tbl/data.frame)
## $ patient_id : num [1:1414] 10037 10537 5489 5523 4942 ...
## $ district : num [1:1414] 1 1 2 2 2 2 1 2 2 2 ...
## $ health_unit : num [1:1414] 1 1 3 3 3 3 1 3 3 3 ...
## $ sex : chr [1:1414] "Male" "Female" "Female" "Male" ...
## $ age_35 : chr [1:1414] "Over 35" "Over 35" "Under 35" "Under 35" ...
## $ age_at_art_initiation : num [1:1414] 36 40 34.1 28.1 46.9 37.5 49.2 43.2 50.9 36.1 ...
## $ education : chr [1:1414] NA "Secondary" NA NA ...
## $ occupation : chr [1:1414] "Driver" "Laborer" "Laborer" "Laborer" ...
## $ civil_status : chr [1:1414] "Stable Union" "Stable Union" "Widowed" "Stable Union" ...
## $ who_status_at_art_initiaiton: num [1:1414] 1 1 3 1 3 2 2 2 1 1 ...
## $ bmi_initiation_art : num [1:1414] 19.4 24.7 NA NA NA ...
## $ cd4_initiation_art : num [1:1414] NA 107 NA NA NA NA 139 NA NA NA ...
## $ regimen_1 : num [1:1414] 3 6 6 6 6 3 6 3 3 6 ...
## $ nr_of_pills_day : num [1:1414] 2 1 1 1 1 2 1 2 2 1 ...
## $ na : logi [1:1414] NA NA NA NA NA NA ...
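For individual values, the six basic classes listed earlier can be checked with base R's class(). A minimal sketch with one illustrative (hypothetical) value per class:

```r
# One illustrative value per class (hypothetical values)
cls_character <- class("Male")
cls_numeric   <- class(34.1)
cls_integer   <- class(3L)       # the L suffix marks an integer literal
cls_logical   <- class(TRUE)
cls_factor    <- class(factor("Primary"))
cls_date      <- class(as.Date("2024-01-31"))
cls_datetime  <- class(as.POSIXct("2024-01-31 10:00:00", tz = "UTC"))

# Print them together for comparison
list(cls_character, cls_numeric, cls_integer, cls_logical,
     cls_factor, cls_date, cls_datetime)
```

A date-time carries two classes, POSIXct and POSIXt, so class() returns a vector of length two for it.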
‣ Looking at our data, the only true numerical variables are age_at_art_initiation, bmi_initiation_art, cd4_initiation_art, and nr_of_pills_day. Let’s change all the others to factor variables using as.factor() within across().
non_adherence_recoded %>%
mutate(across(
.cols = !c(age_at_art_initiation, bmi_initiation_art, cd4_initiation_art, nr_of_pills_day),
.fns = as.factor
))
## # A tibble: 1,414 × 15
## patient_id district health_unit sex age_35 age_at_art_initiation education
## <fct> <fct> <fct> <fct> <fct> <dbl> <fct>
## 1 10037 1 1 Male Over … 36 <NA>
## 2 10537 1 1 Female Over … 40 Secondary
## 3 5489 2 3 Female Under… 34.1 <NA>
## 4 5523 2 3 Male Under… 28.1 <NA>
## 5 4942 2 3 Female Over … 46.9 Universi…
## 6 4742 2 3 Male Over … 37.5 Technical
## 7 10879 1 1 Male Over … 49.2 Technical
## 8 2885 2 3 Male Over … 43.2 Technical
## 9 4861 2 3 Female Over … 50.9 Technical
## 10 5180 2 3 Male Over … 36.1 Technical
## # ℹ 1,404 more rows
## # ℹ 8 more variables: occupation <fct>, civil_status <fct>,
## # who_status_at_art_initiaiton <fct>, bmi_initiation_art <dbl>,
## # cd4_initiation_art <dbl>, regimen_1 <fct>, nr_of_pills_day <dbl>, na <fct>
‣ This should result in correct classification as expected.
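The same pattern, negating a column selection inside across(), can be checked on a toy data frame. A minimal sketch with hypothetical values (the object `patients` is not the lesson's dataset):

```r
library(dplyr)

# Toy data (hypothetical values): only the age column is truly numeric
patients <- data.frame(
  patient_id = c(10037, 10537, 5489),
  district = c(1, 1, 2),
  age_at_art_initiation = c(36, 40, 34.1)
)

# Convert every column except the numeric one to a factor
patients_typed <- patients %>%
  mutate(across(.cols = !age_at_art_initiation, .fns = as.factor))

# Confirm the resulting class of each column
sapply(patients_typed, class)
```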
Convert the variables in positions 13 to 29 in the typhoid dataset to factor.
## # A tibble: 215 × 31
## UniqueKey CaseorControl Age Sex Levelofeducation Householdmembers
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 0 29 0 2 01-May
## 2 2 0 31 1 1 9
## 3 3 1 21 0 1 12
## 4 4 0 47 1 0 7
## 5 5 0 39 1 1 7
## 6 6 1 46 1 0 9
## 7 7 0 58 0 1 01-May
## 8 8 0 48 0 1 7
## 9 9 1 21 1 3 10
## 10 10 0 38 1 0 7
## # ℹ 205 more rows
## # ℹ 25 more variables: Below10years <dbl>, N1119years <dbl>, N2035years <dbl>,
## # N3644years <dbl>, N4565years <dbl>, Above65years <dbl>,
## # Positioninthehousehold <fct>, Watersourcedwithinhousehold <fct>,
## # Borehole <fct>, River <fct>, Tap <fct>, Rainwatertank <fct>,
## # Unprotectedspring <fct>, Protectedspring <fct>, Pond <fct>,
## # Shallowwell <fct>, Stream <fct>, Jerrycan <fct>, Bucket <fct>, …
In this lesson, you learned how to:
‣ Clean column names, both automatically and manually.
‣ Eliminate duplicate entries.
‣ Correct and fix string values in your data.
‣ Convert data types as required.
Congratulations on completing the two-part lesson on the data cleaning pipeline! You are now better equipped to tackle the cleaning of real-world datasets.
Keep practicing!
typhoid %>%
clean_names() %>%
rename_with(.fn = ~ str_replace_all(.x, pattern = "or_|of", replacement = "_")) %>%
names()
## [1] "unique_key" "case_control"
## [3] "age" "sex"
## [5] "level_education" "householdmembers"
## [7] "below10years" "n1119years"
## [9] "n2035years" "n3644years"
## [11] "n4565years" "above65years"
## [13] "positioninthehousehold" "watersourcedwithinhousehold"
## [15] "borehole" "river"
## [17] "tap" "rainwatertank"
## [19] "unprotectedspring" "protectedspring"
## [21] "pond" "shallowwell"
## [23] "stream" "jerrycan"
## [25] "bucket" "county"
## [27] "subcounty" "parish"
## [29] "village" "na"
## [31] "nan"
## No variable names specified - using all columns.
# Remove duplicates
typhoid_distinct <- typhoid %>%
distinct()
# Ensure all distinct rows left
get_dupes(typhoid_distinct)
## No variable names specified - using all columns.
## No duplicate combinations found of: UniqueKey, CaseorControl, Age, Sex, Levelofeducation, Householdmembers, Below10years, N1119years, N2035years, ... and 22 other variables