Proficiency in string manipulation is a vital skill for data scientists. Tasks like cleaning messy data and formatting outputs rely heavily on the ability to parse, combine, and modify character strings. This lesson focuses on techniques for working with strings in R, utilizing functions from the {stringr} package in the tidyverse. Let’s dive in!
Understand the concept of strings and rules for defining them in R
Use escapes to include special characters like quotes within strings
Employ {stringr} functions to format strings: str_to_lower(), str_to_upper(), str_to_title(), str_trim(), str_squish(), str_pad(), and str_wrap()
Split strings into parts using str_split() and separate()
Combine strings together with paste() and paste0()
Extract substrings from strings using str_sub()
# Loading required packages
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, here, janitor, openxlsx, gtsummary)
‣ Character strings in R can be defined using single or double quotes.
‣ Matching starting and ending quotation marks are necessary.
‣ A string that starts and ends with double quotes cannot contain unescaped double quotes. The same applies to single quotes inside a string that starts and ends with single quotes.
will_not_work <- "Double quotes " inside double quotes"
will_not_work <- 'Single quotes ' inside single quotes'
‣ Mixing single quotes inside double quotes, and vice versa, is allowed.
‣ Use the escape character \ to include quotes within strings.
single_quote <- 'Single quotes \' inside single quotes'
double_quote <- "Double quotes \" inside double quotes"
‣ The cat() function displays strings as output.
## Single quotes ' inside single quotes
## Double quotes " inside double quotes
Since \ is the escape character, you must use \\ to include a literal backslash in a string:
## This is a backslash: \
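The escape rules above can be checked with a few calls to cat(). This is a minimal sketch: the variable names `quote_example` and `backslash` and the example sentences are our own illustrations, not objects from the lesson data.

```r
# Escaped quotes: the backslash is needed in the source code only,
# it is not stored as part of the string
quote_example <- "She said \"hello\""
cat(quote_example, "\n", sep = "")   # displays the string without the escapes

# A literal backslash must itself be escaped with a second backslash
backslash <- "This is a backslash: \\"
cat(backslash, "\n", sep = "")       # displays a single backslash at the end

# nchar() counts stored characters, so the escaped backslash counts once
n_backslash <- nchar("\\")
n_backslash
```

Printing the same strings with print() instead of cat() would show the escape characters again, which is why cat() is the better choice for checking what a string really contains.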
PRACTICE TIME!
Below are attempts to define character strings in R, with three out of five lines containing an error. Identify and correct these errors.
‣ {stringr} package helps in formatting strings for analysis and visualization.
‣ Case changes
‣ Handling whitespace
‣ Standardizing length
‣ Text wrapping
‣ Standardizing strings or preparing them for display often requires case conversion.
‣ str_to_upper() converts strings to uppercase.
## [1] "HELLO WORLD"
‣ str_to_lower() converts strings to lowercase.
## [1] "goodbye"
‣ str_to_title() capitalizes the first letter of each word. Ideal for titling names, subjects, etc.
## [1] "String Manipulation"
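The outputs above can be reproduced with calls like the following. The input strings are assumptions chosen to match the displayed results, since the original calls are not shown.

```r
library(stringr)

# Each function takes a character vector and returns one of the same length
upper <- str_to_upper("hello world")          # all letters uppercase
lower <- str_to_lower("GOODBYE")              # all letters lowercase
title <- str_to_title("string manipulation")  # first letter of each word uppercase

upper
lower
title
```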
‣ Strings can be made neat and uniform by managing whitespace.
‣ Use str_trim() to remove leading and trailing whitespace.
## [1] "trimmed"
‣ str_squish() also removes whitespace at the start and end, and reduces multiple internal spaces to one.
## [1] "too much space"
## [1] "too much space"
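To see the difference between the two functions side by side, here is a small sketch. The string `messy` is a made-up example, not taken from the lesson data.

```r
library(stringr)

# A string with extra spaces at the start, at the end, and between words
messy <- "  too   much    space  "

# str_trim() only touches the two ends; internal runs of spaces are kept
trimmed <- str_trim(messy)

# str_squish() trims the ends and collapses internal runs to a single space
squished <- str_squish(messy)

trimmed
squished
```

In practice str_squish() is usually the safer default for free-text columns, because stray double spaces inside a value also create spurious categories.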
‣ str_pad() is used to pad strings to a specified width.
‣ It helps to standardize the length of strings by adding characters.
## [1] "007"
‣ first argument: the string to pad
‣ width sets the final string width, pad specifies the padding character.
‣ side argument can be "left", "right", or "both".
‣ on the right:
# Pad the number "7" on the right to length 4 with "_"
str_pad("7", width = 4, side = "right", pad = "_")
## [1] "7___"
‣ on both sides:
# Pad the number "7" on both sides to a total width of 5 with "_"
str_pad("7", width = 5, side = "both", pad = "_")
## [1] "__7__"
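The "007" result shown earlier in this section comes from padding on the left, which is the default side. A minimal sketch, with illustrative inputs of our own:

```r
library(stringr)

# side = "left" is the default, so it can be omitted
padded_one <- str_pad("7", width = 3, pad = "0")

# str_pad() is vectorised: each element is padded to the same width
padded_many <- str_pad(c("7", "42", "123"), width = 3, pad = "0")

# Strings that are already longer than `width` are returned unchanged
too_long <- str_pad("1234", width = 3, pad = "0")

padded_one
padded_many
too_long
```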
‣ str_wrap() wraps text to fit a set width, useful for confined spaces.
example_string <- "String Manipulation with str_wrap can enhance readability in plots."
wrapped_to_10 <- str_wrap(example_string, width = 10)
wrapped_to_10
## [1] "String\nManipulation\nwith\nstr_wrap\ncan\nenhance\nreadability\nin plots."
‣ cat() displays strings with line breaks, making them readable.
## String
## Manipulation
## with
## str_wrap
## can
## enhance
## readability
## in plots.
‣ Setting the width to 1 essentially splits the string into individual words:
## String
## Manipulation
## with
## str_wrap
## can
## enhance
## readability
## in
## plots.
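The two displays above were presumably produced by passing the wrapped strings to cat(). A sketch of those calls, where `wrapped_to_1` and the two check variables at the end are names we introduce for illustration:

```r
library(stringr)

example_string <- "String Manipulation with str_wrap can enhance readability in plots."

# Wrap at 10 characters, then display with real line breaks
wrapped_to_10 <- str_wrap(example_string, width = 10)
cat(wrapped_to_10, "\n", sep = "")

# With width = 1 no two words fit on a line, so every word gets its own line
wrapped_to_1 <- str_wrap(example_string, width = 1)
cat(wrapped_to_1, "\n", sep = "")

# str_wrap() only swaps spaces for "\n"; the words themselves are unchanged
n_lines <- str_count(wrapped_to_1, "\n") + 1
round_trip <- str_replace_all(wrapped_to_10, "\n", " ")
```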
‣ Here’s an example of using str_wrap() in ggplot2 for neat titles:
long_title <- "This is an example of a very long title, which would usually run over the end of your ggplot, but you can wrap it with str_wrap to fit within a specified character limit."
# Example plot without title wrapping
ggplot(women, aes(height, weight)) +
  geom_point() +
  labs(title = long_title)
# Now, add the title wrapped at 50 characters
ggplot(women, aes(height, weight)) +
  geom_point() +
  labs(title = str_wrap(long_title, width = 50))
PRACTICE TIME!
A dataset contains patient names with inconsistent formatting and extra white spaces. Use the {stringr} package to standardize this information:
The following (fictional) drug codes are inconsistently formatted. Standardize them by padding with zeros to ensure all codes are 8 characters long:
Use str_wrap() to format the following for better readability:
instructions <- "Take two tablets daily after meals. If symptoms persist for more than three days, consult your doctor immediately. Do not take more than the recommended dose. Keep out of reach of children."
ggplot(data.frame(x = 1, y = 1), aes(x, y, label = instructions)) +
  geom_label() +
  theme_void()
# Now, wrap the instructions to a width of 50 characters then plot again.
ggplot(data.frame(x = 1, y = 1), aes(x, y, label = str_wrap(instructions, width = 50))) +
  geom_label() +
  theme_void()
‣ We’ll learn to clean and standardize data using {stringr} functions.
‣ Our focus: a dataset on HIV care in Zambézia Province, Mozambique.
‣ The dataset contains formatting inconsistencies intentionally added for learning.
# Load the messy dataset
hiv_dat_messy_1 <- openxlsx::read.xlsx(here("data/hiv_dat_messy_1.xlsx")) %>%
  as_tibble()
# Observe the formatting issues in these columns
hiv_dat_messy_1 %>%
  select(district, health_unit, education, regimen)
## # A tibble: 1,413 × 4
## district health_unit education regimen
## <chr> <chr> <chr> <chr>
## 1 "Rural" District Hospital Maganja Da Costa MISSING AZT+3TC+NVP
## 2 "Rural" District Hospital Maganja Da Costa secondary TDF+3TC+EFV
## 3 "Urban" 24th Of July Health Facility MISSING tdf+3tc+efv
## 4 "Urban" 24th Of July Health Facility MISSING TDF+3TC+EFV
## 5 " Urban" 24th Of July Health Facility University tdf+3tc+efv
## 6 "Urban" 24th Of July Health Facility Technical AZT+3TC+NVP
## 7 "Rural" District Hospital Maganja Da Costa Technical TDF+3TC+EFV
## 8 "Urban" 24th Of July Health Facility Technical azt+3tc+nvp
## 9 "Urban" 24th Of July Health Facility Technical AZT+3TC+NVP
## 10 "Urban" 24th Of July Health Facility Technical TDF+3TC+EFV
## # ℹ 1,403 more rows
‣ Use tabyl to count and identify unique values, highlighting inconsistencies.
## health_unit n percent
## 24th Of July Health Facility 239 0.16914367
## 24th Of July Health Facility 249 0.17622081
## District Hospital Maganja Da Costa 342 0.24203822
## District Hospital Maganja Da Costa 336 0.23779193
## Nante Health Facility 119 0.08421798
## Nante Health Facility 128 0.09058740
## education n percent
## MISSING 776 0.549186129
## None 128 0.090587403
## Primary 178 0.125973107
## Secondary 82 0.058032555
## Technical 17 0.012031139
## University 4 0.002830856
## primary 157 0.111111111
## secondary 71 0.050247700
## regimen n percent valid_percent
## AZT+3TC+EFV 24 0.0169851380 0.0179910045
## AZT+3TC+NVP 229 0.1620665251 0.1716641679
## D4T+3TC+ABC 1 0.0007077141 0.0007496252
## D4T+3TC+EFV 2 0.0014154282 0.0014992504
## D4T+3TC+NVP 16 0.0113234253 0.0119940030
## OTHER 1 0.0007077141 0.0007496252
## TDF+3TC+EFV 404 0.2859164897 0.3028485757
## TDF+3TC+NVP 3 0.0021231423 0.0022488756
## azt+3tc+efv 16 0.0113234253 0.0119940030
## azt+3tc+nvp 231 0.1634819533 0.1731634183
## d4t+3tc+efv 9 0.0063694268 0.0067466267
## d4t+3tc+nvp 18 0.0127388535 0.0134932534
## d4t+4tc+nvp 1 0.0007077141 0.0007496252
## d4t6+3tc+nvp 2 0.0014154282 0.0014992504
## other 2 0.0014154282 0.0014992504
## tdf+3tc+efv 374 0.2646850672 0.2803598201
## tdf+3tc+nvp 1 0.0007077141 0.0007496252
## <NA> 79 0.0559094126 NA
## district n percent
## Rural 234 0.16560510
## Urban 118 0.08351026
## Rural 691 0.48903043
## Urban 370 0.26185421
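The calls that produced these tables are not shown; they were presumably of the form `hiv_dat_messy_1 %>% tabyl(district)`. The sketch below reproduces the idea on a tiny made-up data frame (`toy` and its values are hypothetical), so that it runs without the lesson files.

```r
library(janitor)
library(stringr)

# Hypothetical column with the same kinds of problems: stray spaces and mixed case
toy <- data.frame(district = c("Rural", " Rural", "Urban", "urban ", "Rural"))

# Before cleaning, each spelling variant is counted as its own category
counts_messy <- tabyl(toy, district)
counts_messy

# After trimming and title-casing, the variants collapse into two categories
cleaned <- str_to_title(str_trim(toy$district))
counts_clean <- tabyl(cleaned)
counts_clean
```

Values that look identical in a printed table, such as the two "Rural" rows above, usually differ only by invisible whitespace.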
‣ tbl_summary from {gtsummary} visualizes casing, spacing, and format issues.
# Summarize data to view inconsistencies before cleaning
if (packageVersion("glue") < "1.8.0") install.packages("glue")
library(gtsummary)
hiv_dat_messy_1 %>%
  select(district, health_unit, education, regimen) %>%
  tbl_summary()
Characteristic | N = 1,413¹ |
---|---|
district | |
Rural | 234 (17%) |
Urban | 118 (8.4%) |
Rural | 691 (49%) |
Urban | 370 (26%) |
health_unit | |
24th Of July Health Facility | 239 (17%) |
24th Of July Health Facility | 249 (18%) |
District Hospital Maganja Da Costa | 342 (24%) |
District Hospital Maganja Da Costa | 336 (24%) |
Nante Health Facility | 119 (8.4%) |
Nante Health Facility | 128 (9.1%) |
education | |
MISSING | 776 (55%) |
None | 128 (9.1%) |
primary | 157 (11%) |
Primary | 178 (13%) |
secondary | 71 (5.0%) |
Secondary | 82 (5.8%) |
Technical | 17 (1.2%) |
University | 4 (0.3%) |
regimen | |
azt+3tc+efv | 16 (1.2%) |
AZT+3TC+EFV | 24 (1.8%) |
azt+3tc+nvp | 231 (17%) |
AZT+3TC+NVP | 229 (17%) |
D4T+3TC+ABC | 1 (<0.1%) |
d4t+3tc+efv | 9 (0.7%) |
D4T+3TC+EFV | 2 (0.1%) |
d4t+3tc+nvp | 18 (1.3%) |
D4T+3TC+NVP | 16 (1.2%) |
d4t+4tc+nvp | 1 (<0.1%) |
d4t6+3tc+nvp | 2 (0.1%) |
other | 2 (0.1%) |
OTHER | 1 (<0.1%) |
tdf+3tc+efv | 374 (28%) |
TDF+3TC+EFV | 404 (30%) |
tdf+3tc+nvp | 1 (<0.1%) |
TDF+3TC+NVP | 3 (0.2%) |
Unknown | 79 |
¹ n (%) |
‣ Next, we systematically clean each variable for consistency.
library(dplyr)
library(stringr)
# Apply cleaning functions to standardize data
hiv_dat_clean_1 <- hiv_dat_messy_1 %>%
  mutate(
    district = str_to_title(str_trim(district)), # Standardize district names
    health_unit = str_squish(health_unit),       # Remove extra spaces
    education = str_to_title(education),         # Standardize education levels
    regimen = str_to_upper(regimen)              # Regimen column consistency
  )
‣ Confirm improvements by re-running tbl_summary().
# Check the cleaned data
hiv_dat_clean_1 %>%
  select(district, health_unit, education, regimen) %>%
  tbl_summary()
Characteristic | N = 1,413¹ |
---|---|
district | |
Rural | 925 (65%) |
Urban | 488 (35%) |
health_unit | |
24th Of July Health Facility | 488 (35%) |
District Hospital Maganja Da Costa | 678 (48%) |
Nante Health Facility | 247 (17%) |
education | |
Missing | 776 (55%) |
None | 128 (9.1%) |
Primary | 335 (24%) |
Secondary | 153 (11%) |
Technical | 17 (1.2%) |
University | 4 (0.3%) |
regimen | |
AZT+3TC+EFV | 40 (3.0%) |
AZT+3TC+NVP | 460 (34%) |
D4T+3TC+ABC | 1 (<0.1%) |
D4T+3TC+EFV | 11 (0.8%) |
D4T+3TC+NVP | 34 (2.5%) |
D4T+4TC+NVP | 1 (<0.1%) |
D4T6+3TC+NVP | 2 (0.1%) |
OTHER | 3 (0.2%) |
TDF+3TC+EFV | 778 (58%) |
TDF+3TC+NVP | 4 (0.3%) |
Unknown | 79 |
¹ n (%) |
‣ Address plotting issues with ggplot due to lengthy health_unit labels.
# Use str_wrap to adjust label lengths for better plot display
hiv_dat_clean_1 %>%
  ggplot(aes(x = str_wrap(health_unit, width = 20))) +
  geom_bar()
‣ Refine the plot by correcting the axis title.
# Finalize plot adjustments
hiv_dat_clean_1 %>%
  ggplot(aes(x = str_wrap(health_unit, width = 20))) +
  geom_bar() +
  labs(x = "Health Unit")
PRACTICE TIME!
In this exercise, you will clean a dataset, lima_messy, originating from a tuberculosis (TB) treatment adherence study in Lima, Peru. More details about the study and the dataset are available here.
Begin by importing the dataset:
## # A tibble: 1,293 × 18
## id age sex marital_status poverty_level prison_history
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 pe-1008 38 and older M Single Not in pover… No
## 2 lm-1009 38 and older M Married / cohabi… Not in pover… No
## 3 pe-1010 27 to 37 m Married / cohabit… Not in pover… No
## 4 lm-1011 27 to 37 m Married / cohabit… Poverty/extr… No
## 5 pe-1012 38 and older m Married / cohabita… Not in pover… No
## 6 lm-1013 27 to 37 M Single Poverty/extr… No
## 7 pe-1014 27 To 37 m Married / cohabita… Not in pover… No
## 8 lm-1015 22 To 26 m Single Poverty/extr… Yes
## 9 pe-1016 27 to 37 m Single Not in pover… No
## 10 lm-1017 22 to 26 m Single Not in pover… No
## # ℹ 1,283 more rows
## # ℹ 12 more variables: completed_secondary_education <chr>,
## # history_of_tobacco_use <chr>, alcohol_use_at_least_once_per_week <chr>,
## # history_of_drug_use <chr>, history_of_rehab <chr>, mdr_tb <chr>,
## # body_mass_index <chr>, history_chronic_disease <chr>, hiv_status <chr>,
## # history_diabetes_melitus <chr>, treatment_outcome <chr>,
## # time_to_default_days <dbl>
Your task is to clean the marital_status, sex, and age variables in lima_messy. Following the cleaning process, generate a summary table using the tbl_summary() function. Aim for your output to align with this structure:
Characteristic | N = 1,293 |
---|---|
marital_status | |
Divorced / Separated | 93 (7.2%) |
Married / Cohabitating | 486 (38%) |
Single | 677 (52%) |
Widowed | 37 (2.9%) |
sex | |
F | 503 (39%) |
M | 790 (61%) |
age | |
21 and younger | 338 (26%) |
22 to 26 | 345 (27%) |
27 to 37 | 303 (23%) |
38 and older | 307 (24%) |
Implement the cleaning and summarize:
# Create a new object for cleaned data
lima_clean <- lima_messy_1 %>%
  mutate(
    marital_status = str_to_title(str_squish(marital_status)), # Clean marital_status
    sex = str_to_upper(sex),                                   # Clean sex
    age = str_to_lower(age)                                    # Clean age
  )
# Check cleaning
lima_clean %>%
  select(marital_status, sex, age) %>%
  tbl_summary()
Using the cleaned dataset lima_clean from the previous task, create a bar plot to display the count of participants by marital_status. Then wrap the axis labels on the x-axis to a maximum of 15 characters per line for readability.
‣ Common data manipulation tasks include splitting and combining strings.
‣ stringr::str_split() and tidyr::separate() are tidyverse functions for this purpose.
str_split()
‣ str_split() divides strings into parts.
‣ To split example_string at each hyphen:
## [[1]]
## [1] "split" "this" "string"
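The list output above can be reproduced as follows. The value of `example_string` is an assumption inferred from the three parts shown in the output.

```r
library(stringr)

# Assumed input: three words joined by hyphens
example_string <- "split-this-string"

# str_split() returns a list with one character vector per input string
parts <- str_split(example_string, pattern = "-")
parts

# Pull out the vector for the first (and only) input string
first_parts <- parts[[1]]
first_parts
```

The list structure is what makes str_split() awkward inside a data frame: each cell of the new column holds a whole vector rather than a single value.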
‣ Applying str_split() directly to a data frame is more complex.
‣ With the IRS dataset, focus on start_date_long:
irs <- read_csv(here("data/Illovo_data.csv"))
irs_dates_1 <- irs %>% select(village, start_date_long)
irs_dates_1
## # A tibble: 112 × 2
## village start_date_long
## <chr> <chr>
## 1 Mess April 07 2014
## 2 Nkombedzi April 22 2014
## 3 B Compound May 13 2014
## 4 D Compound May 13 2014
## 5 Post Office May 13 2014
## 6 Mangulenje May 15 2014
## 7 Mangulenje Senior May 27 2014
## 8 Old School May 27 2014
## 9 Mwanza May 28 2014
## 10 Alumenda June 18 2014
## # ℹ 102 more rows
‣ To extract month, day, and year from start_date_long:
## # A tibble: 112 × 3
## village start_date_long start_date_parts
## <chr> <chr> <list>
## 1 Mess April 07 2014 <chr [3]>
## 2 Nkombedzi April 22 2014 <chr [3]>
## 3 B Compound May 13 2014 <chr [3]>
## 4 D Compound May 13 2014 <chr [3]>
## 5 Post Office May 13 2014 <chr [3]>
## 6 Mangulenje May 15 2014 <chr [3]>
## 7 Mangulenje Senior May 27 2014 <chr [3]>
## 8 Old School May 27 2014 <chr [3]>
## 9 Mwanza May 28 2014 <chr [3]>
## 10 Alumenda June 18 2014 <chr [3]>
## # ℹ 102 more rows
‣ For readability, use unnest_wider():
irs_dates_1 %>%
  mutate(start_date_parts = str_split(start_date_long, " ")) %>%
  unnest_wider(start_date_parts, names_sep = "_")
## # A tibble: 112 × 5
## village start_date_long start_date_parts_1 start_date_parts_2
## <chr> <chr> <chr> <chr>
## 1 Mess April 07 2014 April 07
## 2 Nkombedzi April 22 2014 April 22
## 3 B Compound May 13 2014 May 13
## 4 D Compound May 13 2014 May 13
## 5 Post Office May 13 2014 May 13
## 6 Mangulenje May 15 2014 May 15
## 7 Mangulenje Senior May 27 2014 May 27
## 8 Old School May 27 2014 May 27
## 9 Mwanza May 28 2014 May 28
## 10 Alumenda June 18 2014 June 18
## # ℹ 102 more rows
## # ℹ 1 more variable: start_date_parts_3 <chr>
separate()
‣ separate() is more straightforward for splitting.
‣ To split into month, day, year:
## # A tibble: 112 × 4
## village month day year
## <chr> <chr> <chr> <chr>
## 1 Mess April 07 2014
## 2 Nkombedzi April 22 2014
## 3 B Compound May 13 2014
## 4 D Compound May 13 2014
## 5 Post Office May 13 2014
## 6 Mangulenje May 15 2014
## 7 Mangulenje Senior May 27 2014
## 8 Old School May 27 2014
## 9 Mwanza May 28 2014
## 10 Alumenda June 18 2014
## # ℹ 102 more rows
‣ separate() requires specifying:
the column to be split
into: names of the new columns
sep: separator character
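Put together, the three arguments look like this in a call. This is a sketch on a two-row stand-in for the IRS data; `toy_dates` is a hypothetical object built only for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical two-row version of the village / start_date_long columns
toy_dates <- tibble(
  village = c("Mess", "Mwanza"),
  start_date_long = c("April 07 2014", "May 28 2014")
)

# Column to split, names of the new columns, and the separator character
toy_split <- separate(
  toy_dates,
  start_date_long,
  into = c("month", "day", "year"),
  sep = " "
)
toy_split
```

By default the original column is dropped and the new columns are character, so "07" keeps its leading zero.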
‣ To keep the original column:
irs_dates_1 %>%
  separate(start_date_long, into = c("month", "day", "year"), sep = " ", remove = FALSE)
## # A tibble: 112 × 5
## village start_date_long month day year
## <chr> <chr> <chr> <chr> <chr>
## 1 Mess April 07 2014 April 07 2014
## 2 Nkombedzi April 22 2014 April 22 2014
## 3 B Compound May 13 2014 May 13 2014
## 4 D Compound May 13 2014 May 13 2014
## 5 Post Office May 13 2014 May 13 2014
## 6 Mangulenje May 15 2014 May 15 2014
## 7 Mangulenje Senior May 27 2014 May 27 2014
## 8 Old School May 27 2014 May 27 2014
## 9 Mwanza May 28 2014 May 28 2014
## 10 Alumenda June 18 2014 June 18 2014
## # ℹ 102 more rows
Alternatively, the lubridate package offers functions to extract date components:
irs_dates_1 %>%
  mutate(start_date_long = mdy(start_date_long)) %>%
  mutate(day = day(start_date_long),
         month = month(start_date_long, label = TRUE),
         year = year(start_date_long))
## # A tibble: 112 × 5
## village start_date_long day month year
## <chr> <date> <int> <ord> <dbl>
## 1 Mess 2014-04-07 7 Apr 2014
## 2 Nkombedzi 2014-04-22 22 Apr 2014
## 3 B Compound 2014-05-13 13 May 2014
## 4 D Compound 2014-05-13 13 May 2014
## 5 Post Office 2014-05-13 13 May 2014
## 6 Mangulenje 2014-05-15 15 May 2014
## 7 Mangulenje Senior 2014-05-27 27 May 2014
## 8 Old School 2014-05-27 27 May 2014
## 9 Mwanza 2014-05-28 28 May 2014
## 10 Alumenda 2014-06-18 18 Jun 2014
## # ℹ 102 more rows
‣ If some rows are missing parts, separate() issues a warning.
‣ Demonstrating with dates missing "April":
irs_dates_with_problem <-
  irs_dates_1 %>%
  mutate(start_date_missing = str_replace(start_date_long, "April ", ""))
irs_dates_with_problem
## # A tibble: 112 × 3
## village start_date_long start_date_missing
## <chr> <chr> <chr>
## 1 Mess April 07 2014 07 2014
## 2 Nkombedzi April 22 2014 22 2014
## 3 B Compound May 13 2014 May 13 2014
## 4 D Compound May 13 2014 May 13 2014
## 5 Post Office May 13 2014 May 13 2014
## 6 Mangulenje May 15 2014 May 15 2014
## 7 Mangulenje Senior May 27 2014 May 27 2014
## 8 Old School May 27 2014 May 27 2014
## 9 Mwanza May 28 2014 May 28 2014
## 10 Alumenda June 18 2014 June 18 2014
## # ℹ 102 more rows
‣ Splitting with missing parts:
irs_dates_with_problem %>%
  separate(start_date_missing, into = c("month", "day", "year"), sep = " ")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 3 rows [1, 2,
## 12].
## # A tibble: 112 × 5
## village start_date_long month day year
## <chr> <chr> <chr> <chr> <chr>
## 1 Mess April 07 2014 07 2014 <NA>
## 2 Nkombedzi April 22 2014 22 2014 <NA>
## 3 B Compound May 13 2014 May 13 2014
## 4 D Compound May 13 2014 May 13 2014
## 5 Post Office May 13 2014 May 13 2014
## 6 Mangulenje May 15 2014 May 15 2014
## 7 Mangulenje Senior May 27 2014 May 27 2014
## 8 Old School May 27 2014 May 27 2014
## 9 Mwanza May 28 2014 May 28 2014
## 10 Alumenda June 18 2014 June 18 2014
## # ℹ 102 more rows
‣ Now we have the day and month in the wrong columns for some rows.
Consider the esoph_ca dataset, from the {medicaldata} package, which involves a case-control study of esophageal cancer in France.
## # A tibble: 88 × 5
## agegp alcgp tobgp ncases ncontrols
## <ord> <ord> <ord> <dbl> <dbl>
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
## 7 25-34 40-79 20-29 0 4
## 8 25-34 40-79 30+ 0 7
## 9 25-34 80-119 0-9g/day 0 2
## 10 25-34 80-119 10-19 0 1
## # ℹ 78 more rows
Split the age ranges in the agegp column into two separate columns: agegp_lower and agegp_upper.
After using the separate() function, the "75+" age group will require special handling. Use readr::parse_number() or another method to convert the lower age limit ("75+") to a number.
library(dplyr)
library(tidyr)
library(readr)
medicaldata::esoph_ca %>%
  # "75+" contains no hyphen, so its upper limit is filled with NA (with a warning)
  separate(agegp, into = c("agegp_lower", "agegp_upper"), sep = "-", remove = FALSE) %>%
  mutate(
    agegp_lower = parse_number(agegp_lower), # "75+" becomes 75
    agegp_upper = parse_number(agegp_upper)  # stays NA for the "75+" group
  )
‣ To split on special characters (., +, *, ?) with separate(), escape them with a double backslash (\\).
‣ Consider the scenario where dates are formatted with periods:
# Reformat the dates with periods as separators
irs_with_period <- irs_dates_1 %>%
  mutate(start_date_long = format(lubridate::mdy(start_date_long), "%d.%m.%Y"))
irs_with_period
## # A tibble: 112 × 2
## village start_date_long
## <chr> <chr>
## 1 Mess 07.04.2014
## 2 Nkombedzi 22.04.2014
## 3 B Compound 13.05.2014
## 4 D Compound 13.05.2014
## 5 Post Office 13.05.2014
## 6 Mangulenje 15.05.2014
## 7 Mangulenje Senior 27.05.2014
## 8 Old School 27.05.2014
## 9 Mwanza 28.05.2014
## 10 Alumenda 18.06.2014
## # ℹ 102 more rows
‣ When attempting to separate this date format directly with sep = ".":
## # A tibble: 112 × 4
## village day month year
## <chr> <chr> <chr> <chr>
## 1 Mess "" "" ""
## 2 Nkombedzi "" "" ""
## 3 B Compound "" "" ""
## 4 D Compound "" "" ""
## 5 Post Office "" "" ""
## 6 Mangulenje "" "" ""
## 7 Mangulenje Senior "" "" ""
## 8 Old School "" "" ""
## 9 Mwanza "" "" ""
## 10 Alumenda "" "" ""
## # ℹ 102 more rows
‣ This is because, in regex (regular expressions), the period is a special character.
‣ The correct approach is to escape the period using a double backslash (\\):
## # A tibble: 112 × 4
## village day month year
## <chr> <chr> <chr> <chr>
## 1 Mess 07 04 2014
## 2 Nkombedzi 22 04 2014
## 3 B Compound 13 05 2014
## 4 D Compound 13 05 2014
## 5 Post Office 13 05 2014
## 6 Mangulenje 15 05 2014
## 7 Mangulenje Senior 27 05 2014
## 8 Old School 27 05 2014
## 9 Mwanza 28 05 2014
## 10 Alumenda 18 06 2014
## # ℹ 102 more rows
‣ Now, the function understands to split the string at each literal period.
‣ When using other special characters like +, *, or ? in the sep argument, they need to be preceded by a double backslash (\\).
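The escaped call is not shown above, so here is a minimal sketch of what it might look like. `toy_periods` is a hypothetical two-row stand-in for irs_with_period.

```r
library(tidyr)
library(tibble)

# Hypothetical dates in day.month.year format
toy_periods <- tibble(start_date_long = c("07.04.2014", "22.04.2014"))

# "\\." tells the regex engine to match a literal period
# rather than treating "." as "any character"
toy_fixed <- separate(
  toy_periods,
  start_date_long,
  into = c("day", "month", "year"),
  sep = "\\."
)
toy_fixed
```

The same pattern applies to the other special characters, for example sep = "\\+" to split on a plus sign.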
What is a Special Character?
In regular expressions, which help find patterns in text, special characters have specific roles. For example, a period (.) is a wildcard that can represent any character. So, in a search, "do.t" could match "dolt," "dost," or "doct." Similarly, the plus sign (+) is used to indicate one or more occurrences of the preceding character. For example, "ho+se" would match "hose" or "hooose" but not "hse." When we need to use these characters in their ordinary roles, we use a double backslash (\\) before them, like "\\." or "\\+". More on these special characters will be covered in a future lesson.
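These matching rules can be tried out with stringr::str_detect(), which returns TRUE where a pattern is found. str_detect() is not otherwise covered in this lesson, so treat this as an optional sketch; the test words are our own.

```r
library(stringr)

# "." stands for any single character, so "dot" (nothing between o and t) fails
wildcard <- str_detect(c("dolt", "dost", "dot"), "do.t")

# "+" means one or more of the preceding character, so "hse" (zero o's) fails
one_or_more <- str_detect(c("hose", "hooose", "hse"), "ho+se")

# Escaping the period makes it match only a literal "."
literal_dot <- str_detect(c("a.b", "axb"), "a\\.b")

wildcard
one_or_more
literal_dot
```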
Your next task involves the hiv_dat_clean_1 dataset. Focus on the regimen column, which lists drug regimens separated by a + sign. Your goal is to split this column into three new columns: drug_1, drug_2, and drug_3 using the separate() function. Pay close attention to how you handle the + separator. Here’s the column:
hiv_dat_clean_1 %>%
  select(regimen) %>%
  separate(regimen, into = c("drug_1", "drug_2", "drug_3"), sep = "\\+", remove = FALSE)
## # A tibble: 1,413 × 4
## regimen drug_1 drug_2 drug_3
## <chr> <chr> <chr> <chr>
## 1 AZT+3TC+NVP AZT 3TC NVP
## 2 TDF+3TC+EFV TDF 3TC EFV
## 3 TDF+3TC+EFV TDF 3TC EFV
## 4 TDF+3TC+EFV TDF 3TC EFV
## 5 TDF+3TC+EFV TDF 3TC EFV
## 6 AZT+3TC+NVP AZT 3TC NVP
## 7 TDF+3TC+EFV TDF 3TC EFV
## 8 AZT+3TC+NVP AZT 3TC NVP
## 9 AZT+3TC+NVP AZT 3TC NVP
## 10 TDF+3TC+EFV TDF 3TC EFV
## # ℹ 1,403 more rows
paste()
‣ Concatenate strings with paste()
‣ To combine two simple strings:
## [1] "Hello World"
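The output above corresponds to a call like the first one below. The second call is an extra illustration of our own, showing that paste() also works element by element on vectors.

```r
# paste() joins its arguments with a single space by default
greeting <- paste("Hello", "World")
greeting

# With vectors, elements are combined position by position
labels <- paste(c("Patient", "Patient"), c(1, 2))
labels
```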
‣ Let’s demonstrate this with the IRS data.
‣ First, we’ll separate the start date into individual columns:
irs_dates_separated <- # store for later use
  irs_dates_1 %>%
  separate(start_date_long, into = c("month", "day", "year"), sep = " ", remove = FALSE)
irs_dates_separated
## # A tibble: 112 × 5
## village start_date_long month day year
## <chr> <chr> <chr> <chr> <chr>
## 1 Mess April 07 2014 April 07 2014
## 2 Nkombedzi April 22 2014 April 22 2014
## 3 B Compound May 13 2014 May 13 2014
## 4 D Compound May 13 2014 May 13 2014
## 5 Post Office May 13 2014 May 13 2014
## 6 Mangulenje May 15 2014 May 15 2014
## 7 Mangulenje Senior May 27 2014 May 27 2014
## 8 Old School May 27 2014 May 27 2014
## 9 Mwanza May 28 2014 May 28 2014
## 10 Alumenda June 18 2014 June 18 2014
## # ℹ 102 more rows
‣ Then, recombine day, month and year with paste():
irs_dates_separated %>%
  select(day, month, year) %>%
  mutate(start_date_long_2 = paste(day, month, year))
## # A tibble: 112 × 4
## day month year start_date_long_2
## <chr> <chr> <chr> <chr>
## 1 07 April 2014 07 April 2014
## 2 22 April 2014 22 April 2014
## 3 13 May 2014 13 May 2014
## 4 13 May 2014 13 May 2014
## 5 13 May 2014 13 May 2014
## 6 15 May 2014 15 May 2014
## 7 27 May 2014 27 May 2014
## 8 27 May 2014 27 May 2014
## 9 28 May 2014 28 May 2014
## 10 18 June 2014 18 June 2014
## # ℹ 102 more rows
‣ The sep argument specifies the separator between elements
‣ For different separators, we can write:
## # A tibble: 112 × 6
## village start_date_long month day year start_date_long_2
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Mess April 07 2014 April 07 2014 07-April-2014
## 2 Nkombedzi April 22 2014 April 22 2014 22-April-2014
## 3 B Compound May 13 2014 May 13 2014 13-May-2014
## 4 D Compound May 13 2014 May 13 2014 13-May-2014
## 5 Post Office May 13 2014 May 13 2014 13-May-2014
## 6 Mangulenje May 15 2014 May 15 2014 15-May-2014
## 7 Mangulenje Senior May 27 2014 May 27 2014 27-May-2014
## 8 Old School May 27 2014 May 27 2014 27-May-2014
## 9 Mwanza May 28 2014 May 28 2014 28-May-2014
## 10 Alumenda June 18 2014 June 18 2014 18-June-2014
## # ℹ 102 more rows
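The hyphenated dates above were presumably created with sep = "-" inside mutate(). The sketch below shows the same call on three plain vectors (hypothetical stand-ins for the day, month and year columns) so that it runs on its own.

```r
# Hypothetical stand-ins for the day, month and year columns
day   <- c("07", "13")
month <- c("April", "May")
year  <- c("2014", "2014")

# sep sets the string placed between the pieces
hyphenated <- paste(day, month, year, sep = "-")
hyphenated

# Any separator works, including longer ones
slashed <- paste(day, month, year, sep = " / ")
slashed
```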
‣ To concatenate without spaces, we can set sep = "":
irs_dates_separated %>%
  select(day, month, year) %>%
  mutate(start_date_long_2 = paste(day, month, year, sep = ""))
## # A tibble: 112 × 4
## day month year start_date_long_2
## <chr> <chr> <chr> <chr>
## 1 07 April 2014 07April2014
## 2 22 April 2014 22April2014
## 3 13 May 2014 13May2014
## 4 13 May 2014 13May2014
## 5 13 May 2014 13May2014
## 6 15 May 2014 15May2014
## 7 27 May 2014 27May2014
## 8 27 May 2014 27May2014
## 9 28 May 2014 28May2014
## 10 18 June 2014 18June2014
## # ℹ 102 more rows
‣ Or use the paste0() function, which is equivalent to paste(..., sep = ""):
irs_dates_separated %>%
  select(day, month, year) %>%
  mutate(start_date_long_2 = paste0(day, month, year))
## # A tibble: 112 × 4
## day month year start_date_long_2
## <chr> <chr> <chr> <chr>
## 1 07 April 2014 07April2014
## 2 22 April 2014 22April2014
## 3 13 May 2014 13May2014
## 4 13 May 2014 13May2014
## 5 13 May 2014 13May2014
## 6 15 May 2014 15May2014
## 7 27 May 2014 27May2014
## 8 27 May 2014 27May2014
## 9 28 May 2014 28May2014
## 10 18 June 2014 18June2014
## # ℹ 102 more rows
‣ Combine paste() with other string functions to solve a realistic data problem.
‣ Consider the ID column in the hiv_dat_messy_1 dataset:
## # A tibble: 1,413 × 1
## patient_id
## <chr>
## 1 pd-10037
## 2 pd-10537
## 3 pd-5489
## 4 id-5523
## 5 pd-4942
## 6 pd-4742
## 7 pd-10879
## 8 id-2885
## 9 pd-4861
## 10 pd-5180
## # ℹ 1,403 more rows
‣ Standardize these IDs to the same number of characters.
‣ Use separate() to split the IDs into parts, then use paste() to recombine them:
hiv_dat_messy_1 %>%
  select(patient_id) %>% # for visibility
  separate(patient_id, into = c("prefix", "patient_num"), sep = "-", remove = FALSE) %>%
  mutate(patient_num = str_pad(patient_num, width = 5, side = "left", pad = "0")) %>%
  mutate(patient_id_padded = paste(prefix, patient_num, sep = "-"))
## # A tibble: 1,413 × 4
## patient_id prefix patient_num patient_id_padded
## <chr> <chr> <chr> <chr>
## 1 pd-10037 pd 10037 pd-10037
## 2 pd-10537 pd 10537 pd-10537
## 3 pd-5489 pd 05489 pd-05489
## 4 id-5523 id 05523 id-05523
## 5 pd-4942 pd 04942 pd-04942
## 6 pd-4742 pd 04742 pd-04742
## 7 pd-10879 pd 10879 pd-10879
## 8 id-2885 id 02885 id-02885
## 9 pd-4861 pd 04861 pd-04861
## 10 pd-5180 pd 05180 pd-05180
## # ℹ 1,403 more rows
‣ In this example, patient_id is split into a prefix and a number.
‣ The number is padded with zeros to ensure consistent length.
‣ They’re concatenated back together using paste() with a hyphen as the separator.
lima_messy_1 Dataset
In the lima_messy_1 dataset, the IDs are not zero-padded, making them hard to sort.
For example, the ID pe-998 is at the top of the list after sorting in descending order, which is not what we want.
lima_messy_1 %>%
  select(id) %>%
  arrange(desc(id)) # sort in descending order (highest IDs should be at the top)
## # A tibble: 1,293 × 1
## id
## <chr>
## 1 pe-998
## 2 pe-996
## 3 pe-951
## 4 pe-900
## 5 pe-2347
## 6 pe-2337
## 7 pe-2335
## 8 pe-2333
## 9 pe-2331
## 10 pe-2329
## # ℹ 1,283 more rows
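Try to fix this using a similar procedure to the one used for hiv_dat_messy_1.
Your Task: Separate each ID into a prefix and a number, pad the number with zeros, recombine the parts with paste(), then sort again in descending order. The top ID should now end in 2347.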
lima_messy_1 %>%
  select(id) %>%
  separate(id, into = c("prefix", "patient_num"), sep = "-", remove = FALSE) %>%
  mutate(patient_num = str_pad(patient_num, width = 4, side = "left", pad = "0")) %>%
  mutate(id_padded = paste(prefix, patient_num, sep = "-")) %>%
  select(id_padded) %>%
  arrange(desc(id_padded)) # sort in descending order (highest IDs should be at the top)
Create a column containing summary statements combining village, start_date_default, and coverage_p from the irs dataset. The statement should describe the spray coverage for each village.
Desired Output: "For village X, the spray coverage was Y% on Z date."
Your Task:
- Select the necessary columns from the irs dataset.
- Use paste() to create the summary statement.
str_sub
‣ str_sub is used to extract parts of a string based on character positions
‣ Basic syntax: str_sub(string, start, end)
‣ Example: Extracting first 2 characters from patient IDs
## [1] "ID" "ID"
‣ To extract other characters, like the first 5, adjust the start and end values
## [1] "ID123" "ID678"
‣ Negative values count backward from the string end, useful for suffixes
‣ Examples: Get the last 4 characters of patient IDs:
## [1] "-abc" "-def"
‣ str_sub will not error out if indices exceed string length
## [1] "ID12345-abc" "ID67890-def"
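The four outputs above can be reproduced with the calls below. The `patient_ids` vector is inferred from the last output; the end position of 15 in the final call is an assumption, since any value past the string length behaves the same way.

```r
library(stringr)

# Inferred from the full strings shown in the output above
patient_ids <- c("ID12345-abc", "ID67890-def")

first_two  <- str_sub(patient_ids, 1, 2)    # positions 1 to 2
first_five <- str_sub(patient_ids, 1, 5)    # positions 1 to 5
last_four  <- str_sub(patient_ids, -4, -1)  # counted from the end of the string
past_end   <- str_sub(patient_ids, 1, 15)   # end beyond the string length: no error

first_two
first_five
last_four
past_end
```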
‣ Within mutate(), str_sub can be used to transform columns in a data frame
‣ Example: Extract the year and month from the start_date_default column into a new column called year_month:
irs %>%
  select(start_date_default) %>%
  mutate(year_month = str_sub(start_date_default, start = 1, end = 7))
## # A tibble: 112 × 2
## start_date_default year_month
## <date> <chr>
## 1 2014-04-07 2014-04
## 2 2014-04-22 2014-04
## 3 2014-05-13 2014-05
## 4 2014-05-13 2014-05
## 5 2014-05-13 2014-05
## 6 2014-05-15 2014-05
## 7 2014-05-27 2014-05
## 8 2014-05-27 2014-05
## 9 2014-05-28 2014-05
## 10 2014-06-18 2014-06
## # ℹ 102 more rows
PRACTICE TIME!
Congratulations on reaching the end of this lesson! You’ve learned about strings in R and various functions to manipulate them effectively.
The table below gives a quick recap of the key functions we covered. Remember, you don’t need to memorize all these functions. Knowing they exist and how to look them up (like using Google) is more than enough for practical applications.
Function | Description | Example | Example Output |
---|---|---|---|
str_to_upper() | Convert characters to uppercase | str_to_upper("hiv") | "HIV" |
str_to_lower() | Convert characters to lowercase | str_to_lower("HIV") | "hiv" |
str_to_title() | Convert first character of each word to uppercase | str_to_title("hiv awareness") | "Hiv Awareness" |
str_trim() | Remove whitespace from start & end | str_trim(" hiv ") | "hiv" |
str_squish() | Remove whitespace from start & end and reduce internal spaces | str_squish(" hiv cases ") | "hiv cases" |
str_pad() | Pad a string to a fixed width | str_pad("45", width = 5, pad = "0") | "00045" |
str_wrap() | Wrap a string to a given width (for formatting output) | str_wrap("HIV awareness", width = 5) | "HIV\nawareness" |
str_split() | Split elements of a character vector | str_split("Hello-World", "-") | list(c("Hello", "World")) |
paste() | Concatenate vectors after converting to character | paste("Hello", "World") | "Hello World" |
str_sub() | Extract and replace substrings from a character vector | str_sub("HelloWorld", 1, 4) | "Hell" |
separate() | Separate a character column into multiple columns | separate(tibble(a = "Hello-World"), a, into = c("b", "c"), sep = "-") | columns b = "Hello" and c = "World" |
Note that while these functions cover common tasks such as string standardization, splitting and joining strings, this introduction only scratches the surface of what’s possible with the {stringr} package. If you work with a lot of raw text data, you may want to do further exploring on the stringr website.
ex_a: Correct.
ex_b: Correct.
ex_c: Error. Corrected version:
ex_c <- "They've been \"best friends\" for years."
ex_d: Error. Corrected version:
ex_d <- 'Jane\'s diary'
ex_e: Error. Close quote missing. Corrected version:
ex_e <- "It's a sunny day!"
instructions <- "Take two tablets daily after meals. If symptoms persist for more than three days, consult your doctor immediately. Do not take more than the recommended dose. Keep out of reach of children."
# Wrap instructions
wrapped_instructions <- str_wrap(instructions, width = 50)
ggplot(data.frame(x = 1, y = 1), aes(x, y, label = wrapped_instructions)) +
  geom_label() +
  theme_void()
The steps to clean the lima_messy dataset would involve:
lima_clean <- lima_messy %>%
  mutate(
    marital_status = str_squish(str_to_title(marital_status)), # Clean and standardize marital_status
    sex = str_squish(str_to_upper(sex)),                       # Clean and standardize sex
    age = str_squish(str_to_lower(age))                        # Clean and standardize age
  )
lima_clean %>%
  select(marital_status, sex, age) %>%
  tbl_summary()
Then, use the tbl_summary() function to create the summary table.