1 Intro

Understanding how to manipulate dates is a crucial skill in health data analysis. From patient admission dates to vaccination schedules, date-related data plays a vital role in epidemiological analyses. In this lesson, we will learn how R stores and displays dates, as well as how to effectively manipulate, parse, and format them. Let’s get started!

2 Learning Objectives

‣You understand how dates are stored and manipulated in R

‣You understand how to coerce strings to dates

‣You know how to handle messy dates

‣You are able to change how dates are displayed

3 Packages

Please load the packages needed for this lesson with the code below:

if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, 
               lubridate)

3.1 Datasets

‣ Today we’ll work with two datasets.

‣ The first is about Indoor Residual Spraying (IRS) for malaria control in Illovo, Malawi.

irs <- read_csv(here("data/Illovo_data.csv"))
irs
## # A tibble: 112 × 9
##    village   target_spray sprayed coverage_p start_date_default end_date_default
##    <chr>            <dbl>   <dbl>      <dbl> <date>             <date>          
##  1 Mess                87      64       73.6 2014-04-07         2014-04-17      
##  2 Nkombedzi          183     169       92.4 2014-04-22         2014-04-27      
##  3 B Compou…           16      16      100   2014-05-13         2014-05-13      
##  4 D Compou…            3       2       66.7 2014-05-13         2014-05-13      
##  5 Post Off…            6       3       50   2014-05-13         2014-05-13      
##  6 Mangulen…          375     372       99.2 2014-05-15         2014-05-26      
##  7 Mangulen…            7       4       57.1 2014-05-27         2014-05-27      
##  8 Old Scho…           24      23       95.8 2014-05-27         2014-05-27      
##  9 Mwanza             671     636       94.8 2014-05-28         2014-06-16      
## 10 Alumenda           226     226      100   2014-06-18         2014-06-27      
## # ℹ 102 more rows
## # ℹ 3 more variables: start_date_typical <chr>, start_date_long <chr>,
## #   start_date_messy <chr>

‣ The second is simulated data on hospital inpatient stays. It includes admission and discharge dates for 150 patients in different formats.

ip <- read_csv(here("data/inpatient_data.csv"))
ip
## # A tibble: 150 × 6
##    patient_id adm_date_default adm_date_common adm_date_abbrev adm_date_messy
##         <dbl> <date>           <chr>           <chr>           <chr>         
##  1          1 2021-05-23       05/23/2021      May 23, 2021    2021/05/23    
##  2          2 2022-12-07       12/07/2022      Dec 07, 2022    12-07-2022    
##  3          3 2022-03-27       03/27/2022      Mar 27, 2022    2022/03/27    
##  4          4 2022-04-28       04/28/2022      Apr 28, 2022    28.04.2022    
##  5          5 2023-06-28       06/28/2023      Jun 28, 2023    06-28-2023    
##  6          6 2021-01-11       01/11/2021      Jan 11, 2021    01-11-2021    
##  7          7 2022-04-10       04/10/2022      Apr 10, 2022    10.04.2022    
##  8          8 2023-06-22       06/22/2023      Jun 22, 2023    2023/06/22    
##  9          9 2020-10-21       10/21/2020      Oct 21, 2020    2020/10/21    
## 10         10 2022-12-13       12/13/2022      Dec 13, 2022    12-13-2022    
## # ℹ 140 more rows
## # ℹ 1 more variable: disch_date_default <date>

3.2 Introduction to dates in R

‣ In R, we use the Date class to handle dates.

‣ Default format: “YYYY-MM-DD”

‣ If you just enter in such a date string, R will initially consider this to be a character:

"2020-01-01"
## [1] "2020-01-01"
class("2020-01-01")
## [1] "character"

‣ To specify a date, we can use as.Date() function.

my_date <- as.Date("2020-01-01")
class(my_date)
## [1] "Date"

‣ With Date, we can find the difference between two dates.

as.Date("2020-01-01") - as.Date("2019-01-01")
## Time difference of 365 days

This would of course not be possible if you had bare characters:

"2020-01-01" - "2000-12-20"

‣ The default format for as.Date() is “YYYY-MM-DD”.

‣ Other common formats like “MM/DD/YYYY” or “Month DD, YYYY” won’t work by default (later we’ll see how to “parse” such dates):

as.Date("12/31/2000") 
as.Date("Dec 31, 2000")

‣ However, R will also accept “/” instead of “-” as long as the order is still “YYYY/MM/DD”.

as.Date("2000/12/31")
## [1] "2000-12-31"

4 Coercing (parsing) strings to dates

‣ Let’s focus on how R classifies date variables in our IRS dataset.

‣ We’ll examine the columns containing dates and see how R reads them.

irs %>% 
  select(contains("date"))
## # A tibble: 112 × 5
##    start_date_default end_date_default start_date_typical start_date_long
##    <date>             <date>           <chr>              <chr>          
##  1 2014-04-07         2014-04-17       07/04/2014         April 07 2014  
##  2 2014-04-22         2014-04-27       22/04/2014         April 22 2014  
##  3 2014-05-13         2014-05-13       13/05/2014         May 13 2014    
##  4 2014-05-13         2014-05-13       13/05/2014         May 13 2014    
##  5 2014-05-13         2014-05-13       13/05/2014         May 13 2014    
##  6 2014-05-15         2014-05-26       15/05/2014         May 15 2014    
##  7 2014-05-27         2014-05-27       27/05/2014         May 27 2014    
##  8 2014-05-27         2014-05-27       27/05/2014         May 27 2014    
##  9 2014-05-28         2014-06-16       28/05/2014         May 28 2014    
## 10 2014-06-18         2014-06-27       18/06/2014         June 18 2014   
## # ℹ 102 more rows
## # ℹ 1 more variable: start_date_messy <chr>

‣ To parse the remaining dates, we need to specify the date format. Why? Ambiguity.

‣ For example: 01-02-03. Is it January 2nd, 2003, or February 1st, 2003?


4.1 Using Base R for Date coercion (parsing)

‣ As we saw, the function as.Date() is used to convert strings to dates in base R.

‣ Let’s try it without specifying the format of the date.

irs %>% 
  select(start_date_typical) %>% 
  mutate(start_date_typical = as.Date(start_date_typical))
## # A tibble: 112 × 1
##    start_date_typical
##    <date>            
##  1 0007-04-20        
##  2 0022-04-20        
##  3 0013-05-20        
##  4 0013-05-20        
##  5 0013-05-20        
##  6 0015-05-20        
##  7 0027-05-20        
##  8 0027-05-20        
##  9 0028-05-20        
## 10 0018-06-20        
## # ℹ 102 more rows

‣ As you see, not what we wanted. Original format was “DD/MM/YYYY”.


‣ To correctly interpret dates, we use the format argument in as.Date()

‣ Symbols to specify date components:

Component Symbol Example
Year (numeric, with century) %Y 2023
Year (numeric, without century) %y 23
Month (numeric, 01-12) %m 01
Month (written out fully) %B January
Month (abbreviated) %b Jan
Day of the month %d 31
Day of the week (numeric, 1-7 with Sunday being 1) %u 5
Day of the week (written out fully) %A Friday
Day of the week (abbreviated) %a Fri

‣ Let’s try with format = "%d%m%Y"

irs %>%
  select(start_date_typical) %>% 
  mutate(start_date_typical = as.Date(start_date_typical, format = "%d%m%Y"))

‣ Still did not work? Why?

‣ Format needs to include the slashes %d/%m/%Y

irs %>%
  select(start_date_typical) %>% 
  mutate(start_date_typical = as.Date(start_date_typical, format = "%d/%m/%Y"))
## # A tibble: 112 × 1
##    start_date_typical
##    <date>            
##  1 2014-04-07        
##  2 2014-04-22        
##  3 2014-05-13        
##  4 2014-05-13        
##  5 2014-05-13        
##  6 2014-05-15        
##  7 2014-05-27        
##  8 2014-05-27        
##  9 2014-05-28        
## 10 2014-06-18        
## # ℹ 102 more rows

PRACTICE TIME !

Coerce long date

Try to coerce the column start_date_long from the IRS dataset to the Date class. Don’t forget to include all elements into the format option, including the symbols that separate the components of the date!

# Complete the code with your answer:
irs %>%
  select(start_date_long) %>%
  mutate(start_date_long = as.Date(start_date_long, format= "%B%d%Y"))

Find code errors

Can you find all the errors in the following code?

as.Date("June 26, 1987", format = "%B%d,%Y")
## [1] "1987-06-26"

5 Lubridate

‣ The lubridate package offers a user-friendly way to convert strings to dates.

‣ Specify the order of year, month, and day using “y”, “m”, and “d”.

‣ Separators between components are not necessary.


# Example conversions
dmy("8 Sep 2021")
## [1] "2021-09-08"
mdy("Jan 01 2020")
## [1] "2020-01-01"
class(ymd("2023 May 23"))
## [1] "Date"

‣ Let’s apply these functions to the start_day_long variable in our dataset.

irs %>%
  select(start_date_long) %>%  
  mutate(start_date_long_conv = mdy(start_date_long)) 

PRACTICE TIME !

Coerce typical date

Try to coerce the column start_date_typical from the IRS dataset to the Date class using the functions in the lubridate package.

irs %>% 
  select(start_date_typical) %>% 
  mutate(start_date_typical = dmy(start_date_typical))

PRACTICE TIME !

Base and lubridate formatting

The following table contains the formats found in the adm_date_abbr and adm_date_messy formats from our inpatient dataset. See if you can fill in the blank cells:

Date example Base R Lubridate
Dec 07, 2022 %b%d,%Y mdy
03-27-2022 %m-%d-%Y mdy
28.04.2022 %d.%m.%Y dmy
2024/11/13 %Y/%m/%d ymd

‣ We’ve learned methods to convert strings to dates.

‣ But what about dealing with multiple date formats in one column?


5.1 Handling Messy Dates with lubridate::parse_date_time()

‣ Working with dates often involves multiple formats within a single column.

lubridate offers parse_date_time() function to handle such scenarios.

‣ The function uses the orders argument for specifying multiple date formats.


‣ Let’s start with a basic example:

parse_date_time("30/07/2001", orders = "dmy")
## [1] "2001-07-30 UTC"

‣ The function works well for single string inputs and is equivalent to using dmy() in this case.


‣ The power of parse_date_time() is truly evident when handling multiple formats.

‣ Example with two different formats:

parse_date_time(c("1 Jan 2000", "July 30th 2001"), orders = c("dmy", "mdy"))
## [1] "2000-01-01 UTC" "2001-07-30 UTC"

‣ Order of formats in orders argument does not matter when the date strings are distinct:

parse_date_time(c("1 Jan 2000", "July 30th 2001"), orders = c("mdy", "dmy"))
## [1] "2000-01-01 UTC" "2001-07-30 UTC"

‣ Ambiguity in dates can be a challenge.

‣ For ambiguous dates, the order in orders does matter.

parse_date_time(c("01/02/2000", "01/03/2000"), orders=c("mdy", "dmy"))
## [1] "2000-01-02 UTC" "2000-01-03 UTC"
parse_date_time(c("01/02/2000", "01/03/2000"), orders=c("dmy", "mdy"))
## [1] "2000-02-01 UTC" "2000-03-01 UTC"

PRACTICE TIME !

Using parse_date_time

The dates in the code below are November 9th 2002, December 4th 2001, and June 5th 2003. Complete the code to coerce them from strings to dates.

parse_date_time(c("11/09/2002", "12/04/2001", "2003-06-05"), orders=c("mdy", "mdy", "ymd"))

‣ Returning to our dataset, focus on the start_date_messy column.

‣ We are dealing with four different formats:

- YYYY/MM/DD

- Month DD YYYY

- DD-MM-YYYY

- MM/DD/YYYY

‣ Now that we know how our data is formatted, we can use the parse_date_time() function to clean it up.

irs %>%
  select(start_date_messy) %>%
  mutate(start_date_clean = parse_date_time(start_date_messy, 
                                        orders = c("mdy", "dmy", "ymd")))
start_date_messy start_date_clean
04/07/14 2014-04-07
April 22 2014 2014-04-22
May 13 2014 2014-05-13
13-05-2014 2014-05-13
May 13 2014 2014-05-13

‣ Is order important here? Let’s try a different order and find out!

irs %>%
  select(start_date_messy) %>%
  mutate(start_date_messy_1 = parse_date_time(start_date_messy, 
                                              orders = c("mdy", "dmy", "ymd")),
         start_date_messy_2 = parse_date_time(start_date_messy, orders = c("dmy", "mdy", "ymd")))
start_date_messy start_date_messy_1 start_date_messy_2
04/07/14 2014-04-07 2014-04-07
April 22 2014 2014-04-22 2014-04-22
May 13 2014 2014-05-13 2014-05-13
13-05-2014 2014-05-13 2014-05-13
May 13 2014 2014-05-13 2014-05-13

‣ If you scroll through, you will be able to observe that the conversions were the same.

‣ Why? Because there is no ambiguity in the dataset, and the order of formats provided to orders only matter when we have ambiguous dates.

‣ You may think that there IS ambiguity: perhaps, the first date could be July 4, or April 7:

irs %>%
  select(start_date_messy) %>% 
  slice(1)
## # A tibble: 1 × 1
##   start_date_messy
##   <chr>           
## 1 04/07/14

‣ But {lubridate} read it as April 7 (“mdy”) in both cases, even when “dmy” was provided first for the start_date_messy_2 column?

‣ This is because there are other rows that resolve this amibiguity. For example, row 11 is “06/26/14”:

irs %>%
  select(start_date_messy) %>% 
  slice(1, 11)
## # A tibble: 2 × 1
##   start_date_messy
##   <chr>           
## 1 04/07/14        
## 2 06/26/14

‣ This row can ONLY be interpreted as June 26. The number 26 cannot represent a month.

‣ So lubridate knows that all rows of the format “XX/XX/XX” in the dataset are “mm/dd/yy”

Using parse_date_time with adm_date_messy

With the help of the table you completed from the exercise, use the parse_date_time() function to clean up the adm_date_messy column in the inpatient dataset, ip!

ip %>% 
  select(adm_date_messy) %>% 
  mutate(adm_date_messy = parse_date_time(adm_date_messy,
                                          orders = c("ymd", "mdy", "dmy")))
## # A tibble: 150 × 1
##    adm_date_messy     
##    <dttm>             
##  1 2021-05-23 00:00:00
##  2 2022-12-07 00:00:00
##  3 2022-03-27 00:00:00
##  4 2022-04-28 00:00:00
##  5 2023-06-28 00:00:00
##  6 2021-01-11 00:00:00
##  7 2022-04-10 00:00:00
##  8 2023-06-22 00:00:00
##  9 2020-10-21 00:00:00
## 10 2022-12-13 00:00:00
## # ℹ 140 more rows

5.2 Changing how dates are displayed

‣ Up to now, we’ve been using the Date class with a default “YYYY-MM-DD” format.

‣ But, how can we display dates in a specific format for reports or graphs?

‣ We can use the format() function in base R.

‣ This function gives us the freedom to customize the appearance of dates.


‣ Here’s a table to remind us how different date parts are represented:

Component Symbol Example
Year (numeric, with century) %Y 2023
Year (numeric, without century) %y 23
Month (numeric, 01-12) %m 01
Month (written out fully) %B January
Month (abbreviated) %b Jan
Day of the month %d 31
Day of the week (numeric, 1-7 with Sunday being 1) %u 5
Day of the week (written out fully) %A Friday
Day of the week (abbreviated) %a Fri

‣ Let’s try a single example:

‣ We want to convert 2000-01-31 to “Jan 31, 2000”.

my_date <- as.Date("2000-01-31")

my_date
## [1] "2000-01-31"
format(my_date, "%b %d, %Y")
## [1] "Jan 31, 2000"

PRACTICE TIME !

Format date vector

Format the date below to MM/DD/YYYY using the format function:

my_date <- as.Date("2018-05-07")

format(my_date, "%m/%d/%Y")
## [1] "05/07/2018"

‣ Now, let’s apply this on our dataset:

‣ Creating a new variable start_date_char from start_date_default.

‣ We’ll display it as DD/MM/YYYY.

irs %>% 
  select(start_date_default) %>% 
  mutate(start_date_char = format(start_date_default, "%d %b %Y")) %>% 
  select(start_date_default, start_date_char)

‣ One last example:

‣ Using end_date_default and formatting it as Month DD, YYYY.

irs %>% 
  select(end_date_default) %>% 
  mutate(end_date_char = format(end_date_default, "%B %d, %Y")) %>% 
  select(end_date_default, end_date_char)

5.3 Wrap Up!

‣ Congratulations on finishing the first Dates lesson!

‣ You now understand how Dates are stored, displayed, and formatted in R.

‣ Next, we’ll learn how to perform manipulations with dates and create basic time series graphs.


Answer Key

Coerce long date

irs <- irs %>%
  mutate(start_date_long = as.Date(start_date_long, format="%B %d %Y"))

Find code errors

as.Date("June 26, 1987", format = "%B %d, %Y")

Coerce typical date

irs %>%
  mutate(start_date_typical = dmy(start_date_typical))

Base and lubridate formatting

Date example Base R Lubridate
Dec 07, 2022 %b %d, %Y mdy
03-27-2022 %m-%d-%Y mdy
28.04.2022 %d.%m.%Y dmy
2021/05/23 %Y/%m/%d ymd

Using parse_date_time

parse_date_time(c("11/09/2002", "12/04/2001", "2003-06-05"), orders=c("mdy", "ymd"))

Using parse_date_time with adm_date_messy

ip %>%
  mutate(adm_date_messy = parse_date_time(adm_date_messy, orders = c("mdy", "dmy", "ymd")))

Format date vector

my_date <- as.Date("2018-05-07")
format(my_date, "%m/%d/%Y")

Contributors

The following team members contributed to this lesson: