Understanding how to manipulate dates is a crucial skill in health data analysis. From patient admission dates to vaccination schedules, date-related data plays a vital role in epidemiological analyses. In this lesson, we will learn how R stores and displays dates, as well as how to effectively manipulate, parse, and format them. Let’s get started!
‣You understand how dates are stored and manipulated in R
‣You understand how to coerce strings to dates
‣You know how to handle messy dates
‣You are able to change how dates are displayed
Please load the packages needed for this lesson with the code below:
‣ Today we’ll work with two datasets.
‣ The first is about Indoor Residual Spraying (IRS) for malaria control in Illovo, Malawi.
## # A tibble: 112 × 9
## village target_spray sprayed coverage_p start_date_default end_date_default
## <chr> <dbl> <dbl> <dbl> <date> <date>
## 1 Mess 87 64 73.6 2014-04-07 2014-04-17
## 2 Nkombedzi 183 169 92.4 2014-04-22 2014-04-27
## 3 B Compou… 16 16 100 2014-05-13 2014-05-13
## 4 D Compou… 3 2 66.7 2014-05-13 2014-05-13
## 5 Post Off… 6 3 50 2014-05-13 2014-05-13
## 6 Mangulen… 375 372 99.2 2014-05-15 2014-05-26
## 7 Mangulen… 7 4 57.1 2014-05-27 2014-05-27
## 8 Old Scho… 24 23 95.8 2014-05-27 2014-05-27
## 9 Mwanza 671 636 94.8 2014-05-28 2014-06-16
## 10 Alumenda 226 226 100 2014-06-18 2014-06-27
## # ℹ 102 more rows
## # ℹ 3 more variables: start_date_typical <chr>, start_date_long <chr>,
## # start_date_messy <chr>
‣ The second is simulated data on hospital inpatient stays. It includes admission and discharge dates for 150 patients in different formats.
## # A tibble: 150 × 6
## patient_id adm_date_default adm_date_common adm_date_abbrev adm_date_messy
## <dbl> <date> <chr> <chr> <chr>
## 1 1 2021-05-23 05/23/2021 May 23, 2021 2021/05/23
## 2 2 2022-12-07 12/07/2022 Dec 07, 2022 12-07-2022
## 3 3 2022-03-27 03/27/2022 Mar 27, 2022 2022/03/27
## 4 4 2022-04-28 04/28/2022 Apr 28, 2022 28.04.2022
## 5 5 2023-06-28 06/28/2023 Jun 28, 2023 06-28-2023
## 6 6 2021-01-11 01/11/2021 Jan 11, 2021 01-11-2021
## 7 7 2022-04-10 04/10/2022 Apr 10, 2022 10.04.2022
## 8 8 2023-06-22 06/22/2023 Jun 22, 2023 2023/06/22
## 9 9 2020-10-21 10/21/2020 Oct 21, 2020 2020/10/21
## 10 10 2022-12-13 12/13/2022 Dec 13, 2022 12-13-2022
## # ℹ 140 more rows
## # ℹ 1 more variable: disch_date_default <date>
‣ In R, we use the Date
class to handle dates.
‣ Default format: “YYYY-MM-DD”
‣ If you just enter in such a date string, R will initially consider this to be a character:
## [1] "2020-01-01"
## [1] "character"
‣ To specify a date, we can use as.Date()
function.
## [1] "Date"
‣ With Date
, we can find the difference between
two dates.
## Time difference of 365 days
This would of course not be possible if you had bare characters:
"2020-01-01" - "2000-12-20"
‣ The default format for as.Date()
is “YYYY-MM-DD”.
‣ Other common formats like “MM/DD/YYYY” or “Month DD, YYYY” won’t work by default (later we’ll see how to “parse” such dates):
‣ However, R will also accept “/” instead of “-” as long as the order is still “YYYY/MM/DD”.
## [1] "2000-12-31"
‣ Let’s focus on how R classifies date variables in our IRS dataset.
‣ We’ll examine the columns containing dates and see how R reads them.
## # A tibble: 112 × 5
## start_date_default end_date_default start_date_typical start_date_long
## <date> <date> <chr> <chr>
## 1 2014-04-07 2014-04-17 07/04/2014 April 07 2014
## 2 2014-04-22 2014-04-27 22/04/2014 April 22 2014
## 3 2014-05-13 2014-05-13 13/05/2014 May 13 2014
## 4 2014-05-13 2014-05-13 13/05/2014 May 13 2014
## 5 2014-05-13 2014-05-13 13/05/2014 May 13 2014
## 6 2014-05-15 2014-05-26 15/05/2014 May 15 2014
## 7 2014-05-27 2014-05-27 27/05/2014 May 27 2014
## 8 2014-05-27 2014-05-27 27/05/2014 May 27 2014
## 9 2014-05-28 2014-06-16 28/05/2014 May 28 2014
## 10 2014-06-18 2014-06-27 18/06/2014 June 18 2014
## # ℹ 102 more rows
## # ℹ 1 more variable: start_date_messy <chr>
‣ To parse the remaining dates, we need to specify the date format. Why? Ambiguity.
‣ For example: 01-02-03
. Is it January 2nd, 2003, or
February 1st, 2003?
‣ As we saw, the function as.Date()
is used to convert
strings to dates in base R.
‣ Let’s try it without specifying the format of the date.
## # A tibble: 112 × 1
## start_date_typical
## <date>
## 1 0007-04-20
## 2 0022-04-20
## 3 0013-05-20
## 4 0013-05-20
## 5 0013-05-20
## 6 0015-05-20
## 7 0027-05-20
## 8 0027-05-20
## 9 0028-05-20
## 10 0018-06-20
## # ℹ 102 more rows
‣ As you see, not what we wanted. Original format was “DD/MM/YYYY”.
‣ To correctly interpret dates, we use the format
argument in as.Date()
‣ Symbols to specify date components:
Component | Symbol | Example |
---|---|---|
Year (numeric, with century) | %Y | 2023 |
Year (numeric, without century) | %y | 23 |
Month (numeric, 01-12) | %m | 01 |
Month (written out fully) | %B | January |
Month (abbreviated) | %b | Jan |
Day of the month | %d | 31 |
Day of the week (numeric, 1-7 with Sunday being 1) | %u | 5 |
Day of the week (written out fully) | %A | Friday |
Day of the week (abbreviated) | %a | Fri |
‣ Let’s try with format = "%d%m%Y"
irs %>%
select(start_date_typical) %>%
mutate(start_date_typical = as.Date(start_date_typical, format = "%d%m%Y"))
‣ Still did not work? Why?
‣ Format needs to include the slashes %d/%m/%Y
irs %>%
select(start_date_typical) %>%
mutate(start_date_typical = as.Date(start_date_typical, format = "%d/%m/%Y"))
## # A tibble: 112 × 1
## start_date_typical
## <date>
## 1 2014-04-07
## 2 2014-04-22
## 3 2014-05-13
## 4 2014-05-13
## 5 2014-05-13
## 6 2014-05-15
## 7 2014-05-27
## 8 2014-05-27
## 9 2014-05-28
## 10 2014-06-18
## # ℹ 102 more rows
PRACTICE TIME !
Coerce long date
Try to coerce the column start_date_long
from the IRS
dataset to the Date
class. Don’t forget to include all
elements into the format option, including the symbols that separate the
components of the date!
‣ The lubridate
package offers a user-friendly
way to convert strings to dates.
‣ Specify the order of year, month, and day using “y”, “m”, and “d”.
‣ Separators between components are not necessary.
## [1] "2021-09-08"
## [1] "2020-01-01"
## [1] "Date"
‣ Let’s apply these functions to the start_day_long
variable in our dataset.
PRACTICE TIME !
Coerce typical date
Try to coerce the column start_date_typical
from the IRS
dataset to the Date
class using the functions in the
lubridate package.
PRACTICE TIME !
Base and lubridate formatting
The following table contains the formats found in the
adm_date_abbr
and adm_date_messy
formats from
our inpatient dataset. See if you can fill in the blank cells:
Date example | Base R | Lubridate |
---|---|---|
Dec 07, 2022 | %b%d,%Y | mdy |
03-27-2022 | %m-%d-%Y | mdy |
28.04.2022 | %d.%m.%Y | dmy |
2024/11/13 | %Y/%m/%d | ymd |
‣ We’ve learned methods to convert strings to dates.
‣ But what about dealing with multiple date formats in one column?
lubridate::parse_date_time()
‣ Working with dates often involves multiple formats within a single column.
‣ lubridate
offers parse_date_time()
function to handle such scenarios.
‣ The function uses the orders
argument for specifying
multiple date formats.
‣ Let’s start with a basic example:
## [1] "2001-07-30 UTC"
‣ The function works well for single string inputs and is equivalent
to using dmy()
in this case.
‣ The power of parse_date_time()
is truly evident when
handling multiple formats.
‣ Example with two different formats:
## [1] "2000-01-01 UTC" "2001-07-30 UTC"
‣ Order of formats in orders
argument does not matter
when the date strings are distinct:
## [1] "2000-01-01 UTC" "2001-07-30 UTC"
‣ Ambiguity in dates can be a challenge.
‣ For ambiguous dates, the order in orders
does
matter.
## [1] "2000-01-02 UTC" "2000-01-03 UTC"
## [1] "2000-02-01 UTC" "2000-03-01 UTC"
PRACTICE TIME !
Using parse_date_time
The dates in the code below are November 9th 2002, December 4th 2001, and June 5th 2003. Complete the code to coerce them from strings to dates.
‣ Returning to our dataset, focus on the
start_date_messy
column.
‣ We are dealing with four different formats:
- YYYY/MM/DD
- Month DD YYYY
- DD-MM-YYYY
- MM/DD/YYYY
‣ Now that we know how our data is formatted, we can use the
parse_date_time()
function to clean it up.
irs %>%
select(start_date_messy) %>%
mutate(start_date_clean = parse_date_time(start_date_messy,
orders = c("mdy", "dmy", "ymd")))
start_date_messy | start_date_clean |
---|---|
04/07/14 | 2014-04-07 |
April 22 2014 | 2014-04-22 |
May 13 2014 | 2014-05-13 |
13-05-2014 | 2014-05-13 |
May 13 2014 | 2014-05-13 |
‣ Is order important here? Let’s try a different order and find out!
irs %>%
select(start_date_messy) %>%
mutate(start_date_messy_1 = parse_date_time(start_date_messy,
orders = c("mdy", "dmy", "ymd")),
start_date_messy_2 = parse_date_time(start_date_messy, orders = c("dmy", "mdy", "ymd")))
start_date_messy | start_date_messy_1 | start_date_messy_2 |
---|---|---|
04/07/14 | 2014-04-07 | 2014-04-07 |
April 22 2014 | 2014-04-22 | 2014-04-22 |
May 13 2014 | 2014-05-13 | 2014-05-13 |
13-05-2014 | 2014-05-13 | 2014-05-13 |
May 13 2014 | 2014-05-13 | 2014-05-13 |
‣ If you scroll through, you will be able to observe that the conversions were the same.
‣ Why? Because there is no ambiguity in the dataset, and the order of
formats provided to orders
only matter when we have
ambiguous dates.
‣ You may think that there IS ambiguity: perhaps, the first date could be July 4, or April 7:
## # A tibble: 1 × 1
## start_date_messy
## <chr>
## 1 04/07/14
‣ But {lubridate} read it as April 7 (“mdy”) in both cases, even when
“dmy” was provided first for the start_date_messy_2
column?
‣ This is because there are other rows that resolve this amibiguity. For example, row 11 is “06/26/14”:
## # A tibble: 2 × 1
## start_date_messy
## <chr>
## 1 04/07/14
## 2 06/26/14
‣ This row can ONLY be interpreted as June 26. The number 26 cannot represent a month.
‣ So lubridate knows that all rows of the format “XX/XX/XX” in the dataset are “mm/dd/yy”
Using parse_date_time with adm_date_messy
With the help of the table you completed from the exercise, use the
parse_date_time()
function to clean up the
adm_date_messy
column in the inpatient dataset,
ip
!
ip %>%
select(adm_date_messy) %>%
mutate(adm_date_messy = parse_date_time(adm_date_messy,
orders = c("ymd", "mdy", "dmy")))
## # A tibble: 150 × 1
## adm_date_messy
## <dttm>
## 1 2021-05-23 00:00:00
## 2 2022-12-07 00:00:00
## 3 2022-03-27 00:00:00
## 4 2022-04-28 00:00:00
## 5 2023-06-28 00:00:00
## 6 2021-01-11 00:00:00
## 7 2022-04-10 00:00:00
## 8 2023-06-22 00:00:00
## 9 2020-10-21 00:00:00
## 10 2022-12-13 00:00:00
## # ℹ 140 more rows
‣ Up to now, we’ve been using the Date
class with a
default “YYYY-MM-DD” format.
‣ But, how can we display dates in a specific format for reports or graphs?
‣ We can use the format()
function in base R.
‣ This function gives us the freedom to customize the appearance of dates.
‣ Here’s a table to remind us how different date parts are represented:
Component | Symbol | Example |
---|---|---|
Year (numeric, with century) | %Y | 2023 |
Year (numeric, without century) | %y | 23 |
Month (numeric, 01-12) | %m | 01 |
Month (written out fully) | %B | January |
Month (abbreviated) | %b | Jan |
Day of the month | %d | 31 |
Day of the week (numeric, 1-7 with Sunday being 1) | %u | 5 |
Day of the week (written out fully) | %A | Friday |
Day of the week (abbreviated) | %a | Fri |
‣ Let’s try a single example:
‣ We want to convert 2000-01-31
to “Jan 31, 2000”.
## [1] "2000-01-31"
## [1] "Jan 31, 2000"
PRACTICE TIME !
Format date vector
Format the date below to MM/DD/YYYY using the format function:
## [1] "05/07/2018"
‣ Now, let’s apply this on our dataset:
‣ Creating a new variable start_date_char
from
start_date_default
.
‣ We’ll display it as DD/MM/YYYY.
irs %>%
select(start_date_default) %>%
mutate(start_date_char = format(start_date_default, "%d %b %Y")) %>%
select(start_date_default, start_date_char)
‣ One last example:
‣ Using end_date_default
and formatting it as
Month DD, YYYY.
irs %>%
select(end_date_default) %>%
mutate(end_date_char = format(end_date_default, "%B %d, %Y")) %>%
select(end_date_default, end_date_char)
‣ Congratulations on finishing the first Dates lesson!
‣ You now understand how Dates are stored, displayed, and formatted in R.
‣ Next, we’ll learn how to perform manipulations with dates and create basic time series graphs.
Coerce long date
Find code errors
Coerce typical date
Base and lubridate formatting
Date example | Base R | Lubridate |
---|---|---|
Dec 07, 2022 | %b %d, %Y | mdy |
03-27-2022 | %m-%d-%Y | mdy |
28.04.2022 | %d.%m.%Y | dmy |
2021/05/23 | %Y/%m/%d | ymd |
Using parse_date_time
Using parse_date_time with adm_date_messy
Format date vector
The following team members contributed to this lesson: