You now have a solid understanding of how dates are stored, displayed, and formatted in R. In this lesson, you will learn how to perform simple analyses with dates, such as calculating the time between dates and creating time series graphs! These skills are crucial for anyone working with health data, as they are the basis for understanding temporal patterns such as the progression of diseases over time and the fluctuation in population health metrics across different periods.
You know how to calculate intervals between dates
You know how to extract components from date columns
You know how to round dates
You are able to create simple time series graphs
Please load the packages needed for this lesson with the code below:
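The original code chunk is not shown here. As a minimal sketch, assuming the packages below (inferred from the functions used later in this lesson, not confirmed by the source), loading could look like this:

```r
# Assumed package set, inferred from the functions this lesson calls
library(dplyr)      # mutate(), select(), group_by(), summarise()
library(stringr)    # str_sub()
library(ggplot2)    # ggplot(), geom_bar(), geom_line()
library(lubridate)  # %--%, days(), month(), floor_date(), ...
```

Loading {tidyverse} in a single call would likely cover the first three of these.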
‣ We will be working with two datasets related to indoor residual spraying (IRS) for malaria control efforts in Illovo, Malawi.
‣ The first dataset provides the start and end dates of mosquito spraying campaigns in different villages.
## # A tibble: 112 × 9
## village target_spray sprayed coverage_p start_date_default end_date_default
## <chr> <dbl> <dbl> <dbl> <date> <date>
## 1 Mess 87 64 73.6 2014-04-07 2014-04-17
## 2 Nkombedzi 183 169 92.4 2014-04-22 2014-04-27
## 3 B Compou… 16 16 100 2014-05-13 2014-05-13
## 4 D Compou… 3 2 66.7 2014-05-13 2014-05-13
## 5 Post Off… 6 3 50 2014-05-13 2014-05-13
## 6 Mangulen… 375 372 99.2 2014-05-15 2014-05-26
## 7 Mangulen… 7 4 57.1 2014-05-27 2014-05-27
## 8 Old Scho… 24 23 95.8 2014-05-27 2014-05-27
## 9 Mwanza 671 636 94.8 2014-05-28 2014-06-16
## 10 Alumenda 226 226 100 2014-06-18 2014-06-27
## # ℹ 102 more rows
## # ℹ 3 more variables: start_date_typical <chr>, start_date_long <chr>,
## # start_date_messy <chr>
‣ The second dataset gives monthly data from 2015-2019 comparing the average incidence of malaria per 1000 people.
‣ It contrasts villages that received IRS against villages that didn’t.
## # A tibble: 48 × 5
## date ir_case ir_control avg_min avg_max
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2015-01-10 42.9 19.6 21.2 31.6
## 2 2015-02-03 61.0 10.1 21.5 32.9
## 3 2015-03-11 74.1 56.8 20.6 33.4
## 4 2015-04-15 95.2 34.7 18.5 32.3
## 5 2015-05-05 89.8 31.9 15.9 31.4
## 6 2015-06-22 59.8 22.6 14.0 29.1
## 7 2015-07-18 36.0 21.2 13.4 29.9
## 8 2015-08-19 32.4 16.6 14.4 30.8
## 9 2015-09-22 42.0 23.0 16.9 34.1
## 10 2015-10-14 17.4 19.8 20.5 36.2
## # ℹ 38 more rows
## [1] "date" "ir_case" "ir_control" "avg_min" "avg_max"
‣ Columns:
‣ ir_case: Malaria incidence in IRS villages
‣ ir_control: Malaria incidence in non-IRS villages
‣ date: Contains the month and a random day
‣ Average monthly minimum and maximum temperatures (avg_min and avg_max)
‣ The final dataset has 1,460 rows of daily weather data for the same Illovo region.
## # A tibble: 1,460 × 4
## date min_temp max_temp rain
## <date> <dbl> <dbl> <dbl>
## 1 2015-01-01 21.5 29.9 21.7
## 2 2015-01-02 19.6 30.4 2.2
## 3 2015-01-03 21.6 29.9 25.8
## 4 2015-01-04 20 29.5 1
## 5 2015-01-05 20 32.2 53
## 6 2015-01-06 21.8 31.1 60
## 7 2015-01-07 21 28.7 44.4
## 8 2015-01-08 22 29.5 30
## 9 2015-01-09 22 31 3
## 10 2015-01-10 19.9 31.6 0
## # ℹ 1,450 more rows
‣ Each row represents a single day and records:
‣ Minimum temperature (min_temp) in Celsius
‣ Maximum temperature (max_temp) in Celsius
‣ Rainfall (rain) in millimeters
‣ To begin, we’ll explore two ways to calculate intervals.
‣ The first uses the “-” operator in base R.
‣ The second utilizes the interval operator from the {lubridate} package.
‣ Let’s examine both methods and see how they differ.
‣ This approach calculates time differences by simply subtracting one date from another.
‣ Let’s craft two date variables and test this out!
date_1 <- as.Date("2020-01-01") # January 1st, 2020
date_2 <- as.Date("2020-01-31") # January 31st, 2020
# subtract the dates
date_2 - date_1
## Time difference of 30 days
‣ And there we have it! R displays the time difference in days.
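‣ The result of subtracting two dates is a "difftime" object rather than a plain number. The sketch below shows one way to turn it into a number or express it in other units. The variable names diff_days, n_days and n_weeks are illustrative, not from the lesson.

```r
date_1 <- as.Date("2020-01-01")
date_2 <- as.Date("2020-01-31")

# Subtraction returns a difftime object
diff_days <- date_2 - date_1
class(diff_days)  # "difftime"

# Convert to a plain number of days
n_days <- as.numeric(diff_days)  # 30

# difftime() lets you request other units, e.g. weeks (30 / 7)
n_weeks <- as.numeric(difftime(date_2, date_1, units = "weeks"))
```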
‣ Let’s see a second way to calculate time intervals
‣ We’ll use the %--% operator from the {lubridate} package.
‣ This operator is sometimes called the interval operator.
## [1] 2020-01-01 UTC--2020-01-31 UTC
‣ The output shows an interval between two dates.
‣ But what if we want to know how much time has passed in days?
‣ For this, we need to use the days() function.
‣ Dividing by days(1) tells lubridate to count in increments of one day at a time.
## [1] 30
‣ Leaving the parentheses empty, i.e., days(), would also work. This is because lubridate’s default is to count in increments of 1.
‣ But let’s say we want to count in increments of 5 days.
‣ We’d specify days(5)
## [1] 6
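‣ The steps above can be sketched as follows, assuming the same date_1 and date_2 as in the base R example. The names in_days, in_days_default and in_five_day_blocks are illustrative.

```r
library(lubridate)

date_1 <- as.Date("2020-01-01")
date_2 <- as.Date("2020-01-31")

# Build an interval running from date_1 to date_2
time_difference <- date_1 %--% date_2

# Count the interval in increments of one day
in_days <- time_difference / days(1)  # 30

# days() defaults to one day, so this gives the same answer
in_days_default <- time_difference / days()

# Count in increments of five days
in_five_day_blocks <- time_difference / days(5)  # 6
```

Note that the order matters: putting the later date first gives a negative result.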
‣ So which of the methods is best?
‣ Lubridate provides more flexibility and accuracy when working with dates in R.
‣ Let’s look at a simple example to see why.
‣ First, we’ll set two dates that are 6 years apart:
date_1 <- as.Date("2000-01-01") # January 1st, 2000
date_2 <- as.Date("2006-01-01") # January 1st, 2006
‣ How to calculate the years passed between these dates in base R?
‣ Subtract the two dates, date_2 - date_1
‣ Then, divide by an average day count, like 365.25 (accounting for leap years)
## [1] 6.001369
‣ Result is close to 6 but imprecise due to the averaging of leap years!
‣ (Can remove “days” by converting to numeric)
‣ Dividing by 365 or 366 will also give imprecise results:
## Time difference of 6.005479 days
## Time difference of 5.989071 days
‣ Need to account for two leap years (two extra days) between the dates
‣ Subtract those two days out first:
## [1] 6
‣ Painful for real data!
‣ With lubridate intervals, process is more straightforward:
‣ Leap years are handled for you
## [1] 6
‣ Small difference, but lubridate is the winner here.
‣ Also better at handling daylight saving time with date-times.
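‣ As a sketch of the comparison above, the three approaches can be run side by side. The variable names are illustrative. The manual correction of 2 days is specific to this pair of dates, which span the leap years 2000 and 2004.

```r
library(lubridate)

date_1 <- as.Date("2000-01-01")
date_2 <- as.Date("2006-01-01")

# Base R: divide by an average year length (close to 6, but not exact)
years_avg <- as.numeric(date_2 - date_1) / 365.25  # about 6.001369

# Base R: remove the two leap days by hand, then divide by 365
years_manual <- as.numeric(date_2 - date_1 - 2) / 365  # 6

# lubridate: the interval knows which years are leap years
years_lubridate <- (date_1 %--% date_2) / years(1)  # 6
```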
Lubridate intervals
Can you apply lubridate’s interval function to our IRS dataset?
Create a new column called spraying_time and, using lubridate’s %--% operator, calculate the number of days between start_date_default and end_date_default.
## # A tibble: 112 × 10
## village target_spray sprayed coverage_p start_date_default end_date_default
## <chr> <dbl> <dbl> <dbl> <date> <date>
## 1 Mess 87 64 73.6 2014-04-07 2014-04-17
## 2 Nkombedzi 183 169 92.4 2014-04-22 2014-04-27
## 3 B Compou… 16 16 100 2014-05-13 2014-05-13
## 4 D Compou… 3 2 66.7 2014-05-13 2014-05-13
## 5 Post Off… 6 3 50 2014-05-13 2014-05-13
## 6 Mangulen… 375 372 99.2 2014-05-15 2014-05-26
## 7 Mangulen… 7 4 57.1 2014-05-27 2014-05-27
## 8 Old Scho… 24 23 95.8 2014-05-27 2014-05-27
## 9 Mwanza 671 636 94.8 2014-05-28 2014-06-16
## 10 Alumenda 226 226 100 2014-06-18 2014-06-27
## # ℹ 102 more rows
## # ℹ 4 more variables: start_date_typical <chr>, start_date_long <chr>,
## # start_date_messy <chr>, spraying_time <dbl>
‣ Lubridate has a technical distinction between “intervals”, “periods” and “durations”.
‣ You can find out more here: STA 444/5 - Introductory Data Science using R
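‣ As a rough illustration of that distinction (a sketch, not taken from the linked resource): an interval is anchored to specific start and end dates, a period is a calendar unit such as "one day", and a duration is an exact number of seconds. The object names below are illustrative.

```r
library(lubridate)

# Interval: anchored to two specific dates
span <- ymd("2020-02-28") %--% ymd("2020-03-01")

# Period: a calendar unit, here one day
one_day_period <- days(1)

# Duration: an exact length of time, stored in seconds
one_day_duration <- ddays(1)

# 2020 is a leap year, so Feb 28 to Mar 1 spans two days
span_days <- span / days(1)  # 2

# One duration day is exactly 86400 seconds
duration_seconds <- as.numeric(one_day_duration)  # 86400
```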
‣ During data cleaning or analysis, sometimes you need to extract a specific component of your date variable.
‣ The {lubridate} package offers a set of useful functions for this.
‣ For example, to create a column with just the month of spraying, use the month() function.
irs %>%
mutate(month_start = month(start_date_default)) %>%
select(village, start_date_default, month_start)
## # A tibble: 112 × 3
## village start_date_default month_start
## <chr> <date> <dbl>
## 1 Mess 2014-04-07 4
## 2 Nkombedzi 2014-04-22 4
## 3 B Compound 2014-05-13 5
## 4 D Compound 2014-05-13 5
## 5 Post Office 2014-05-13 5
## 6 Mangulenje 2014-05-15 5
## 7 Mangulenje Senior 2014-05-27 5
## 8 Old School 2014-05-27 5
## 9 Mwanza 2014-05-28 5
## 10 Alumenda 2014-06-18 6
## # ℹ 102 more rows
‣ The function returns the month as a number from 1-12.
‣ If you want R to display the month’s name, use the label=TRUE argument.
irs %>%
mutate(month_start = month(start_date_default, label = TRUE)) %>%
select(village, start_date_default, month_start)
## # A tibble: 112 × 3
## village start_date_default month_start
## <chr> <date> <ord>
## 1 Mess 2014-04-07 Apr
## 2 Nkombedzi 2014-04-22 Apr
## 3 B Compound 2014-05-13 May
## 4 D Compound 2014-05-13 May
## 5 Post Office 2014-05-13 May
## 6 Mangulenje 2014-05-15 May
## 7 Mangulenje Senior 2014-05-27 May
## 8 Old School 2014-05-27 May
## 9 Mwanza 2014-05-28 May
## 10 Alumenda 2014-06-18 Jun
## # ℹ 102 more rows
‣ Similarly, to extract the year, use the year() function.
irs %>%
mutate(year_start = year(start_date_default)) %>%
select(village, start_date_default, year_start)
## # A tibble: 112 × 3
## village start_date_default year_start
## <chr> <date> <dbl>
## 1 Mess 2014-04-07 2014
## 2 Nkombedzi 2014-04-22 2014
## 3 B Compound 2014-05-13 2014
## 4 D Compound 2014-05-13 2014
## 5 Post Office 2014-05-13 2014
## 6 Mangulenje 2014-05-15 2014
## 7 Mangulenje Senior 2014-05-27 2014
## 8 Old School 2014-05-27 2014
## 9 Mwanza 2014-05-28 2014
## 10 Alumenda 2014-06-18 2014
## # ℹ 102 more rows
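‣ Other lubridate extractor functions follow the same pattern. Here is a small sketch on a single date, the first start date in the irs data. The choice of day(), yday() and quarter() is ours, not the lesson’s.

```r
library(lubridate)

example_date <- as.Date("2014-04-07")

day_of_month <- day(example_date)       # 7
day_of_year  <- yday(example_date)      # 97
year_quarter <- quarter(example_date)   # 2
```

Any of these could be used inside mutate() in the same way as month() and year() above.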
Extracting weekdays
Create a new variable called wday_start and extract the day of the week that the spraying started in the same way as above but with the wday() function. Try to display the days of the week written out rather than numerically.
irs %>%
mutate(wday_start = wday(start_date_default, label = TRUE)) %>%
select(village, start_date_default, wday_start)
## # A tibble: 112 × 3
## village start_date_default wday_start
## <chr> <date> <ord>
## 1 Mess 2014-04-07 Mon
## 2 Nkombedzi 2014-04-22 Tue
## 3 B Compound 2014-05-13 Tue
## 4 D Compound 2014-05-13 Tue
## 5 Post Office 2014-05-13 Tue
## 6 Mangulenje 2014-05-15 Thu
## 7 Mangulenje Senior 2014-05-27 Tue
## 8 Old School 2014-05-27 Tue
## 9 Mwanza 2014-05-28 Wed
## 10 Alumenda 2014-06-18 Wed
## # ℹ 102 more rows
‣ Often, you’ll extract specific date components for visualization.
‣ For instance, to visualize the months when spraying starts:
‣ First, create a new month variable using month().
‣ Then, plot a bar graph with geom_bar.
irs %>%
mutate(month = month(start_date_default, label = TRUE)) %>%
# then pass to ggplot:
ggplot() +
geom_bar(aes(x= month))
‣ Most spraying campaigns began between July and November. No campaigns in the first three months of the year.
Visualizing spray end months
Using the irs dataset, create a new graph showing the months when the spraying campaign ended and compare it to the graph of when they started. Do they have a similar pattern?
irs %>%
mutate(month_end = month(end_date_default, label = TRUE)) %>%
# then pass to ggplot:
ggplot() +
geom_bar(aes(x= month_end))
‣ We often round dates up or down for analysis or visualization.
‣ Let’s see what we mean by rounding with a few examples.
‣ Consider the date: March 17th 2012.
‣ If we want to round down to the nearest month, we use the floor_date() function from {lubridate} with unit="month".
## [1] "2012-03-01"
‣ As we observe, our date becomes March 1st, 2012.
‣ Now let’s round up.
‣ Consider the date: January 3rd 2020.
‣ To round up, we use the ceiling_date() function.
## [1] "2020-02-01"
‣ With ceiling_date(), January 3rd becomes February 1st.
‣ We can also round without specifying up or down.
‣ The dates automatically round to the nearest specified unit.
## [1] "2000-11-01" "2000-12-01"
‣ Here, by rounding to the nearest month:
‣ November 3rd becomes November 1st
‣ November 27th becomes December 1st.
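‣ The three rounding examples above can be reproduced with a sketch like this. The variable names are illustrative, and round_date() is assumed to be the function behind the "nearest unit" example.

```r
library(lubridate)

# Round down to the first day of the month
rounded_down <- floor_date(as.Date("2012-03-17"), unit = "month")
rounded_down  # "2012-03-01"

# Round up to the first day of the next month
rounded_up <- ceiling_date(as.Date("2020-01-03"), unit = "month")
rounded_up  # "2020-02-01"

# Round to whichever month boundary is nearest
nov_dates <- as.Date(c("2000-11-03", "2000-11-27"))
rounded_nearest <- round_date(nov_dates, unit = "month")
rounded_nearest  # "2000-11-01" "2000-12-01"
```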
Rounding dates practice
We can also round up or down to the nearest year. What do you think the output would be if we round down the date November 29th 2001 to the nearest year:
‣ Let’s see how rounding can be useful!
‣ Consider our weather data.
## # A tibble: 1,460 × 4
## date min_temp max_temp rain
## <date> <dbl> <dbl> <dbl>
## 1 2015-01-01 21.5 29.9 21.7
## 2 2015-01-02 19.6 30.4 2.2
## 3 2015-01-03 21.6 29.9 25.8
## 4 2015-01-04 20 29.5 1
## 5 2015-01-05 20 32.2 53
## 6 2015-01-06 21.8 31.1 60
## 7 2015-01-07 21 28.7 44.4
## 8 2015-01-08 22 29.5 30
## 9 2015-01-09 22 31 3
## 10 2015-01-10 19.9 31.6 0
## # ℹ 1,450 more rows
‣ The data you see is daily weather data.
‣ Daily data can be noisy due to day-to-day variation.
‣ We want to look at seasonal patterns; monthly averages might be more suitable.
‣ How do we do this? Let’s try aggregating by month using the str_sub() function.
## # A tibble: 1,460 × 5
## date min_temp max_temp rain month_year
## <date> <dbl> <dbl> <dbl> <chr>
## 1 2015-01-01 21.5 29.9 21.7 2015-01
## 2 2015-01-02 19.6 30.4 2.2 2015-01
## 3 2015-01-03 21.6 29.9 25.8 2015-01
## 4 2015-01-04 20 29.5 1 2015-01
## 5 2015-01-05 20 32.2 53 2015-01
## 6 2015-01-06 21.8 31.1 60 2015-01
## 7 2015-01-07 21 28.7 44.4 2015-01
## 8 2015-01-08 22 29.5 30 2015-01
## 9 2015-01-09 22 31 3 2015-01
## 10 2015-01-10 19.9 31.6 0 2015-01
## # ℹ 1,450 more rows
‣ Now, we’ll group by month_year and calculate the average rainfall.
weather_summary_1 <- weather %>%
mutate(month_year = str_sub(date, 1, 7)) %>%
group_by(month_year) %>%
summarise(avg_rain = mean(rain))
‣ A problem arises! Our month_year is a character, not a date.
‣ That means it’s not continuous. Let’s try plotting:
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
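‣ The plotting code is not shown above. Here is a minimal sketch of the kind of call that triggers this message, using a small made-up dataset (toy_weather is hypothetical and stands in for the lesson’s weather data):

```r
library(dplyr)
library(stringr)
library(ggplot2)

# Hypothetical stand-in for the weather data: three months of daily rain
toy_dates <- seq(as.Date("2015-01-01"), as.Date("2015-03-31"), by = "day")
toy_weather <- tibble(
  date = toy_dates,
  rain = rep(c(0, 5, 10), length.out = length(toy_dates))
)

# Aggregate by a character month label, as in weather_summary_1
toy_summary_chr <- toy_weather %>%
  mutate(month_year = str_sub(as.character(date), 1, 7)) %>%
  group_by(month_year) %>%
  summarise(avg_rain = mean(rain))

# month_year is a character, so ggplot treats each month as its own group
# and geom_line() has no points to connect when the plot is printed
plot_chr <- ggplot(toy_summary_chr, aes(x = month_year, y = avg_rain)) +
  geom_line()
```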
‣ We need a different approach!
‣ Let’s round dates to the month using floor_date().
‣ This way, we get a true date variable for our grouping.
weather_summary_2 <- weather %>%
mutate(month_year= floor_date(date, unit = "months")) %>%
# group by and summarise
group_by(month_year) %>%
summarise(avg_rain = mean(rain))
‣ Now, let’s plot this newly aggregated data!
‣ That’s much better!
‣ Easier to see seasonal trends and yearly variations.
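‣ The plotting code for this step is not shown above either. Here is a minimal sketch, again on a small made-up dataset (toy_weather is hypothetical). With the real data you would pass weather_summary_2 to ggplot() instead.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Hypothetical stand-in for the weather data: three months of daily rain
toy_dates <- seq(as.Date("2015-01-01"), as.Date("2015-03-31"), by = "day")
toy_weather <- tibble(
  date = toy_dates,
  rain = rep(c(0, 5, 10), length.out = length(toy_dates))
)

# Round each date down to the first of its month, then average
toy_summary_date <- toy_weather %>%
  mutate(month_year = floor_date(date, unit = "month")) %>%
  group_by(month_year) %>%
  summarise(avg_rain = mean(rain))

# month_year is a true Date, so the x-axis is continuous and
# geom_line() can connect the monthly averages
plot_date <- ggplot(toy_summary_date, aes(x = month_year, y = avg_rain)) +
  geom_line()
```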
NOW TRY THIS FINAL PRACTICE QUESTION!
Plot avg monthly min and max temperatures
Using the weather data, create a new line graph plotting the average monthly minimum and maximum temperatures from 2015-2019.
weather %>%
mutate(month_year= floor_date(date, unit = "months")) %>%
# group by and summarise
group_by(month_year) %>%
summarise(avg_min = mean(min_temp),
avg_max = mean(max_temp)) %>%
ggplot() +
geom_line(aes(x=month_year, y = avg_min), color = "blue") +
geom_line(aes(x=month_year, y = avg_max),
color = "green")
This lesson covered fundamental skills for working with dates in R: calculating intervals, extracting components, rounding, and creating time series visualizations. With these key building blocks mastered, you can now start to wrangle date data to uncover and analyze patterns over time.
‣ You know how to calculate intervals between dates
‣ You know how to extract components from date columns
‣ You know how to round dates
‣ You are able to create simple time series graphs
Lubridate weeks
oct_31 <- as.Date("2023-10-31")
jul_20 <- as.Date("2023-07-20")
time_difference <- oct_31 %--% jul_20
time_difference/weeks(1)
## [1] -14.71429
Lubridate intervals
irs %>%
mutate(spraying_time = interval(start_date_default, end_date_default)/days(1)) %>%
select(spraying_time)
## # A tibble: 112 × 1
## spraying_time
## <dbl>
## 1 10
## 2 5
## 3 0
## 4 0
## 5 0
## 6 11
## 7 0
## 8 0
## 9 19
## 10 9
## # ℹ 102 more rows
Extracting weekdays
## # A tibble: 112 × 1
## wday_start
## <ord>
## 1 Mon
## 2 Tue
## 3 Tue
## 4 Tue
## 5 Tue
## 6 Thu
## 7 Tue
## 8 Tue
## 9 Wed
## 10 Wed
## # ℹ 102 more rows
Visualizing spray end months
irs %>%
mutate(month_end = month(end_date_default, label = TRUE)) %>%
ggplot(aes(x = month_end)) +
geom_bar()
Rounding dates practice
date_round <- as.Date("2001-11-29")
rounded_date <- floor_date(date_round, unit="year")
rounded_date
## [1] "2001-01-01"
Plot avg monthly min and max temperatures
weather %>%
mutate(month_year = floor_date(date, unit="month")) %>%
group_by(month_year) %>%
summarise(avg_min_temp = mean(min_temp),
avg_max_temp = mean(max_temp)) %>%
ggplot() +
geom_line(aes(x = month_year, y = avg_min_temp), color = "blue") +
geom_line(aes(x = month_year, y = avg_max_temp), color = "red")
The following team members contributed to this lesson: