Code
# Load packages
if(!require(pacman)) install.packages("pacman")
::p_load(dplyr, ggplot2, tidyr, lubridate, outbreaks, scales, ggrepel, ggthemes, zoo, here)
pacmanoptions(scipen=999)
By analyzing time series data—observations made sequentially over time—epidemiologists can spot trends and patterns in disease outbreaks, and inform decision-making for improved health outcomes. In this lesson, we explore using ggplot and the tidyverse to visualize time series data and effectively communicate insights.
By the end of this lesson you will be able to:
Install and load the necessary packages with the following code chunk:
# Load packages
if(!require(pacman)) install.packages("pacman")
::p_load(dplyr, ggplot2, tidyr, lubridate, outbreaks, scales, ggrepel, ggthemes, zoo, here)
pacmanoptions(scipen=999)
Setting options(scipen = 999)
prevents the use of scientific notation in our plots, making long numbers easier to read and interpret.
To get started with visualizing time series data, we’ll examine the dynamics of tuberculosis (TB) notifications in Australia over time, comparing notifications in urban and rural areas. The source dataset can be accessed here
Notifications are a technical term for the number of cases of a disease that are reported to public health authorities.
Let’s start by loading and inspecting the data:
<- read_csv(here::here("data/aus_tb_notifs.csv"))
tb_data_aus head(tb_data_aus)
# A tibble: 6 × 3
period rural urban
<chr> <dbl> <dbl>
1 1993Q1 6 51
2 1993Q2 11 52
3 1993Q3 13 67
4 1993Q4 14 82
5 1994Q1 16 63
6 1994Q2 15 65
This dataset includes the columns period
(time in quarterly format, e.g., ‘1993Q1’), rural
(cases in rural areas), and urban
(cases in urban areas).
We would like to visualize the number of annual TB notifications in urban and rural areas, but the data is currently in a quarterly format. So, we need to aggregate the data by year.
Let’s start by extracting the year from the period
column. We do this using the str_sub()
function from the stringr
package:
%>%
tb_data_aus mutate(year = str_sub(period, 1, 4)) %>%
# convert back to numeric
mutate(year = as.numeric(year))
# A tibble: 120 × 4
period rural urban year
<chr> <dbl> <dbl> <dbl>
1 1993Q1 6 51 1993
2 1993Q2 11 52 1993
3 1993Q3 13 67 1993
4 1993Q4 14 82 1993
5 1994Q1 16 63 1994
6 1994Q2 15 65 1994
7 1994Q3 18 60 1994
8 1994Q4 25 71 1994
9 1995Q1 7 77 1995
10 1995Q2 9 52 1995
# ℹ 110 more rows
The str_sub()
function takes three arguments: the string we want to extract from, the starting position, and the ending position. In this case, we want to extract the first four characters from the period
column, which correspond to the year.
Now, let’s aggregate the data by year. We can do this using the group_by()
and summarise()
functions:
<- tb_data_aus %>%
annual_data_aus mutate(year = str_sub(period, 1, 4)) %>%
mutate(year = as.numeric(year)) %>%
# group by year
group_by(year) %>%
# sum the number of cases in each year
summarise(rural = sum(rural),
urban = sum(urban))
annual_data_aus
# A tibble: 30 × 3
year rural urban
<dbl> <dbl> <dbl>
1 1993 44 252
2 1994 74 259
3 1995 26 264
4 1996 28 265
5 1997 20 272
6 1998 14 208
7 1999 41 262
8 2000 28 265
9 2001 14 278
10 2002 17 260
# ℹ 20 more rows
Now that we seem to have the data in the format we want, let’s make an initial line plot:
ggplot(annual_data_aus, aes(x = year)) +
geom_line(aes(y = urban, colour = "Urban")) +
geom_line(aes(y = rural, colour = "Rural"))
This is an informative plot, however, there is some unnecessary code duplication, though you may not yet realize it. This will become clearer if we try to add additional geoms, such as points, or text:
ggplot(annual_data_aus, aes(x = year)) +
geom_line(aes(y = urban, colour = "Urban")) +
geom_line(aes(y = rural, colour = "Rural")) +
geom_point(aes(y = urban, colour = "Urban")) +
geom_point(aes(y = rural, colour = "Rural")) +
geom_text(aes(y = urban, label = urban), size = 2, nudge_y = 20) +
geom_text(aes(y = rural, label = rural), size = 2, nudge_y = 20)
As you can see, we have to repeat the same lines of code for each geom. This is not only tedious, but also makes the code more difficult to read and interpret. If we had more than two categories, as often happens, it would be even more cumbersome.
Fortunately, there is a better way. We can use the pivot_longer()
function from the {tidyr} package to reshape the data into a format that is more suitable for plotting:
# Using tidyr's `pivot_longer` to reshape the data
%>%
annual_data_aus pivot_longer(cols = c("urban", "rural"))
# A tibble: 60 × 3
year name value
<dbl> <chr> <dbl>
1 1993 urban 252
2 1993 rural 44
3 1994 urban 259
4 1994 rural 74
5 1995 urban 264
6 1995 rural 26
7 1996 urban 265
8 1996 rural 28
9 1997 urban 272
10 1997 rural 20
# ℹ 50 more rows
The code above has converted the data from a “wide” format to a “long” format. This is a more suitable format for plotting, as it allows us to map a specific column to the colour
aesthetic.
Before we plot this long dataset, let’s rename the columns to make them more informative:
<- annual_data_aus %>%
aus_long pivot_longer(cols = c("urban", "rural")) %>%
rename(region = name, cases = value)
We’re ready to plot the data again. We map the colour and group aesthetics to the region
column, which contains the two categories of interest: urban and rural.
ggplot(aus_long, aes(x = year, y = cases, colour = region, group = region)) +
geom_line()
The plotting code is now more concise, thanks to the pivoting operation executed previously.
We can now also add points and text labels with significantly less code:
ggplot(aus_long, aes(x = year, y = cases, colour = region, group = region)) +
geom_line() +
geom_point() +
geom_text(aes(label = cases), size = 2, nudge_y = 20)
Great! We now have a clear view of of trends in annual TB case notifications in rural and urban areas over time. However, there are still some aesthetic improvements we can make; we will cover these in the next section.
Consider the Benin dataset shown below, which contains information about bacteriologically confirmed and clinically diagnosed TB cases for several years in Benin. (The data was sourced from a paper here
<- read_csv(here("data/benin_tb_notifs.csv"))
tb_data_benin
<- tb_data_benin %>%
benin_long pivot_longer(c('new_clindx', 'new_labconf')) %>%
rename(status = name, cases = value)
ggplot(benin_long, aes(x = year, y = cases, colour = status, group = status)) +
geom_line() +
geom_point() +
geom_text(aes(label = cases), size = 2, nudge_y = 20)
Reshape the dataset using pivot_longer()
, then create a plot with two lines, one for each type of TB case diagnosis. Add points and text labels to the plot.
In this section, we will focus on improving the aesthetics of time series line graphs to enhance their clarity and visual appeal.
Where we last left off, we had a plot that looked like this:
ggplot(aus_long, aes(x = year, y = cases, colour = region, group = region)) +
geom_line() +
geom_point() +
geom_text(aes(label = cases), size = 2, nudge_y = 20)
One problem with this plot is that the text labels are a bit too small. Such tiny labels are not ideal for a public-facing plot, as they are difficult to read. However, if we increase the label size, the labels will start to overlap, as shown below:
ggplot(aus_long, aes(x = year, y = cases, colour = region, group = region)) +
geom_line() +
geom_point() +
geom_text(aes(label = cases), size = 2.8, nudge_y = 20)
To avoid this clutter, a handy technique is to display labels for only certain years. To do this, we can give a custom dataset to the geom_text()
function. In this case, we will create a dataset that contains only the even years:
<- aus_long %>%
even_years filter(year %% 2 == 0) # Keep only years that are multiples of 2
ggplot(aus_long, aes(x = year, y = cases, colour = region, group = region)) +
geom_line() +
geom_point() +
geom_text(data = even_years, aes(label = cases),
size = 2.8, nudge_y = 20)
Great, now we have larger labels and they do not overlap.
While the plot above is an improvement, it would be even better if we could display the labels for all years. We can do this by displaying the labels for the even years above the data points, and the labels for the odd years below the data points.
Including many data points (within reason) in your plots is helpful for public health officials; as they can pull quick numbers from the plot when trying to make decisions, without needing to look at the reference datasets.
To address this, let’s create a filtered dataset for odd years, and then use geom_text()
twice, once for each filtered dataset.
<- aus_long %>%
odd_years filter(year %% 2 != 0) # Keep only years that are NOT multiples of 2
ggplot(aus_long, aes(x = year, y = cases, colour = region, group = region)) +
geom_line() +
geom_point() +
geom_text(data = even_years, aes(label = cases),
nudge_y = 20, size = 2.8) +
geom_text(data = odd_years, aes(label = cases),
nudge_y = -20, size = 2.8)
ggrepel::geom_text_repel()
The plot above is clear, but there is still some overlap between the labels and the line.
To further enhance clarity, we can use the geom_text_repel()
from the {ggrepel} package.
This function nudges individual labels to prevent overlap, and connects labels to their data points with lines, making it easier to see which label corresponds to which data point, and allowing us to increase the distance between the labels and the data points.
ggplot(aus_long, aes(x = year, y = cases, colour = region, group = region)) +
geom_line() +
geom_point() +
geom_text_repel(data = even_years, aes(label = cases),
nudge_y = 60, size = 2.8, segment.size = 0.1) +
geom_text_repel(data = odd_years, aes(label = cases),
nudge_y = -60, size = 2.8, segment.size = 0.1)
As you can see, the function geom_text_repel()
takes basically the same arguments as geom_text()
. The extra argument, segment.size
, controls the width of the lines connecting the labels to the data points.
It is often useful to customize the color palette of your plots, so that they match, for example, the color scheme of your organization.
We can customize the colors of the lines using the scale_color_manual()
function. Below, we specify two colors, one for each region:
ggplot(aus_long, aes(x = year, y = cases, colour = region, group = region)) +
geom_line() +
geom_point() +
geom_text_repel(data = even_years, aes(label = cases),
nudge_y = 60, size = 2.8, segment.size = 0.1) +
geom_text_repel(data = odd_years, aes(label = cases),
nudge_y = -60, size = 2.8, segment.size = 0.1) +
scale_color_manual(values = c("urban" = "#0fa3b1",
"rural" = "#2F2C4E"))
Success!
Finally, let’s add a set of finish touches. We’ll annotate the plot with appropriate titles, axis labels, and captions, and modify the theme:
ggplot(aus_long, aes(x = year, y = cases, colour = region, group = region)) +
geom_line(linewidth = 1) +
geom_text_repel(data = even_years, aes(label = cases),
nudge_y = 60, size = 2.8, segment.size = 0.08) +
geom_text_repel(data = odd_years, aes(label = cases),
nudge_y = -50, size = 2.8, segment.size = 0.08) +
scale_color_manual(values = c("urban" = "#0fa3b1", "rural" = "#2F2C4E")) +
labs(title = "Tuberculosis Notifications in Australia",
subtitle = "1993-2022",
caption = "Source: Victoria state Government Department of Health",
x = "Year",
color = "Region") +
::theme_few() +
ggthemestheme(legend.position = "right")
This covers some options for improving line chart aesthetics! Feel free to further tweak and adjust visuals based on your specific analysis needs.
We’ve transformed our plot into a visually appealing, easy-to-read representation of TB notification trends in Australia. We balanced the need for detailed information with a clear presentation, making our plot both informative and accessible.
Consider the following plot, which shows the number of child TB cases in three countries over time:
<- tidyr::who2 %>%
tb_child_cases_southam transmute(country, year,
tb_cases_children = sp_m_014 + sp_f_014 + sn_m_014 + sn_f_014) %>%
filter(country %in% c("Brazil", "Colombia", "Chile")) %>%
filter(!is.na(tb_cases_children))
<- tb_child_cases_southam %>%
years_even filter(year %% 2 == 0)
<- tb_child_cases_southam %>%
years_odd filter(year %% 2 == 1)
%>%
tb_child_cases_southam ggplot(aes(x = year, y = tb_cases_children, color = country)) +
geom_line() +
geom_point() +
geom_text_repel(data = years_even, aes(label = tb_cases_children), nudge_y = 60, size = 2, segment.size = 0.08) +
geom_text_repel(data = years_odd, aes(label = tb_cases_children), nudge_y = -50, size = 2, segment.size = 0.08) +
scale_color_manual(values = c('Brazil' = "#212738", 'Colombia' ="#F97068", 'Chile' = "#067BC2")) +
labs(title = "Tuberculosis Notifications in 3 South American Countries",
subtitle = "2006-2012",
caption = "Source: World Health Organisation",
x = "Year",
color = "Countries") +
theme_classic() +
theme(legend.position = "right")
Build on this plot, implementing the following improvements:
geom_text
labels to alternate above and below the lines, similar to the example we saw above.c("#212738", "#F97068", "#067BC2")
theme_classic()
?tidyr::who2
into the console to learn more about the data source.)geom_ribbon()
In time series visualizations, it is often important to plot confidence intervals to indicate the level of uncertainty in your data.
We will demonstrate how to do this using a dataset on new HIV infections in Brazil, which includes estimated numbers for male and female cases along with confidence intervals. The dataset is sourced from the World Health Organization (WHO) and can be accessed here.
Let’s start by loading and inspecting the dataset:
<-
hiv_data_brazil ::import(here("data/new_hiv_infections_gho.xlsx"),
riosheet = "Brazil") %>%
as_tibble() %>%
::clean_names()
janitor hiv_data_brazil
# A tibble: 89 × 5
continent country year sex new_hiv_cases
<chr> <chr> <dbl> <chr> <chr>
1 Americas Brazil 2022 Female 13 000 [12 000 – 15 000]
2 Americas Brazil 2022 Male 37 000 [35 000 – 40 000]
3 Americas Brazil 2022 Both sexes 51 000 [47 000 – 54 000]
4 Americas Brazil 2021 Female 14 000 [12 000 – 15 000]
5 Americas Brazil 2021 Male 37 000 [34 000 – 39 000]
6 Americas Brazil 2021 Both sexes 50 000 [47 000 – 53 000]
7 Americas Brazil 2020 Female 14 000 [13 000 – 15 000]
8 Americas Brazil 2020 Male 35 000 [32 000 – 37 000]
9 Americas Brazil 2020 Both sexes 49 000 [46 000 – 52 000]
10 Americas Brazil 2019 Female 14 000 [13 000 – 15 000]
# ℹ 79 more rows
We can see that the new_hiv_cases
column contains both the number of cases and the corresponding confidence intervals in square brackets. This format cannot be directly used for plotting, so we will need to extract them into pure numeric forms.
First, to separate these values, we can use the separate()
function from the {tidyr} package:
%>%
hiv_data_brazil separate(new_hiv_cases,
into = c("cases", "cases_lower", "cases_upper"),
sep = "\\[|–")
# A tibble: 89 × 7
continent country year sex cases cases_lower cases_upper
<chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 Americas Brazil 2022 Female "13 000 " "12 000 " " 15 000]"
2 Americas Brazil 2022 Male "37 000 " "35 000 " " 40 000]"
3 Americas Brazil 2022 Both sexes "51 000 " "47 000 " " 54 000]"
4 Americas Brazil 2021 Female "14 000 " "12 000 " " 15 000]"
5 Americas Brazil 2021 Male "37 000 " "34 000 " " 39 000]"
6 Americas Brazil 2021 Both sexes "50 000 " "47 000 " " 53 000]"
7 Americas Brazil 2020 Female "14 000 " "13 000 " " 15 000]"
8 Americas Brazil 2020 Male "35 000 " "32 000 " " 37 000]"
9 Americas Brazil 2020 Both sexes "49 000 " "46 000 " " 52 000]"
10 Americas Brazil 2019 Female "14 000 " "13 000 " " 15 000]"
# ℹ 79 more rows
In the code above, we split the new_hiv_cases
column into three new columns: cases
, cases_lower
, and cases_upper
. We use the [
and –
as separators. The double backslash \\
is used to escape the square bracket, which has a special meaning in regular expressions. And the |
is used to indicate that either the [
or the –
can be used as a separator.
Large Language Models like ChatGPT are excellent at regular expression understanding. If you’re ever stuck with code like sep = "\\[|–"
and want to understand what it does, you can ask ChatGPT to explain it to you. And if you need to generate such expressions yourself, you can ask ChatGPT to generate them for you.
Next, we need to convert these string values into numeric values, removing any non-numeric characters.
<-
hiv_data_brazil_clean %>%
hiv_data_brazil separate(new_hiv_cases,
into = c("cases", "cases_lower", "cases_upper"),
sep = "\\[|–") %>%
mutate(across(c("cases", "cases_lower", "cases_upper"),
~ str_replace_all(.x, "[^0-9]", "") %>%
as.numeric()))
hiv_data_brazil_clean
# A tibble: 89 × 7
continent country year sex cases cases_lower cases_upper
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Americas Brazil 2022 Female 13000 12000 15000
2 Americas Brazil 2022 Male 37000 35000 40000
3 Americas Brazil 2022 Both sexes 51000 47000 54000
4 Americas Brazil 2021 Female 14000 12000 15000
5 Americas Brazil 2021 Male 37000 34000 39000
6 Americas Brazil 2021 Both sexes 50000 47000 53000
7 Americas Brazil 2020 Female 14000 13000 15000
8 Americas Brazil 2020 Male 35000 32000 37000
9 Americas Brazil 2020 Both sexes 49000 46000 52000
10 Americas Brazil 2019 Female 14000 13000 15000
# ℹ 79 more rows
The code above looks complex, but essentially, it cleans the data by keeping only numeric characters and then converts these numbers to actual numeric values. See our lesson on the across()
function for further details.
We’re finally ready to plot the data. We’ll use ggplot’s geom_ribbon()
to display the confidence intervals:
%>%
hiv_data_brazil_clean filter(sex == "Both sexes") %>%
ggplot(aes(x = year, y = cases)) +
geom_line() +
geom_ribbon(aes(ymin = cases_lower, ymax = cases_upper), alpha = 0.4)
The geom_ribbon()
function takes the x
and y
aesthetics like geom_line()
, but it also takes in ymin
and ymax
aesthetics, to determine the vertical extent of the ribbon. We also set the transparency of the ribbon using the alpha
argument.
We can create a separate ribbon for men and women to compare their infection trends.
%>%
hiv_data_brazil_clean filter(sex != "Both sexes") %>%
ggplot(aes(x = year, y = cases, color = sex, fill = sex)) +
geom_line() +
geom_ribbon(aes(ymin = cases_lower, ymax = cases_upper), alpha = 0.4)
Notably HIV infection rates among women have been falling in recent years, but those among men have been rising.
Consider the following dataset that shows the number of annual malaria cases in Kenya and Nigeria. The data is sourced from the WHO Global Health Observatory data repository and can be accessed here.
<- read_csv("data/nigeria_kenya_malaria.csv")
nig_ken_mal
<- nig_ken_mal %>%
nig_ken_mal1 separate(malaria_cases,
into = c("cases", "cases_lower", "cases_upper"),
sep = " \\(| to |\\)",
extra = "drop") %>% # Drops any remaining unmatched part
mutate(across(c("cases", "cases_lower", "cases_upper"),
~ str_replace_all(.x, "[^0-9]", "") %>%
as.numeric()))
%>%
nig_ken_mal1 ggplot(aes(x = year, y = cases, color = country, fill = country)) + # plot cases vs. year, with color and fill set to country
geom_line() + # add line graph
geom_ribbon(aes(ymin = cases_lower, ymax = cases_upper), alpha = 0.4) # add confidence intervals
Write code to extract the confidence intervals from the “malaria_cases” column and create a plot with confidence intervals using geom_ribbon()
. Use a different color for each country.
When analyzing time series data, it is common for daily or granular measurements to show a lot of noise and variability, and this can hide the important trends we are actually interested in. Smoothing techniques can help highlight these trends and patterns. We’ll explore several techniques for this in the sections below.
First though, let’s do some data preparation!
Consider the following linelist of pediatric malaria admissions in four hospitals in Mozambique (Data source):
<-
mal ::import(here("data/pediatric_malaria_data_joao_2021.xlsx")) %>%
rioas_tibble() %>%
mutate(date_positive_test = as.Date(date_positive_test)) %>%
# Keep data from 2019-2020
filter(date_positive_test >= as.Date("2019-01-01"),
<= as.Date("2020-12-31"))
date_positive_test mal
# A tibble: 20,939 × 4
date_positive_test neighbourhood sex age
<date> <chr> <chr> <chr>
1 2020-01-22 25 de Junho 1 M 6-11 meses
2 2020-01-22 Chicueu F 5-14 anos
3 2020-01-22 Mussessa M 5-14 anos
4 2020-01-22 Nhamizara M 12-23 meses
5 2020-01-22 Nhamizara F 12-23 meses
6 2020-01-22 Unidade F 5-14 anos
7 2020-01-22 Bapua F 12-23 meses
8 2020-01-22 Bapua F 5-14 anos
9 2020-01-22 7 de Abril M 12-23 meses
10 2020-01-22 Nhanguzue F 5-14 anos
# ℹ 20,929 more rows
Each row corresponds to a single malaria case, and the date_positive_test
column indicates the date when the child tested positive for malaria.
To get a count of cases per day—that is, an incidence table—we can simply use count()
to aggregate the cases by date of positive test:
%>%
mal count(date_positive_test, name = "cases")
# A tibble: 235 × 2
date_positive_test cases
<date> <int>
1 2019-01-01 67
2 2019-01-02 120
3 2019-01-03 112
4 2019-01-04 203
5 2019-01-05 85
6 2019-01-07 115
7 2019-01-08 196
8 2019-01-10 89
9 2019-01-11 55
10 2019-01-12 69
# ℹ 225 more rows
mal
# A tibble: 20,939 × 4
date_positive_test neighbourhood sex age
<date> <chr> <chr> <chr>
1 2020-01-22 25 de Junho 1 M 6-11 meses
2 2020-01-22 Chicueu F 5-14 anos
3 2020-01-22 Mussessa M 5-14 anos
4 2020-01-22 Nhamizara M 12-23 meses
5 2020-01-22 Nhamizara F 12-23 meses
6 2020-01-22 Unidade F 5-14 anos
7 2020-01-22 Bapua F 12-23 meses
8 2020-01-22 Bapua F 5-14 anos
9 2020-01-22 7 de Abril M 12-23 meses
10 2020-01-22 Nhanguzue F 5-14 anos
# ℹ 20,929 more rows
There are many dates missing though—days when no children were admitted. To create a complete incidence table, we should use complete()
to insert missing dates and then fill in the missing values with 0:
<- mal %>%
mal_notif_count count(date_positive_test, name = "cases") %>%
complete(date_positive_test = seq.Date(min(date_positive_test),
max(date_positive_test),
by = "day"),
fill = list(cases = 0))
mal_notif_count
# A tibble: 406 × 2
date_positive_test cases
<date> <int>
1 2019-01-01 67
2 2019-01-02 120
3 2019-01-03 112
4 2019-01-04 203
5 2019-01-05 85
6 2019-01-06 0
7 2019-01-07 115
8 2019-01-08 196
9 2019-01-09 0
10 2019-01-10 89
# ℹ 396 more rows
Now we have a complete incidence table with the number of cases on 406 consecutive days.
We can now plot the data to see the overall trend:
# Create a basic epicurve using ggplot2
ggplot(mal_notif_count, aes(x = date_positive_test, y = cases)) +
geom_line()
We have a valid epicurve, but as you may notice, the daily variability makes it hard to see the overall trend. Let’s smooth things out.
geom_smooth()
One option for smoothing is the geom_smooth()
function, which can perform local regression with loess
to smooth out the time series. Let’s try it out:
ggplot(mal_notif_count, aes(x = date_positive_test, y = cases)) +
geom_smooth()
# Or we can specify the method explicitly
ggplot(mal_notif_count, aes(x = date_positive_test, y = cases)) +
geom_smooth(method = "loess")
The loess
methods, which stands for locally weighted scatterplot smoothing, fits a smooth curve to the data by calculating weighted averages for nearby points.
You can adjust the sensitivity of the smoothing by modifying the span
argument. A span of 0.1 will result in a more sensitive smoothing, while a span of 0.9 will result in a less sensitive smoothing.
# Adjust the sensitivity of the smoothing
ggplot(mal_notif_count, aes(x = date_positive_test, y = cases)) +
geom_smooth(method = "loess", span = 0.1)
ggplot(mal_notif_count, aes(x = date_positive_test, y = cases)) +
geom_smooth(method = "loess", span = 0.9)
Another way to smooth data is by aggregating it—grouping the data into larger time intervals and calculating summary statistics for each interval.
We already saw this at the start of the lesson, when we aggregated quarterly data to yearly data.
Let’s apply it again, this time aggregating daily malaria incidence to monthly incidence. To do this, we employ the floor_date()
function from the lubridate
package to round the dates down to the nearest month:
%>%
mal_notif_count mutate(month = floor_date(date_positive_test, unit = "month"))
# A tibble: 406 × 3
date_positive_test cases month
<date> <int> <date>
1 2019-01-01 67 2019-01-01
2 2019-01-02 120 2019-01-01
3 2019-01-03 112 2019-01-01
4 2019-01-04 203 2019-01-01
5 2019-01-05 85 2019-01-01
6 2019-01-06 0 2019-01-01
7 2019-01-07 115 2019-01-01
8 2019-01-08 196 2019-01-01
9 2019-01-09 0 2019-01-01
10 2019-01-10 89 2019-01-01
# ℹ 396 more rows
We can then use group_by()
and summarize()
to calculate the total number of cases per month:
<-
mal_monthly %>%
mal_notif_count mutate(month = floor_date(date_positive_test, unit = "month")) %>%
group_by(month) %>%
summarize(cases = sum(cases))
This gives us a monthly incidence table, which we can plot to see the overall trend:
ggplot(mal_monthly) +
geom_line(aes(x = month, y = cases))
Voila! A much clearer picture.
Consider this dataset of individuals who died from HIV in Colombia between 2010 and 2016, sourced from this URL.
<-
colom_hiv_deaths read_csv(here("data/colombia_hiv_deaths_2010_to_2016.csv")) %>%
mutate(date_death = ymd(paste(death_year, death_month, death_day, sep = "-")))
<- colom_hiv_deaths %>%
colom_mon count(date_death, name = "cases") %>%
complete(date_death = seq.Date(min(date_death),
max(date_death),
by = "month"),
fill = list(cases = 0)) %>%
mutate(month = floor_date(date_death, unit = "month")) %>%
group_by(month) %>%
summarize(cases = sum(cases))
ggplot(colom_mon, aes(x = month, y = cases)) +
geom_smooth(method = "loess", span = 0.1) +
scale_x_date(date_breaks = "12 months", date_labels = "%b %Y")
Using the steps taught above:
geom_smooth
to the epicurve for a smoother visualization. Ensure you choose an appropriate span for smoothing.Another technique to smooth noisy time series data is to calculate rolling averages. This takes the average of a fixed number of points, centered around each data point.
The rollmean()
function from the {zoo} package will be your primary work-horse for calculating rolling means.
Let’s apply a 14 day rolling average to smooth our daily malaria case data:
<- mal_notif_count %>%
mal_notif_count mutate(roll_cases = rollmean(cases, k = 14, fill = NA))
mal_notif_count
# A tibble: 406 × 3
date_positive_test cases roll_cases
<date> <int> <dbl>
1 2019-01-01 67 NA
2 2019-01-02 120 NA
3 2019-01-03 112 NA
4 2019-01-04 203 NA
5 2019-01-05 85 NA
6 2019-01-06 0 NA
7 2019-01-07 115 83.4
8 2019-01-08 196 82.9
9 2019-01-09 0 79.3
10 2019-01-10 89 82.1
# ℹ 396 more rows
The key arguments are:
x
: The time series to smoothk
: The number of points before and after to averagefill
: How to handle missing data within each windowThis calculates the 14-day moving average, leaving missing data as NA. Notice that the first 6 days are NA, since there are not enough points to average over (with a k
of 14, we need 7 days before and 7 days after each point to calculate the rolling average.
Let’s plot it:
%>%
mal_notif_count ggplot(aes(x = date_positive_test, y = cases)) +
geom_line(color = "gray80")
Commonly, you will be asked to plot a rolling average of the past 1 or 2 weeks. For this, you must set the align
argument to "right"
:
<-
mal_notif_count_right %>%
mal_notif_count mutate(roll_cases = rollmean(cases, k = 14, fill = NA, align = "right"))
head(mal_notif_count_right, 15)
# A tibble: 15 × 3
date_positive_test cases roll_cases
<date> <int> <dbl>
1 2019-01-01 67 NA
2 2019-01-02 120 NA
3 2019-01-03 112 NA
4 2019-01-04 203 NA
5 2019-01-05 85 NA
6 2019-01-06 0 NA
7 2019-01-07 115 NA
8 2019-01-08 196 NA
9 2019-01-09 0 NA
10 2019-01-10 89 NA
11 2019-01-11 55 NA
12 2019-01-12 69 NA
13 2019-01-13 0 NA
14 2019-01-14 57 83.4
15 2019-01-15 60 82.9
Notice that now the first 13 days are NA, since there are not enough points to average over. This is because we are calculating the average of the past 14 days, and the first 13 days do not have 14 days before them.
The output does not change much in this case:
ggplot(mal_notif_count_right, aes(x = date_positive_test, y = cases)) +
geom_line(color = "gray80") +
geom_line(aes(y = roll_cases), color = "red")
Finally, let’s plot both the original and smoothed data:
%>%
mal_notif_count ggplot(aes(x = date_positive_test, y = cases)) +
geom_line(color = "gray80") +
geom_line(aes(y = roll_cases), color = "red")
In summary, the rollmean()
function lets us easily compute a rolling average over a fixed window to smooth and highlight patterns in noisy time series data.
Consider again the dataset of HIV patient deaths in Colombia:
colom_hiv_deaths
# A tibble: 445 × 26
municipality_type death_location birth_date birth_year birth_month birth_day
<chr> <chr> <date> <dbl> <chr> <dbl>
1 Municipal head Hospital/clinic 1956-05-26 1956 May 26
2 Municipal head Hospital/clinic 1983-10-10 1983 Oct 10
3 Municipal head Hospital/clinic 1967-11-22 1967 Nov 22
4 Municipal head Home/address 1964-03-14 1964 Mar 14
5 Municipal head Hospital/clinic 1960-06-27 1960 Jun 27
6 Municipal head Hospital/clinic 1982-03-23 1982 Mar 23
7 Municipal head Hospital/clinic 1964-12-09 1964 Dec 9
8 Municipal head Hospital/clinic 1975-01-15 1975 Jan 15
9 Municipal head Hospital/clinic 1988-02-15 1988 Feb 15
10 Municipal head Hospital/clinic NA NA <NA> NA
# ℹ 435 more rows
# ℹ 20 more variables: death_year <dbl>, death_month <chr>, death_day <dbl>,
# age_at_death <dbl>, gender <chr>, education_level <chr>, occupation <chr>,
# racial_id <chr>, health_insurance_status <chr>, marital_status <chr>,
# municipality_name <chr>, primary_cause_death_description <chr>,
# primary_cause_death_code <chr>, secondary_cause_death_description <chr>,
# secondary_cause_death_code <chr>, tertiary_cause_death_description <chr>, …
The following code calculates the number of deaths per day:
<-
colom_hiv_deaths_per_day %>%
colom_hiv_deaths group_by(date_death) %>%
summarize(deaths = n()) %>%
complete(date_death = seq.Date(min(date_death),
max(date_death),
by = "day"),
fill = list(deaths = 0))
%>%
colom_hiv_deaths_per_day mutate(roll_cases = rollmean(x = deaths, k = 14, fill = NA)) %>%
ggplot(aes(x = date_death, y = deaths)) +
geom_line(color = "gray80") +
geom_line(aes(y = roll_cases), color = "red")
Your task is to create a new column that calculates the rolling average of deaths per day over a 14-day period. Then, plot this rolling average alongside the raw data.
A secondary y-axis is helpful when visualizing two different measures with distinct scales on the same plot. This approach is useful when the variables have different units or magnitudes, making direct comparison on a single scale challenging.
While some data visualization experts caution against using secondary axes, public health decision-makers often appreciate these plots.
Let’s demonstrate how to create a plot with a secondary y-axis using our dataset of malaria notifications:
Step 1: Create Cumulative Case Counts
First, we’ll aggregate our malaria data to calculate cumulative case counts.
<-
mal_notif_count_cumul %>%
mal_notif_count mutate(cumul_cases = cumsum(cases))
mal_notif_count_cumul
# A tibble: 406 × 4
date_positive_test cases roll_cases cumul_cases
<date> <int> <dbl> <int>
1 2019-01-01 67 NA 67
2 2019-01-02 120 NA 187
3 2019-01-03 112 NA 299
4 2019-01-04 203 NA 502
5 2019-01-05 85 NA 587
6 2019-01-06 0 NA 587
7 2019-01-07 115 83.4 702
8 2019-01-08 196 82.9 898
9 2019-01-09 0 79.3 898
10 2019-01-10 89 82.1 987
# ℹ 396 more rows
Step 2: Identifying the Need for a Secondary Y-Axis
Now, we can start plotting. First, we plot just the daily cases:
# Plotting total malaria cases
ggplot(mal_notif_count_cumul, aes(x = date_positive_test)) +
geom_line(aes(y = cases))
If we try to add cumulative cases on the same y-axis, the daily cases will be dwarfed and their magnitude will hard to read due to the much larger scale of cumulative data:
# Adding cumulative malaria cases to the plot
ggplot(mal_notif_count_cumul, aes(x = date_positive_test)) +
geom_line(aes(y = cases)) +
geom_line(aes(y = cumul_cases), color = "red")
To effectively display both sets of data, we must introduce a secondary y-axis.
Step 3: Calculating and Applying the Scale Factor
Before adding a secondary axis, we need to determine a scale factor by comparing the ranges of cases and cumulative cases.
The scale factor is typically the ratio of the maximum values of the two datasets. Let’s see what the maximum values are for each variable:
max(mal_notif_count_cumul$cases)
[1] 419
max(mal_notif_count_cumul$cumul_cases)
[1] 20939
With a maximum or around 20000 for the cumulative cases, and about 400 for the daily cases, we can see that the cumulative cases are about 50 times larger than the daily cases, so our scale factor will be about 50.
More precisely, the scale factor will be:
max(mal_notif_count_cumul$cumul_cases) / max(mal_notif_count_cumul$cases)
[1] 49.97375
We’ll need to divide the cumulative cases by this ratio to force the two variables to be on a similar scale:
ggplot(mal_notif_count_cumul, aes(x = date_positive_test)) +
geom_line(aes(y = cases)) +
geom_line(aes(y = cumul_cases / 49.97), color = "red") # divide by scale factor
Great! Now we can see both sets of data clearly on the same plot, and their maximum points are aligned. However, the y-axis is no longer relevant for the red cumulative cases line. We need to add a secondary y-axis for this.
Normally, you would assign the scale factor to a variable, and then use that variable in the geom_line()
function. In this case, we’re typing it out directly in the function for easier understanding.
Step 4: Adding the Secondary Y-Axis
We’ll use the sec_axis()
function from {ggplot2}. The key arguments are trans
, which indicates how much to multiply or divide the original y axis, and name
, which specifies the name of the secondary axis.
In our case, we want the secondary axis to be about 49.97 times larger than the original axis, so we’ll use trans = ~ .x * 49.97
. (The ~
symbol is a special operator that tells R to treat the expression that follows it as a function, whose input is indicated by the .x
symbol.)
Let’s implement this:
# Add a secondary y-axis
ggplot(mal_notif_count_cumul, aes(x = date_positive_test)) +
geom_line(aes(y = cases)) +
geom_line(aes(y = cumul_cases / 49.97), color = "red") +
scale_y_continuous(sec.axis = sec_axis(trans = ~ .x * 49.97,
name = "Cumulative Cases"))
Step 5: Enhancing Plot Readability
To improve readability, we’ll make the secondary axis labels red, matching the color of the cumulative cases line, and we’ll add some additional formatting to the plot:
# Finalizing the plot with color-coordinated axes
ggplot(mal_notif_count_cumul, aes(x = date_positive_test)) +
geom_line(aes(y = cases)) +
geom_line(aes(y = cumul_cases / 49.97), color = "red") +
scale_y_continuous(
name = "Daily Cases",
sec.axis = sec_axis(~ . * 49.97, name = "Cumulative Cases")
+
) labs(title = "Malaria Cases in Sussundenga Municipality",
subtitle = "Daily and Cumulative Cases",
x = NULL) +
theme_economist() +
theme(axis.text.y.right = element_text(color = "red"),
axis.title.y.right = element_text(color = "red"))
All done! We’ve successfully added a secondary y-axis to a plot, enabling the comparison of two datasets with different scales in a single visualization.
Revisit the dataset colom_hiv_deaths_per_day
.
<- colom_hiv_deaths_per_day %>%
colom_hiv_cum_per_day mutate(cumul_deaths = cumsum(deaths))
# Calculate scale factor as larger value divided by smaller value
<- max(colom_hiv_cum_per_day$cumul_deaths)/max(colom_hiv_cum_per_day$deaths)
fact
ggplot(colom_hiv_cum_per_day, aes(x = date_death)) +
geom_line(aes(y = deaths)) +
geom_line(aes(y = cumul_deaths / fact), color = "purple") +
scale_y_continuous(name = "Daily Death", sec.axis = sec_axis(trans = ~ .x * fact, name = "Cumulative Death")) +
labs(title = "HIV Death Rates in Colombia",
subtitle = "Daily and Cumulative Cases",
x = NULL) +
theme_economist() +
theme(axis.text.y.right = element_text(color = "purple"),
axis.title.y.right = element_text(color = "purple"))
Your task is to create a plot with two y-axes: one for the daily deaths and another for the cumulative deaths in Colombia.
In this lesson, you developed key skills for wrangling, visualizing, and enhancing time series data to uncover and communicate meaningful trends over time. These skills will come in handy as you continue to explore and analyze time series data in R.
<- tb_data_benin %>%
tb_benin_long pivot_longer(cols = c("new_clindx", "new_labconf")) %>%
rename(type = name, cases = value)
ggplot(tb_benin_long, aes(x = year, y = cases, colour = type, group = type)) +
geom_line() +
geom_point() +
geom_text(aes(label = cases), size = 2.2, nudge_y = 100, color = "black")
<- tb_child_cases_southam %>%
even_years_southam filter(year %% 2 == 0) # Keep only years that are multiples of 2
<- tb_child_cases_southam %>%
odd_years_southam filter(year %% 2 == 1) # Keep only years that are not multiples of 2
%>%
tb_child_cases_southam ggplot(aes(x = year, y = tb_cases_children, color = country)) +
geom_line() +
geom_point() +
geom_text(data = even_years_southam, aes(label = tb_cases_children),
nudge_y = 100, size = 2.8) +
geom_text(data = odd_years_southam, aes(label = tb_cases_children),
nudge_y = -100, size = 2.8) +
scale_color_manual(values = c("#212738", "#F97068", "#067BC2")) +
labs(title = "Tuberculosis Notifications in Three South American Countries",
subtitle = "Child Cases, 1993-2022",
caption = "Source: World Health Organization",
x = "Year",
y = "Number of Cases",
color = "Country") +
theme_classic()
%>%
nig_ken_mal separate(malaria_cases,
into = c("cases", "cases_lower", "cases_upper"),
sep = "\\(|to") %>%
mutate(across(c("cases", "cases_lower", "cases_upper"),
~ str_replace_all(.x, "[^0-9]", "") %>%
as.numeric()
%>%
)) ggplot(aes(x = year, y = cases, color = country, fill = country)) +
geom_line() +
geom_ribbon(aes(ymin = cases_lower, ymax = cases_upper), alpha = 0.4)
<-
hiv_monthly_deaths_table %>%
colom_hiv_deaths # Aggregate data to count deaths per month
mutate(month = floor_date(date_death, unit = "month")) %>%
group_by(month) %>%
summarize(deaths = n())
# Create the epicurve
ggplot(hiv_monthly_deaths_table, aes(x = month, y = deaths)) +
# Apply smoothing to the curve
geom_smooth(method = "loess", span = 0.1) +
scale_x_date(date_breaks = "12 months", date_labels = "%b %Y")
%>%
colom_hiv_deaths_per_day mutate(roll_deaths = rollmean(deaths, k = 14, fill = NA)) %>%
ggplot(aes(x = date_death, y = deaths)) +
geom_line(color = "gray80") +
geom_line(aes(y = roll_deaths), color = "red")
# Step 1: Calculate cumulative deaths
<- colom_hiv_deaths_per_day %>%
colom_hiv_deaths_cumul mutate(cum_deaths = cumsum(deaths))
# Step 2: Plot daily deaths
ggplot(colom_hiv_deaths_cumul, aes(x = date_death)) +
geom_line(aes(y = deaths))
# Step 3: Calculate scale factor
<- max(colom_hiv_deaths_cumul$cum_deaths) / max(colom_hiv_deaths_cumul$deaths)
scale_factor
# Step 4: Add cumulative deaths to the plot
ggplot(colom_hiv_deaths_cumul, aes(x = date_death)) +
geom_line(aes(y = deaths)) +
geom_line(aes(y = cum_deaths / scale_factor), color = "red")
# Step 5: Add secondary y-axis
ggplot(colom_hiv_deaths_cumul, aes(x = date_death)) +
geom_line(aes(y = deaths)) +
geom_line(aes(y = cum_deaths / scale_factor), color = "red") +
scale_y_continuous(sec.axis = sec_axis(trans = ~ .x * scale_factor, name = "Cumulative Deaths")) +
theme(axis.text.y.right = element_text(color = "red"),
axis.title.y.right = element_text(color = "red"))
The following team members contributed to this lesson: