Code
# Load packages
::p_load(tidyverse, gt, here) pacman
Tables are a powerful tool for visualizing data in a clear and concise format. With R and the gt package, we can leverage the visual appeal of tables to efficiently communicate key information. In this lesson, we will learn how to build aesthetically pleasing, customizable tables that support data analysis goals.
Use the gt()
function to create basic tables: Learn to initiate table creation in R using gt()
for data display.
Group columns under spanner headings: Master grouping related columns under shared headings for enhanced table organization.
Add stubs: Implement stubs to provide row labels or descriptions in tables for better clarity.
Relabel column names: Learn to rename column labels for improved readability and context in tables.
Add summary rows for groups: Understand adding aggregate data rows like sums or averages per group for comprehensive analysis.
By the end, you will be able to generate polished, reproducible tables like this:
We will use these packages:
{gt}
for creating tables{tidyverse}
for data wrangling{here}
for file paths# Load packages
::p_load(tidyverse, gt, here) pacman
Our data comes from the Malawi HIV Program and covers antenatal care and HIV treatment during 2019. We will focus on quarterly regional and facility-level aggregates (available here).
# Import data
<- read_csv(here::here("data/clean/hiv_malawi.csv")) hiv_malawi
Let’s explore the variables:
# First 6 rows
head(hiv_malawi)
# Variable names and types
glimpse(hiv_malawi)
Rows: 17,235
Columns: 29
$ region <chr> "Northern Region", "Northern Re…
$ zone <chr> "Northern Zone", "Northern Zone…
$ district <chr> "Chitipa", "Chitipa", "Chitipa"…
$ traditional_authority <chr> "Senior TA Bulambya Songwe", "S…
$ facility_name <chr> "Kapenda Health Centre", "Kapen…
$ datim_code <chr> "K9u9BIAaJJT", "K9u9BIAaJJT", "…
$ system <chr> "e-mastercard", "e-mastercard",…
$ hsector <chr> "Public", "Public", "Public", "…
$ period <chr> "2019 Q1", "2019 Q1", "2019 Q1"…
$ reporting_period <chr> "1st month of quarter", "1st mo…
$ sub_groups <chr> "All patients (checked data)", …
$ new_women_registered <dbl> 45, NA, 40, NA, 43, NA, 32, NA,…
$ total_women_in_booking_cohort <dbl> NA, 55, NA, 44, NA, 37, NA, 43,…
$ not_tested_for_syphilis <dbl> NA, 45, NA, 19, NA, 5, NA, 43, …
$ syphilis_negative <dbl> NA, 10, NA, 25, NA, 32, NA, 0, …
$ syphilis_positive <dbl> NA, 0, NA, 0, NA, 0, NA, 0, NA,…
$ hiv_status_not_ascertained <dbl> 4, 7, 9, 4, 9, 5, 3, 3, 4, 26, …
$ previous_negative <dbl> 0, 0, 0, 0, 0, 1, 3, 5, 1, 3, 0…
$ previous_positive <dbl> 0, 0, 0, 1, 1, 0, 1, 1, 1, 3, 2…
$ new_negative <dbl> 40, 47, 30, 38, 33, 31, 25, 34,…
$ new_positive <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0…
$ not_on_cpt <dbl> NA, 0, NA, 0, NA, 0, NA, 0, NA,…
$ on_cpt <dbl> NA, 1, NA, 2, NA, 0, NA, 1, NA,…
$ no_ar_vs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ already_on_art_when_starting_anc <dbl> 0, 1, 0, 1, 1, 0, 1, 1, 1, 3, 2…
$ started_art_at_0_27_weeks_of_pregnancy <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0…
$ started_art_at_28_weeks_of_preg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ no_ar_vs_dispensed_for_infant <dbl> NA, 0, NA, 0, NA, 0, NA, 0, NA,…
$ ar_vs_dispensed_for_infant <dbl> NA, 1, NA, 2, NA, 0, NA, 1, NA,…
The data covers geographic regions, healthcare facilities, time periods, patient demographics, test results, preventive therapies, antiretroviral drugs, and more. More information about the dataset is in the appendix section.
The key variables we will be considering are:
region
: The geographical region or area where the data was collected or is being analyzed.
period
: A specific time period associated with the data, often used for temporal analysis.
previous_negative
: The number of patients who visited the healthcare facility in that quarter that had prior negative HIV tests.
previous_positive
: The number of patients (as above) with prior positive HIV tests.
new_negative
: The number of patients newly testing negative for HIV.
new_positive
: The number of patients newly testing positive for HIV.
In this lesson, we will aggregate the data by quarter and summarize changes in HIV test results.
{gt}
{gt}
’s flexibility, efficiency, and power make it a formidable package for creating tables in R. We’ll explore some of it’s core features in this lesson.
The {gt}
package contains a set of functions that take raw data as input and output a nicely formatted table for further analysis and reporting.
To effectively leverage the {gt}
package, we first need to wrangle our data into an appropriate summarized form.
In the code chunk below, we use {dplyr} functions to summarize HIV testing in select Malawi testing centers by quarter. We first group the data by time period, then sum case counts across multiple variables using across()
:
# Variables to summarize
<- c("new_positive", "previous_positive", "new_negative", "previous_negative")
cols
# Create summary by quarter
<- hiv_malawi %>%
hiv_malawi_summary group_by(period) %>%
summarize(
across(all_of(cols), sum) # Summarize all columns
)
hiv_malawi_summary
This aggregates the data nicely for passing to {gt}
to generate a clean summary table.
To create a simple table from the aggregated data, we can then call the gt()
function:
%>%
hiv_malawi_summary gt()
period | new_positive | previous_positive | new_negative | previous_negative |
---|---|---|---|---|
2019 Q1 | 6199 | 14816 | 284694 | 6595 |
2019 Q2 | 6132 | 15101 | 282249 | 5605 |
2019 Q3 | 5907 | 15799 | 300529 | 6491 |
2019 Q4 | 5646 | 15700 | 291622 | 6293 |
As you can see, the default table formatting is quite plain and unrefined. However, {gt}
provides many options to customize and beautify the table output. We’ll delve into these in the next section.
{gt}
tablesThe {gt} package allows full customization of tables through its “grammar of tables” framework. This is similar to how {ggplot2}’s grammar of graphics works for plotting.
To take full advantage of {gt}, it helps to understand some key components of its grammar.
As seen in the figure from the package website, The main components of a {gt} table are:
Table Header: Contains an optional title and subtitle
Stub: Row labels that identify each row
Stub Head: Optional grouping and labels for stub rows
Column Labels: Headers for each column
Table Body: The main data cells of the table
Table Footer: Optional footnotes and source notes
Understanding this anatomy allows us to systematically construct {gt} tables using its grammar.
The stub is the left section of a table containing the row labels. These provide context for each row’s data.
This image displays the stub component of a {gt}
table, marked with a red square.
In our HIV case table, the period
column holds the row labels we want to use. To generate a stub, we specify this column in gt()
using the rowname_col
argument:
%>%
hiv_malawi_summary gt(rowname_col = "period") %>%
tab_header(
title = "HIV Testing in Malawi",
subtitle = "Q1 to Q2 2019"
%>%
) tab_source_note("Source: Malawi HIV Program")
HIV Testing in Malawi | ||||
---|---|---|---|---|
Q1 to Q2 2019 | ||||
new_positive | previous_positive | new_negative | previous_negative | |
2019 Q1 | 6199 | 14816 | 284694 | 6595 |
2019 Q2 | 6132 | 15101 | 282249 | 5605 |
2019 Q3 | 5907 | 15799 | 300529 | 6491 |
2019 Q4 | 5646 | 15700 | 291622 | 6293 |
Source: Malawi HIV Program |
Note that the column name passed to rowname_col
should be in quotes.
For convenience, let’s save the table to a variable t1
:
<- hiv_malawi_summary %>%
t1 gt(rowname_col = "period") %>%
tab_header(
title = "HIV Testing in Malawi",
subtitle = "Q1 to Q2 2019"
%>%
) tab_source_note("Source: Malawi HIV Program")
t1
HIV Testing in Malawi | ||||
---|---|---|---|---|
Q1 to Q2 2019 | ||||
new_positive | previous_positive | new_negative | previous_negative | |
2019 Q1 | 6199 | 14816 | 284694 | 6595 |
2019 Q2 | 6132 | 15101 | 282249 | 5605 |
2019 Q3 | 5907 | 15799 | 300529 | 6491 |
2019 Q4 | 5646 | 15700 | 291622 | 6293 |
Source: Malawi HIV Program |
To better structure our table, we can group related columns under “spanners”. Spanners are headings that span multiple columns, providing a higher-level categorical organization. We can do this with the tab_spanner()
function.
Let’s create two spanner columns for new and Previous tests. We’ll start with the “New tests” spanner so you can observe the syntax:
%>%
t1 tab_spanner(
label = "New tests",
columns = starts_with("new") # selects columns starting with "new"
)
HIV Testing in Malawi | ||||
---|---|---|---|---|
Q1 to Q2 2019 | ||||
New tests
|
previous_positive | previous_negative | ||
new_positive | new_negative | |||
2019 Q1 | 6199 | 284694 | 14816 | 6595 |
2019 Q2 | 6132 | 282249 | 15101 | 5605 |
2019 Q3 | 5907 | 300529 | 15799 | 6491 |
2019 Q4 | 5646 | 291622 | 15700 | 6293 |
Source: Malawi HIV Program |
The columns
argument lets us select the relevant columns, and the label
argument takes in the span label.
Let’s now add both spanners:
# Save table to t2 for easy access
<- t1 %>%
t2 # First spanner for "New tests"
tab_spanner(
label = "New tests",
columns = starts_with("new")
%>%
) # Second spanner for "Previous tests"
tab_spanner(
label = "Previous tests",
columns = starts_with("prev")
)
t2
HIV Testing in Malawi | ||||
---|---|---|---|---|
Q1 to Q2 2019 | ||||
New tests
|
Previous tests
|
|||
new_positive | new_negative | previous_positive | previous_negative | |
2019 Q1 | 6199 | 284694 | 14816 | 6595 |
2019 Q2 | 6132 | 282249 | 15101 | 5605 |
2019 Q3 | 5907 | 300529 | 15799 | 6491 |
2019 Q4 | 5646 | 291622 | 15700 | 6293 |
Source: Malawi HIV Program |
Note that the tab_spanner
function automatically rearranged the columns in an appropriate way.
Question 1: The Purpose of Spanners
What is the purpose of using “spanner columns” in a {gt}
table?
A. To apply custom CSS styles to specific columns.
B. To create group columns and increase readability.
C. To format the font size of all columns uniformly.
D. To sort the data in ascending order.
Question 2: Spanner Creation
Using the hiv_malawi
data frame, create a {gt}
table that displays a summary of the sum
of new_positive
and previous_positive
cases for each region. Create spanner headers to label these two summary columns. To achieve this, fill in the missing parts of the code below:
<- hiv_malawi %>%
region_summary group_by(region) %>%
summarize(
_________(
c(new_positive, previous_positive),
______
)
)
# Create a gt table with spanner headers
<- region_summary %>%
summary_table_spanners %>%
_____________ ___________(
label = "Positive cases",
________ = c(new_positive, previous_positive)
)
The column names currently contain unneeded prefixes like “new_” and “previous_”. For better readability, we can rename these using cols_label()
.
cols_label()
takes a set of old names to match (on the left side of a tilde, ~
) and new names to replace them with (on the right side of the tilde). We can use contains()
to select columns with “positive” or “negative”:
<- t2 %>%
t3 cols_label(
contains("positive") ~ "Positive",
contains("negative") ~ "Negative"
)
t3
HIV Testing in Malawi | ||||
---|---|---|---|---|
Q1 to Q2 2019 | ||||
New tests
|
Previous tests
|
|||
Positive | Negative | Positive | Negative | |
2019 Q1 | 6199 | 284694 | 14816 | 6595 |
2019 Q2 | 6132 | 282249 | 15101 | 5605 |
2019 Q3 | 5907 | 300529 | 15799 | 6491 |
2019 Q4 | 5646 | 291622 | 15700 | 6293 |
Source: Malawi HIV Program |
This relabels the columns in a cleaner way.
cols_label()
accepts several column selection helpers like contains()
, starts_with()
, ends_with()
etc. These come from the tidyselect package and provide flexibility in renaming.
cols_label()
has more identification function like contains()
that work in a similar manner that are identical to the tidyselect
helpers, these also include :
starts_with()
: Starts with an exact prefix.
ends_with()
: Ends with an exact suffix.
contains()
: Contains a literal string.
matches()
: Matches a regular expression.
num_range()
: Matches a numerical range like x01, x02, x03.
These helpers are useful especially in the case of multiple columns selection.
More on the cols_label()
function can be found here: https://gt.rstudio.com/reference/cols_label.html
Question 3: Column labels
Which function is used to change the labels or names of columns in a {gt}
table?
A. `tab_header()`
B. `tab_style()`
C. `tab_options()`
D. `tab_relabel()`
Let’s take the same data we started with in the beginning of this lesson and instead of only grouping by period(quarters), let’s group by both period
and region
. We will do this to illustrate the power of summarization features in {gt}
: summary tables.
{gt} reminder - Summary Rows This image shows the summary rows component of a {gt}
table, clearly indicated within a red square. Summary rows, provide aggregated data or statistical summaries of the data contained in the corresponding columns.
First let’s recreate the data:
<- hiv_malawi %>%
summary_data_2 group_by(
# Note the order of the variables we group by.
region,
period%>%
) summarise(
across(all_of(cols), sum)
)
`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.
summary_data_2
The order in the group_by()
function affects the row groups in the {gt}
table.
Second, let’s re-incorporate all the changes we’ve done previously into this table:
# saving the progress to the t4 object
<- summary_data_2 %>%
t4 gt(rowname_col = "period", groupname_col = "region") %>%
tab_header(
title = "Sum of HIV Tests in Malawi",
subtitle = "from Q1 2019 to Q4 2019"
%>%
) tab_source_note("Data source: Malawi HIV Program") %>% tab_spanner(
label = "New tests",
columns = starts_with("new") # selects columns starting with "new"
%>%
) # creating the first spanner for the Previous tests
tab_spanner(
label = "Previous tests",
columns = starts_with("prev") # selects columns starting with "prev"
%>%
) cols_label(
# locate ### assign
contains("positive") ~ "Positive",
contains("negative") ~ "Negative"
)
t4
Sum of HIV Tests in Malawi | ||||
---|---|---|---|---|
from Q1 2019 to Q4 2019 | ||||
New tests
|
Previous tests
|
|||
Positive | Negative | Positive | Negative | |
Central Region | ||||
2019 Q1 | 2004 | 123018 | 3682 | 2562 |
2019 Q2 | 1913 | 116443 | 3603 | 1839 |
2019 Q3 | 1916 | 127799 | 4002 | 2645 |
2019 Q4 | 1691 | 124728 | 3754 | 1052 |
Northern Region | ||||
2019 Q1 | 664 | 36196 | 1197 | 675 |
2019 Q2 | 582 | 35315 | 1084 | 590 |
2019 Q3 | 570 | 36850 | 1191 | 542 |
2019 Q4 | 519 | 34322 | 1132 | 346 |
Southern Region | ||||
2019 Q1 | 3531 | 125480 | 9937 | 3358 |
2019 Q2 | 3637 | 130491 | 10414 | 3176 |
2019 Q3 | 3421 | 135880 | 10606 | 3304 |
2019 Q4 | 3436 | 132572 | 10814 | 4895 |
Data source: Malawi HIV Program |
Now, what if we want to visualize on the table a summary of each variable for every region group? More precisely we want to see the sum and the mean for the 4 columns we have for each region.
Remember that our 4 columns of interest are : new_positive
, previous_positive
, new_negative
, and previous_negative
. We only changed the labels of these columns in the {gt}
table and not in the data.frame itself, so we can use the names of these columns to tell {gt}
where to apply the summary function. Additionally, we already stored the names of these 4 columns in the object cols
so we will use it again here.
In order to achieve this we will use the handy function summary_rows
where we explicitly provide the columns that we want summarized, and the functions we want to summarize with, in our case it’s sum
and mean
. Note that we assign the name of the new row(unquoted) a function name (“quoted”).
<- t4 %>%
t5 summary_rows(
columns = cols, #using columns = 3:6 also works
fns = list(
TOTAL = "sum",
AVERAGE = "mean"
)
)
t5
Sum of HIV Tests in Malawi | ||||
---|---|---|---|---|
from Q1 2019 to Q4 2019 | ||||
New tests
|
Previous tests
|
|||
Positive | Negative | Positive | Negative | |
Central Region | ||||
2019 Q1 | 2004 | 123018 | 3682 | 2562 |
2019 Q2 | 1913 | 116443 | 3603 | 1839 |
2019 Q3 | 1916 | 127799 | 4002 | 2645 |
2019 Q4 | 1691 | 124728 | 3754 | 1052 |
sum | 7524.00 | 491988.00 | 15041.00 | 8098.00 |
mean | 1881.00 | 122997.00 | 3760.25 | 2024.50 |
Northern Region | ||||
2019 Q1 | 664 | 36196 | 1197 | 675 |
2019 Q2 | 582 | 35315 | 1084 | 590 |
2019 Q3 | 570 | 36850 | 1191 | 542 |
2019 Q4 | 519 | 34322 | 1132 | 346 |
sum | 2335.00 | 142683.00 | 4604.00 | 2153.00 |
mean | 583.75 | 35670.75 | 1151.00 | 538.25 |
Southern Region | ||||
2019 Q1 | 3531 | 125480 | 9937 | 3358 |
2019 Q2 | 3637 | 130491 | 10414 | 3176 |
2019 Q3 | 3421 | 135880 | 10606 | 3304 |
2019 Q4 | 3436 | 132572 | 10814 | 4895 |
sum | 14025.00 | 524423.00 | 41771.00 | 14733.00 |
mean | 3506.25 | 131105.75 | 10442.75 | 3683.25 |
Data source: Malawi HIV Program |
Question 4 : Summary Rows
What is the correct answer (or answers) if you had to summarize the standard deviation of the rows of columns new_positive
and previous_negative
only?
A. Use summary_rows()
with the columns
argument set to new_positive
and previous_negative
and fns
argument set to “sd”.
# Option A
%>%
your_data summary_rows(
columns = c("new_positive", "previous_negative"),
fns = "sd"
)
B. Use summary_rows()
with the columns
argument set to new_positive
and previous_negative
and fns
argument set to “summarize(sd)”.
# Option B
%>%
your_data summary_rows(
columns = c("new_positive", "previous_negative"),
fns = summarize(sd)
)
C. Use summary_rows()
with the columns
argument set to new_positive
and previous_negative
and fns
argument set to list(SD = "sd")
.
# Option C
%>%
your_data summary_rows(
columns = c("new_positive", "previous_negative"),
fns = list(SD = "sd")
)
D. Use summary_rows()
with the columns
argument set to new_positive
and previous_negative
and fns
argument set to “standard_deviation”.
# Option D
%>%
your_data summary_rows(
columns = c("new_positive", "previous_negative"),
fns = "standard_deviation"
)
In today’s lesson, we got down to business with data tables in R using {gt}
. We started by setting some clear goals, introduced the packages we’ll be using, and met our dataset. Then, we got our hands dirty by creating straightforward tables. We learned how to organize our data neatly using spanner columns and tweaking column labels to make things crystal clear and coherent. Then wrapped up with some nifty table summaries. These are the nuts and bolts of table-making in R and {gt}
, and they’ll be super handy as we continue our journey on creating engaging and informative tables in R.
# Solutions are where the numbered lines are
# summarize data first
<- hiv_malawi %>%
district_summary group_by(region) %>%
summarize(
across( #1
c(new_positive, previous_positive),
#2
sum
)
)
# Create a gt table with spanner headers
<- district_summary %>%
summary_table_spanners gt() %>% #3
tab_spanner( #4
label = "Positive cases",
columns = c(new_positive, previous_positive) #5
)
The following team members contributed to this lesson:
Tom Mock, “The Definite Cookbook of {gt}” (2021), The Mockup Blog, https://themockup.blog/static/resources/gt-cookbook.html#introduction.
Tom Mock, “The Grammar of Tables” (May 16, 2020), The Mockup Blog, https://themockup.blog/posts/2020-05-16-gt-a-grammar-of-tables/#add-titles.
RStudio, “Introduction to Creating gt Tables,” Official {gt} Documentation, https://gt.rstudio.com/articles/intro-creating-gt-tables.html.
Fleming, Jessica A., Alister Munthali, Bagrey Ngwira, John Kadzandira, Monica Jamili-Phiri, Justin R. Ortiz, Philipp Lambach, et al. 2019. “Maternal Immunization in Malawi: A Mixed Methods Study of Community Perceptions, Programmatic Considerations, and Recommendations for Future Planning.” Vaccine 37 (32): 4568–75. https://doi.org/10.1016/j.vaccine.2019.06.020.
This work is licensed under the Creative Commons Attribution Share Alike license.