Factors are an important data class for representing and working with
categorical variables in R.
We will learn how to create factors and how to manipulate them with functions from the forcats package, a part of the tidyverse.
You understand what factors are and how they differ from characters in R.
You are able to modify the order of factor levels.
You are able to modify the value of factor levels.
We will use a dataset on HIV mortality in Colombia from 2010 to 2016.
It is hosted on the open data platform ‘Datos Abiertos Colombia.’
You can learn more here.
Each row corresponds to an individual who passed away from AIDS or AIDS-related complications.
## # A tibble: 445 × 25
## municipality_type death_location birth_date birth_year birth_month birth_day
## <chr> <chr> <date> <dbl> <chr> <dbl>
## 1 Municipal head Hospital/clinic 1956-05-26 1956 May 26
## 2 Municipal head Hospital/clinic 1983-10-10 1983 Oct 10
## 3 Municipal head Hospital/clinic 1967-11-22 1967 Nov 22
## 4 Municipal head Home/address 1964-03-14 1964 Mar 14
## 5 Municipal head Hospital/clinic 1960-06-27 1960 Jun 27
## 6 Municipal head Hospital/clinic 1982-03-23 1982 Mar 23
## 7 Municipal head Hospital/clinic 1964-12-09 1964 Dec 9
## 8 Municipal head Hospital/clinic 1975-01-15 1975 Jan 15
## 9 Municipal head Hospital/clinic 1988-02-15 1988 Feb 15
## 10 Municipal head Hospital/clinic NA NA <NA> NA
## # ℹ 435 more rows
## # ℹ 19 more variables: death_year <dbl>, death_month <chr>, death_day <dbl>,
## # age_at_death <dbl>, gender <chr>, education_level <chr>, occupation <chr>,
## # racial_id <chr>, health_insurance_status <chr>, marital_status <chr>,
## # municipality_name <chr>, primary_cause_death_description <chr>,
## # primary_cause_death_code <chr>, secondary_cause_death_description <chr>,
## # secondary_cause_death_code <chr>, tertiary_cause_death_description <chr>, …
More efficient storage in R
Some statistical functions, such as lm(), expect categorical variables to be supplied as factors.
Control over the order of categories or levels.
Problem: the x-axis arranges months alphabetically, not chronologically!
Solution: create a factor using the factor() function:
hiv_mort_modified <-
hiv_mort %>%
mutate(birth_month = factor(x = birth_month,
levels = c("Jan", "Feb", "Mar", "Apr", "May","Jun", "Jul", "Aug", "Sep", "Oct", "Nov","Dec")))
The x argument takes the original character column, birth_month.
The levels argument takes the desired sequence of months.
Inspecting the data type:
## [1] "factor"
## [1] "character"
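The same check can be tried on a small vector. The sketch below uses a hypothetical character vector of month abbreviations (not the hiv_mort data) and base R only; it shows the class before and after conversion, and how sorting follows the level order for the factor but alphabetical order for the character vector.

```r
# Toy month abbreviations stored as a character vector (hypothetical data)
months_chr <- c("Mar", "Jan", "Feb", "Jan")

# Convert to a factor with an explicit, chronological level order
months_fct <- factor(x = months_chr, levels = c("Jan", "Feb", "Mar"))

class(months_chr)   # "character"
class(months_fct)   # "factor"
levels(months_fct)  # "Jan" "Feb" "Mar"

# Sorting respects the level order for factors, alphabetical order for characters
sorted_fct <- as.character(sort(months_fct))  # "Jan" "Jan" "Feb" "Mar"
sorted_chr <- sort(months_chr)                # "Feb" "Jan" "Jan" "Mar"
```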
The new factor variable will respect the defined order in other contexts as well.
For example, compare how the count() function displays frequency tables:
## # A tibble: 13 × 2
## birth_month n
## <chr> <int>
## 1 Apr 33
## 2 Aug 33
## 3 Dec 39
## 4 Feb 28
## 5 Jan 35
## 6 Jul 33
## 7 Jun 42
## 8 Mar 31
## 9 May 30
## 10 Nov 31
## 11 Oct 42
## 12 Sep 33
## 13 <NA> 35
## # A tibble: 13 × 2
## birth_month n
## <fct> <int>
## 1 Jan 35
## 2 Feb 28
## 3 Mar 31
## 4 Apr 33
## 5 May 30
## 6 Jun 42
## 7 Jul 33
## 8 Aug 33
## 9 Sep 33
## 10 Oct 42
## 11 Nov 31
## 12 Dec 39
## 13 <NA> 35
You can also call factor() without the levels argument. In that case, the levels default to the sorted (alphabetical) order of the unique values:
## [1] "factor"
## [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"
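As a quick illustration of this default, here is a sketch on a hypothetical toy vector (base R only). The levels come out sorted alphabetically, regardless of the order in which the values appear.

```r
# Toy month abbreviations (hypothetical data)
months_chr <- c("Mar", "Jan", "Feb", "Jan")

# No levels argument: levels are the sorted unique values
default_fct <- factor(months_chr)

class(default_fct)   # "factor"
levels(default_fct)  # "Feb" "Jan" "Mar"
```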
Using the hiv_mort dataset, convert the gender variable to a factor with the levels “Female” and “Male”, in that order.
What errors are you able to spot in the following code chunk? What are the consequences of these errors?
What is one main advantage of using factors over characters for categorical data in R?
forcats
Factors are useful, but they can sometimes be a little tedious to manipulate.
forcats offers a set of functions that make factor manipulation simpler.
We’ll explore four functions, but there are many others. We encourage you to explore the forcats website on your own time here!
The fct_relevel() function is used to manually change the order of factor levels. For example, suppose we visualize the frequency of individuals by municipality type and want the “Populated center” level to appear first. We can do this with fct_relevel(). Here’s how:
hiv_mort_pop_center_first <-
hiv_mort %>%
mutate(municipality_type = fct_relevel(municipality_type,"Populated center"))
Syntax: we pass the factor variable as the first argument, and the level as the second argument.
Now when we plot, the “Populated center” level comes first.
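The change is also visible without a plot, by inspecting the levels directly. The sketch below uses a hypothetical toy factor that mirrors the three municipality types and assumes the forcats package is installed.

```r
library(forcats)

# Toy municipality types (hypothetical values mirroring the lesson's categories)
muni <- factor(c("Municipal head", "Scattered rural",
                 "Populated center", "Municipal head"))

levels(muni)
# "Municipal head" "Populated center" "Scattered rural"

# Move "Populated center" to the front; other levels keep their relative order
muni_first <- fct_relevel(muni, "Populated center")

levels(muni_first)
# "Populated center" "Municipal head" "Scattered rural"
```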
We can move the “Populated center” level to a different position with the after argument:
hiv_mort %>%
mutate(municipality_type = fct_relevel(municipality_type, "Populated center", after = 2)) %>%
# pipe directly into the plot to visualize the change
ggplot() +
geom_bar(aes(x = municipality_type))
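Here we specify the factor, the level to move, and the after argument, which defines the position. With after = 2, the level is inserted after the second of the remaining levels, so it lands in third position.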
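To see how after counts positions, here is a sketch on a hypothetical three-level toy factor (forcats assumed to be installed). The moved level is inserted after the given number of remaining levels, and after = Inf is a convenient way to send a level to the end.

```r
library(forcats)

# Toy factor with alphabetical default levels (hypothetical data)
muni <- factor(c("Municipal head", "Populated center", "Scattered rural"))

# Insert "Populated center" after the second of the remaining levels
moved_after_2 <- fct_relevel(muni, "Populated center", after = 2)
levels(moved_after_2)
# "Municipal head" "Scattered rural" "Populated center"

# after = Inf always moves the level to the last position
moved_last <- fct_relevel(muni, "Municipal head", after = Inf)
levels(moved_last)
# "Populated center" "Scattered rural" "Municipal head"
```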
We can also move multiple levels at a time by providing these levels to fct_relevel():
Below we arrange all the factor levels for municipality type in our desired order:
hiv_mort %>%
mutate(municipality_type = fct_relevel(municipality_type, "Scattered rural", "Populated center", "Municipal head")) %>%
ggplot() +
geom_bar(aes(x = municipality_type))
The same ordering can be set with factor() and its levels argument:
hiv_mort %>%
mutate(municipality_type = factor(municipality_type,
levels = c("Scattered rural", "Populated center", "Municipal head"))) %>%
ggplot() +
geom_bar(aes(x = municipality_type))
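When every level is listed, fct_relevel() and factor() with a levels argument should give the same level order. The sketch below checks this on a hypothetical toy vector (forcats assumed to be installed); the object names are illustrative only.

```r
library(forcats)

# Toy municipality types (hypothetical data)
muni_chr <- c("Municipal head", "Scattered rural",
              "Populated center", "Municipal head")
desired_order <- c("Scattered rural", "Populated center", "Municipal head")

# Route 1: start from a default factor and relevel with forcats
via_forcats <- fct_relevel(factor(muni_chr),
                           "Scattered rural", "Populated center", "Municipal head")

# Route 2: set the levels directly when creating the factor
via_base <- factor(muni_chr, levels = desired_order)

identical(levels(via_forcats), levels(via_base))  # TRUE
```

One practical difference: factor() turns any value missing from levels into NA, so fct_relevel() is the safer choice when only some levels need to move.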
Using the hiv_mort dataset, convert the death_location variable to a factor such that ‘Home/address’ is the first level. Then create a bar plot that shows the count of individuals in the dataset by death_location.
hiv_mort %>%
mutate(death_location = fct_relevel(death_location, "Home/address")) %>%
ggplot() +
geom_bar(aes(x = death_location))
fct_reorder() is used to reorder the levels of a factor based on the values of another variable.
To illustrate, let’s make a summary table with number of deaths, mean and median age at death for each municipality:
summary_per_muni <-
hiv_mort %>%
group_by(municipality_name) %>%
summarise(n_deceased = n(),
mean_age_death = mean(age_at_death, na.rm = TRUE),
med_age_death = median(age_at_death, na.rm = TRUE))
summary_per_muni
## # A tibble: 25 × 4
## municipality_name n_deceased mean_age_death med_age_death
## <chr> <int> <dbl> <dbl>
## 1 Aguadas 2 42 42
## 2 Anserma 15 37.4 37.5
## 3 Aranzazu 2 37.5 37.5
## 4 Belalcázar 4 38.8 41
## 5 Chinchiná 62 43.6 42.5
## 6 Filadelfia 5 42.6 43
## 7 La Dorada 46 41.0 41
## 8 La Merced 3 27 28
## 9 Manizales 199 41.0 41
## 10 Manzanares 3 38.3 34
## # ℹ 15 more rows
When plotting one of the variables, we may want to arrange the factor levels by that numeric variable. For example, to order municipality by the mean age column:
summary_per_muni_reordered <-
summary_per_muni %>%
mutate(municipality_name = fct_reorder(.f = municipality_name, .x = mean_age_death))
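To check what this does to the levels, here is a sketch on a three-row toy table. The municipality names and mean ages are taken from the summary table above (rounded as displayed); the object name toy_summary is hypothetical, and forcats is assumed to be installed.

```r
library(forcats)

# Three rows borrowed from the summary table above
toy_summary <- data.frame(
  municipality_name = c("Aguadas", "La Merced", "Manzanares"),
  mean_age_death = c(42, 27, 38.3)
)

# Reorder the municipality levels by ascending mean age at death
toy_summary$municipality_name <- fct_reorder(
  .f = toy_summary$municipality_name,
  .x = toy_summary$mean_age_death
)

levels(toy_summary$municipality_name)
# "La Merced" "Manzanares" "Aguadas"
```

The rows of the table stay where they are; only the level order changes, and that level order is what plots use to arrange the axis.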
The syntax is:
.f - the factor to reorder
.x - the numeric vector determining the new order
We can now plot a nicely arranged bar chart:
Starting with the summary_per_muni data frame, reorder the municipality (municipality_name) by the med_age_death column and plot the reordered bar chart.
summary_per_muni_reordered <-
summary_per_muni %>%
mutate(municipality_name = fct_reorder(.f = municipality_name, .x = med_age_death))
ggplot(summary_per_muni_reordered) +
geom_col(aes(y = municipality_name, x = med_age_death))
.fun argument
Sometimes we want the categories in our plot to appear in a specific order that is determined by a summary statistic. For example, consider the box plot of birth_year by marital_status:
The boxplot displays the median birth_year for each category of marital status as a line in the middle of each box. We might want to arrange the marital_status categories in order of these medians. But if we create a summary table with medians, like we did before with summary_per_muni, we can’t create a box plot with it (go look at the summary_per_muni data frame to verify this yourself).
This is where the .fun argument of fct_reorder() comes in. The .fun argument allows us to specify a summary function that will be used to calculate the new order of the levels:
hiv_mort_arranged_marital <-
hiv_mort %>%
mutate(marital_status = fct_reorder(.f = marital_status, .x = birth_year, .fun = median, na.rm = TRUE))
In this code, we are reordering the marital_status factor based on the median of birth_year. We include the argument na.rm = TRUE to ignore NA values when calculating the median.
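Here is a sketch of the same idea on hypothetical toy vectors, with several observations per category and one missing birth year (forcats assumed to be installed). The levels end up sorted by each group's median.

```r
library(forcats)

# Toy data: two observations per marital status, one NA (hypothetical values)
status     <- c("single", "single", "married", "married", "cohabiting", "cohabiting")
birth_year <- c(1970, 1980, 1950, NA, 1990, 1985)

# Group medians (ignoring NA): married = 1950, single = 1975, cohabiting = 1987.5
status_by_median <- fct_reorder(.f = status, .x = birth_year,
                                .fun = median, na.rm = TRUE)

levels(status_by_median)
# "married" "single" "cohabiting"
```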
Now, when we create our box plot, the marital_status categories are ordered by the median birth_year:
We can see that individuals with the marital status “cohabiting” tend to be the youngest (they were born in the latest years).
Using the hiv_mort dataset, make a boxplot of birth_year by health_insurance_status, where the health_insurance_status categories are arranged by the median birth_year.
hiv_mort_arranged_insurance <-
hiv_mort %>%
mutate(health_insurance_status = fct_reorder(.f = health_insurance_status, .x = birth_year, .fun = median, na.rm = TRUE))
ggplot(hiv_mort_arranged_insurance, aes(y = health_insurance_status, x = birth_year)) +
geom_boxplot()
The fct_recode() function allows us to manually change the values of factor levels. This function can be especially helpful when you need to rename categories or when you want to merge multiple categories into one.
For example, we can rename ‘Municipal head’ to ‘City’ in the municipality_type variable:
hiv_mort_muni_recode <- hiv_mort %>%
mutate(municipality_type = fct_recode(municipality_type, "City" = "Municipal head"))
# View the change
levels(hiv_mort_muni_recode$municipality_type)
## [1] "City" "Populated center" "Scattered rural"
In the above code, fct_recode() takes two arguments: the factor variable you want to change (municipality_type), and the set of name-value pairs that define the recoding. The new level (“City”) is on the left of the equals sign, and the old level (“Municipal head”) is on the right.
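The new = old direction is easy to mix up, so here is a sketch on a hypothetical toy factor (forcats assumed to be installed). The renamed level keeps the position of the level it replaces.

```r
library(forcats)

# Toy municipality types (hypothetical data)
muni <- factor(c("Municipal head", "Scattered rural", "Populated center"))

# New name on the left, existing level on the right
muni_recoded <- fct_recode(muni, "City" = "Municipal head")

levels(muni_recoded)
# "City" "Populated center" "Scattered rural"
```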
fct_recode() is particularly useful for compressing multiple categories into fewer levels:
We can explore this using the education_level variable. Currently it has six categories:
## # A tibble: 6 × 2
## education_level n
## <chr> <int>
## 1 No information 88
## 2 None 22
## 3 Post-secondary 29
## 4 Preschool 3
## 5 Primary 187
## 6 Secondary 116
For simplicity, let’s group them into just three categories - primary & below, secondary & above and other:
hiv_mort_educ_simple <-
hiv_mort %>%
mutate(education_level = fct_recode(education_level,"primary & below" = "Primary",
"primary & below" = "Preschool",
"secondary & above" = "Secondary",
"secondary & above" = "Post-secondary",
"others" = "No information",
"others" = "None"))
This condenses the categories nicely:
## # A tibble: 3 × 2
## education_level n
## <fct> <int>
## 1 others 110
## 2 secondary & above 145
## 3 primary & below 190
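The level order in this table is not arbitrary: a merged level takes the position of the first old level it replaces, and the old levels were in alphabetical order. The sketch below shows this on a hypothetical toy factor with four of the education categories (forcats assumed to be installed).

```r
library(forcats)

# Toy education levels (hypothetical data)
educ <- factor(c("Primary", "Preschool", "Secondary", "None", "Primary"))
levels(educ)
# "None" "Preschool" "Primary" "Secondary"

# Map several old levels onto the same new level to merge them
educ_simple <- fct_recode(educ,
                          "primary & below" = "Primary",
                          "primary & below" = "Preschool",
                          "secondary & above" = "Secondary",
                          "others" = "None")

levels(educ_simple)
# "others" "primary & below" "secondary & above"

table(educ_simple)
# others: 1, primary & below: 3, secondary & above: 1
```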
For good measure, we can arrange the levels in a reasonable order, with “others” as the last level:
hiv_mort_educ_sorted <-
hiv_mort_educ_simple %>%
mutate(education_level = fct_relevel(education_level,
"primary & below",
"secondary & above",
"others"))
The levels now appear in the desired order:
## # A tibble: 3 × 2
## education_level n
## <fct> <int>
## 1 primary & below 190
## 2 secondary & above 145
## 3 others 110
Using the hiv_mort dataset, convert death_location to a factor.
Then use fct_recode() to rename ‘Public way’ in death_location to ‘Public place’. Plot the frequency counts of the updated variable.
hiv_mort_death_location <-
hiv_mort %>%
mutate(death_location = fct_recode(death_location,"Public place" = "Public way"))
count(hiv_mort_death_location, death_location)
## # A tibble: 4 × 2
## death_location n
## <fct> <int>
## 1 Health center/post 2
## 2 Home/address 57
## 3 Hospital/clinic 385
## 4 Public place 1
fct_recode vs case_when/if_else
You might question why we need fct_recode() when we can utilize case_when() or if_else() or even recode() to substitute specific values. The issue is that these other functions can disrupt your factor variable.
To illustrate, let’s say we choose to use if_else() to make a modification to the education_level variable of the hiv_mort_educ_sorted data frame.
As a quick reminder, the education_level variable is a factor with three levels, arranged in a specified order, with “primary & below” first and “others” last:
## # A tibble: 3 × 2
## education_level n
## <fct> <int>
## 1 primary & below 190
## 2 secondary & above 145
## 3 others 110
Say we wanted to replace the “others” with “other”, removing the “s”. We can write:
hiv_mort_educ_other <-
hiv_mort_educ_sorted %>%
mutate(education_level = if_else(education_level == "others",
"other", education_level))
After this operation, the variable is no longer a factor:
## [1] "character"
If we then create a table or plot, our order is disrupted and reverts to alphabetical order, with “other” as the first level:
## # A tibble: 3 × 2
## education_level n
## <chr> <int>
## 1 other 110
## 2 primary & below 190
## 3 secondary & above 145
However, if we had used fct_recode() for recoding, we wouldn’t face this issue:
hiv_mort_educ_other_fct <-
hiv_mort_educ_sorted %>%
mutate(education_level = fct_recode(education_level, "other" = "others"))
The variable remains a factor:
## [1] "factor"
And if we create a table or a plot, our order is preserved: primary, secondary, then other:
## # A tibble: 3 × 2
## education_level n
## <fct> <int>
## 1 primary & below 190
## 2 secondary & above 145
## 3 other 110
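The contrast between the two approaches can be reproduced on a small factor. The sketch below uses hypothetical toy data and assumes forcats and dplyr 1.1.0 or later are installed (older dplyr versions may raise a type error in if_else() instead of returning a character vector).

```r
library(dplyr)
library(forcats)

# Toy factor with a deliberate, non-alphabetical level order (hypothetical data)
educ <- factor(c("primary & below", "others", "secondary & above"),
               levels = c("primary & below", "secondary & above", "others"))

# if_else() combines the string "other" with the factor, giving a character vector
via_if_else <- if_else(educ == "others", "other", educ)
class(via_if_else)  # "character"

# fct_recode() renames the level in place and keeps the level order
via_recode <- fct_recode(educ, "other" = "others")
class(via_recode)   # "factor"
levels(via_recode)  # "primary & below" "secondary & above" "other"
```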
Sometimes, we have too many levels for a display table or plot, and we want to lump the least frequent levels into a single category, typically called ‘Other’.
This is where the convenience function fct_lump() comes in.
In the below example, we lump less frequent municipalities into ‘Other’, preserving just the top 5 most frequent municipalities:
hiv_mort_lump_muni <- hiv_mort %>%
mutate(municipality_name = fct_lump(municipality_name, n = 5))
ggplot(hiv_mort_lump_muni, aes(x = municipality_name)) +
geom_bar()
In the usage above, the parameter n = 5 means that the five most frequent municipalities are preserved, and the rest are lumped into ‘Other’.
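Here is a sketch of the lumping behaviour on a hypothetical toy factor with four municipalities, keeping only the top two (forcats assumed to be installed). The counts are made up for illustration.

```r
library(forcats)

# Toy municipality names with unequal frequencies (hypothetical counts)
muni <- factor(c(rep("Manizales", 5), rep("La Dorada", 3), "Aguadas", "Aranzazu"))

# Keep the 2 most frequent levels; the rest become "Other", placed last
muni_lumped <- fct_lump(muni, n = 2)

levels(muni_lumped)
# "La Dorada" "Manizales" "Other"

table(muni_lumped)
# La Dorada: 3, Manizales: 5, Other: 2
```

Note that the kept levels stay in their original (here alphabetical) order; they are not sorted by frequency.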
We can provide a custom name for the other category with the other_level argument. Below we use the name "Other municipalities".
hiv_mort_lump_muni_other_name <- hiv_mort %>%
mutate(municipality_name = fct_lump(municipality_name, n = 5,
other_level = "Other municipalities"))
ggplot(hiv_mort_lump_muni_other_name, aes(x = municipality_name)) +
geom_bar()
In this way, fct_lump() is a handy tool for condensing factors with many infrequent levels into a more manageable number of categories.
Starting with the hiv_mort dataset, use fct_lump() to create a bar chart with the frequency of the 10 most common occupations.
Lump the remaining occupations into an ‘Other’ category.
Put occupation on the y-axis, not the x-axis, to avoid label overlap.
hiv_mort_lump_occupation <- hiv_mort %>%
mutate(occupation = fct_lump(occupation, n = 10,
other_level = "Other occupation"))
ggplot(hiv_mort_lump_occupation, aes(y = occupation)) +
geom_bar()
Congrats on getting to the end. In this lesson, you learned about the factor data class and how to manipulate factors using functions such as fct_relevel(), fct_reorder(), fct_recode(), and fct_lump().
While these covered common tasks such as reordering, recoding, and collapsing levels, this introduction only scratches the surface of what’s possible with the forcats package. Do explore more on the forcats website.
Now that you understand the basics of working with factors, you are equipped to properly represent your categorical data in R for downstream analysis and visualization.
Errors:
Consequences:
Any rows with the values “May”, “Nov”, or “Aug” for death_month will be converted to NA in the new death_month variable. If you create plots, ggplot will drop these levels with only NA values.
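The NA behaviour described here can be reproduced in base R. In the sketch below, the month values and the faulty levels vector are hypothetical stand-ins (two misspelled levels and one missing level), not the code chunk from the question.

```r
# Toy death months (hypothetical data)
death_month <- c("Jan", "May", "Aug", "Nov")

# Hypothetical faulty levels: "may" and "Agu" are typos, and "Nov" is missing
bad_fct <- factor(death_month, levels = c("Jan", "may", "Agu"))

# Values that match no level silently become NA
bad_fct
sum(is.na(bad_fct))  # 3
```

R gives no warning in this situation, so it is worth comparing the count of NA values before and after the conversion.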
The other two statements are not true.
If you want to apply string operations like substr(), strsplit(), paste(), etc., it’s actually more straightforward to use character vectors than factors.
And while many statistical functions expect factors, not characters, for categorical predictors, this does not make them more “accurate”.
# Convert death_location to a factor with 'Home/address' as the first level
hiv_mort <- hiv_mort %>%
mutate(death_location = fct_relevel(death_location, "Home/address"))
# Create a bar plot of death_location
ggplot(hiv_mort, aes(x = death_location)) +
geom_bar() +
labs(title = "Count of Individuals by Death Location")
# Reorder municipality_name by med_age_death and plot
summary_per_muni <- summary_per_muni %>%
mutate(municipality_name = fct_reorder(municipality_name, med_age_death))
# Create a bar plot of median age at death by reordered municipality_name
# (geom_col() plots the values in the table; geom_bar() would only count rows)
ggplot(summary_per_muni, aes(x = med_age_death, y = municipality_name)) +
geom_col() +
labs(title = "Municipality by Median Age at Death")
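# Reorder health_insurance_status by median birth_year, then create the boxplot
hiv_mort %>%
mutate(health_insurance_status = fct_reorder(health_insurance_status, birth_year, .fun = median, na.rm = TRUE)) %>%
ggplot(aes(x = health_insurance_status, y = birth_year)) +
geom_boxplot() +
labs(title = "Boxplot of Birth Year by Health Insurance Status") +
coord_flip() # To display categories on the y-axis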
# Convert death_location to a factor and rename 'Public way' to 'Public place'
hiv_mort <- hiv_mort %>%
mutate(death_location = fct_recode(death_location, "Public place" = "Public way"))
# Plot frequency counts of the updated variable
ggplot(hiv_mort, aes(x = death_location)) +
geom_bar() +
labs(title = "Frequency Counts of Death Location")
# Create a bar chart with the frequency of the 10 most common occupations
hiv_mort <- hiv_mort %>%
mutate(occupation = fct_lump(occupation, n = 10, other_level = "Other"))
# Create the bar plot with occupation on the y-axis
ggplot(hiv_mort, aes(x = occupation)) +
geom_bar() +
labs(title = "Frequency of the 10 Most Common Occupations") +
coord_flip() # To display categories on the y-axis
The variables in the dataset are:
municipality: general municipal location of the patient [chr]
death_location: location where the patient died [chr]
birth_date: full date of birth, formatted “YYYY-MM-DD” [date]
birth_year: year when the patient was born [dbl]
birth_month: month when the patient was born [chr]
birth_day: day when the patient was born [dbl]
death_year: year when the patient died [dbl]
death_month: month when the patient died [chr]
death_day: day when the patient died [dbl]
gender: gender of the patient [chr]
education_level: highest level of education attained by patient [chr]
occupation: occupation of patient [chr]
racial_id: race of the patient [chr]
municipality_code: specific municipal location of the patient [chr]
primary_cause_death_description: primary cause of the patient’s death [chr]
primary_cause_death_code: code of the primary cause of death [chr]
secondary_cause_death_description: secondary cause of the patient’s death [chr]
secondary_cause_death_code: code of the secondary cause of death [chr]
tertiary_cause_death_description: tertiary cause of the patient’s death [chr]
tertiary_cause_death_code: code of the tertiary cause of death [chr]
quaternary_cause_death_description: quaternary cause of the patient’s death [chr]
quaternary_cause_death_code: code of the quaternary cause of death [chr]
The following team members contributed to this lesson: