Data cleaning is the process of transforming raw, “messy” data into reliable data that can be properly analyzed. This entails identifying inaccurate, incomplete, or improbable data points and resolving data inconsistencies or errors, as well as renaming variable names to make them more clear and simple to manipulate.
Data cleaning tasks can often be tedious and time-consuming. A common joke among data analysts goes “80% of data science is cleaning the data and the remaining 20% is complaining about cleaning the data.” But data cleaning is an essential step of the data analysis process. A little bit of cleaning at the start of the data analysis process will go a long way to improving the quality of your analyses and the ease with which these analyses can be done. And a range of packages and functions in R can significantly simplify the data cleaning process.
In this lesson, we will begin to look at a typical data cleaning pipeline in R. The cleaning steps covered here are unlikely to be exactly what is needed for your own datasets, but they will certainly be a good starting point.
Let’s get started!
You can diagnose dataset issues that warrant data cleaning through functions such as:
visdat::vis_dat()
inspectdf::inspect_cat()
inspectdf::inspect_num()
gtsummary::tbl_summary()
The packages loaded below will be required for this lesson:
The primary dataset comes from a study conducted in Zambezia, Mozambique.
It examines individual factors associated with the time until non-adherence to HIV treatment.
## New names:
## Rows: 1413 Columns: 15
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): Sex, Age_35, Education, Occupation, Civil...status dbl (9): patient_id,
## District, Health unit, Age at ART initiation, WHO statu... lgl (1): NA
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `Occupation...9` -> `Occupation`
visdat::vis_dat()
The vis_dat()
function from the visdat
package allows us to visualize data types and missing values.
Each row of the dataset is represented by a line on the plot.
Let’s create a mock dataset containing 8 patients and their COVID-19 diagnosis and recovery information.
covid_pat <- tribble(
~Patient_ID, ~Age, ~Gender, ~Covid_Strain, ~Exposure, ~Recovery_Days,
1, 25, "Male", "Alpha", NA, 10,
2, 32, "Female", "Delta", "Hospital", 15,
3, 45, "Male", "Beta", "Travel", 7,
4, 19, "Female", "Omicron", NA, 21,
5, 38, "Male", "Alpha", "Unknown", 14,
6, 55, "Female", NA, "Community", 19,
7, 28, "Female", "Omicron", NA, 8,
8, NA, "Female", "Omicron", "Travel", 26
)
covid_pat
## # A tibble: 8 × 6
## Patient_ID Age Gender Covid_Strain Exposure Recovery_Days
## <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 25 Male Alpha <NA> 10
## 2 2 32 Female Delta Hospital 15
## 3 3 45 Male Beta Travel 7
## 4 4 19 Female Omicron <NA> 21
## 5 5 38 Male Alpha Unknown 14
## 6 6 55 Female <NA> Community 19
## 7 7 28 Female Omicron <NA> 8
## 8 8 NA Female Omicron Travel 26
vis_dat()
function to visualize data
types and missing values.Different colors are used to illustrate string variables (pink), numeric variables (blue), and missing values (gray).
Now, let’s look at our real-world dataset, which is much larger and messy.
As a reminder, here is our non_adherence
dataset.
The NA
column is completely empty.
Several variables have a lot of missing values.
Some variable names are unclear/uncleaned.
(NOTE: Answers are at the bottom of the page. Try to answer the questions yourself before checking.)
The following dataset was adapted from a study that investigated missed opportunities for HIV testing among patients newly presenting for HIV care at a Swiss university hospital. The full dataset can be found here.
Use the vis_dat()
function to get a visual
representation of the data. What potential issues can you spot based on
the output?
## Rows: 201 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): sex, clage, origine, adquired, chronic, acute, latepresent, reason...
## dbl (4): cd4, cd4_category, numberconsult, missed_ops
## lgl (1): NaN.
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
inspectdf
‣ Using the inspectdf::inspect_cat()
and
inspectdf::inspect_num()
functions to get variable
summaries in rich graphical form
‣ inspect_cat()
: for a summary of categorical
variables
## # A tibble: 3 × 5
## col_name cnt common common_pcnt levels
## <chr> <int> <chr> <dbl> <named list>
## 1 Covid_Strain 5 Omicron 37.5 <tibble [5 × 3]>
## 2 Exposure 5 <NA> 37.5 <tibble [5 × 3]>
## 3 Gender 2 Female 62.5 <tibble [2 × 3]>
‣ Important information is hidden in the levels
column
‣ To see this, let’s use the show_plot()
function with
inspect_cat()
‣ Use show_plot()
to visualize categorical
distributions
‣ Now let’s run this on the non_adherence
dataset:
‣ We can identify some data issues with categorical variables
Variable Age_35
: Inconsistent use of
capitalization
Variable sex
: Inconsistent naming
Variable NA
: Empty, requires cleaning
inspectdf for categorical variables
Complete the following code to obtain a visual summary of the
categorical variables in the missed_ops
dataset.
How many potential data issues can you spot?
‣ 3
‣ For numeric variables, use inspect_num()
‣ Common issue: Categorical variables encoded as numeric
Use inspect_num
to create a histograms of your numeric
variables in the missed_ops
dataset. Are the numeric
variable types correct?
gtsummary::tbl_summary()
‣ tbl_summary()
from gtsummary
is ideal for
obtaining frequencies and percentages of our categorical variables
‣ We will see a tibble format and a snapshot of the output for our
non_adherence
dataset
## NOTE: YOU DO NOT NEED THE AS_TIBBLE() FUNCTION USED IN THE VIDEO
gtsummary::tbl_summary(non_adherence)
Characteristic | N = 1,4131 |
---|---|
patient_id | 7,355 (4,705, 10,098) |
District | |
1 | 925 (65%) |
2 | 488 (35%) |
Health unit | |
1 | 678 (48%) |
2 | 247 (17%) |
3 | 488 (35%) |
Sex | |
F | 1,084 (77%) |
Male | 329 (23%) |
Age_35 | |
over 35 | 437 (31%) |
Under 35 | 976 (69%) |
Age at ART initiation | 30 (25, 37) |
Education | |
None | 128 (20%) |
Primary | 335 (53%) |
Secondary | 153 (24%) |
Technical | 17 (2.7%) |
University | 4 (0.6%) |
Unknown | 776 |
Occupation | |
Accountant | 1 (<0.1%) |
Administrator | 1 (<0.1%) |
Agriculture technician | 3 (0.2%) |
Artist | 1 (<0.1%) |
bartender | 1 (<0.1%) |
Basic service agent | 2 (0.2%) |
Boat captain | 1 (<0.1%) |
Business | 3 (0.2%) |
Commercial | 18 (1.5%) |
cook | 3 (0.2%) |
Correctional Agent | 1 (<0.1%) |
counselor | 1 (<0.1%) |
Domestic | 832 (68%) |
driver | 5 (0.4%) |
Electrician | 5 (0.4%) |
Farmer | 132 (11%) |
Fireman | 1 (<0.1%) |
fisherman | 2 (0.2%) |
Journalist | 1 (<0.1%) |
Judge | 1 (<0.1%) |
Lab technician | 1 (<0.1%) |
Locksmith | 2 (0.2%) |
Mechanic | 3 (0.2%) |
Military | 3 (0.2%) |
Mill operator | 2 (0.2%) |
miner | 1 (<0.1%) |
Office worker | 1 (<0.1%) |
Police | 3 (0.2%) |
professor | 11 (0.9%) |
Professor | 35 (2.9%) |
Receptionist | 1 (<0.1%) |
Retired | 2 (0.2%) |
rock cutter | 9 (0.7%) |
seamstress | 1 (<0.1%) |
Secretary | 1 (<0.1%) |
Security | 13 (1.1%) |
self employed | 26 (2.1%) |
Servant | 2 (0.2%) |
shop owner | 1 (<0.1%) |
Student | 68 (5.6%) |
Superior technician | 1 (<0.1%) |
Taxi driver | 1 (<0.1%) |
teacher | 1 (<0.1%) |
Teacher | 4 (0.3%) |
tracter driver | 1 (<0.1%) |
Traditional healer | 2 (0.2%) |
Truck driver | 1 (<0.1%) |
Water technician | 1 (<0.1%) |
Wood worker | 4 (0.3%) |
Worker | 3 (0.2%) |
Unknown | 193 |
Civil...status | |
Divorced | 1 (<0.1%) |
Single | 215 (21%) |
Stable Union | 712 (71%) |
Widowed | 73 (7.3%) |
Unknown | 412 |
WHO status at ART initiaiton | |
1 | 757 (55%) |
2 | 187 (14%) |
3 | 364 (27%) |
4 | 60 (4.4%) |
Unknown | 45 |
BMI_Initiation_Art | 20.1 (18.1, 22.2) |
Unknown | 588 |
CD4_Initiation_Art | 343 (233, 538) |
Unknown | 674 |
regimen.1 | |
1 | 1 (<0.1%) |
2 | 59 (4.2%) |
3 | 531 (38%) |
4 | 42 (3.0%) |
5 | 10 (0.7%) |
6 | 758 (54%) |
7 | 12 (0.8%) |
Nr_of_pills_day | |
1 | 758 (54%) |
2 | 585 (41%) |
3 | 70 (5.0%) |
NA | 0 (NA%) |
Unknown | 1,413 |
1 Median (Q1, Q3); n (%) |
‣ The summary includes frequencies and percentages for categorical variables, and the median and interquartile range for numeric variables
Use tbl_summary()
to output a summary of your
missed_ops
dataset. Can you identify any additional data
issues?
Characteristic | N = 2011 |
---|---|
sex | |
Female | 75 (37%) |
M | 63 (31%) |
Male | 63 (31%) |
clage | |
>50 | 33 (16%) |
18-29 | 56 (28%) |
30-49 | 112 (56%) |
origine | |
Others | 29 (14%) |
Sub Saharan Africa | 66 (33%) |
Western World | 106 (53%) |
adquired | |
Heterosexual | 114 (57%) |
IV drug use | 9 (4.5%) |
MSM | 68 (34%) |
Other | 10 (5.0%) |
chronic | 40 (20%) |
cd4 | 293 (147, 452) |
cd4_category | |
1 | 66 (33%) |
2 | 54 (27%) |
3 | 42 (21%) |
4 | 39 (19%) |
acute | |
Not acute | 171 (85%) |
yes | 30 (15%) |
latepresent | 106 (53%) |
reasonsdx | |
AIDS defining illness | 21 (10%) |
Disease for which the prevalence of HIV is >1% | 59 (29%) |
Epi risk | 21 (10%) |
Epidemiological risk | 21 (10%) |
Introduction of immunosuppressive treatment | 1 (0.5%) |
Patient initiated | 28 (14%) |
Pregnancy | 14 (7.0%) |
Suspicion of acute HIV infection | 36 (18%) |
testlocal | |
Anonymous consultation | 26 (20%) |
CHUV/PMU outpatient | 41 (31%) |
Primary care physician | 64 (49%) |
Unknown | 70 |
numberconsult | 1.00 (0.00, 3.00) |
consultcat | |
>5 previous consultations | 29 (14%) |
1-5 previous consultations | 82 (41%) |
No previous consultation | 90 (45%) |
cattimenegtestdx | |
HIV testing the previous year | 28 (14%) |
Last HIV test >1 before diagnosis | 54 (27%) |
No previous HIV testing | 119 (59%) |
missed_ops | 0.00 (0.00, 3.00) |
Unknown | 2 |
NaN. | 0 (NA%) |
Unknown | 201 |
1 n (%); Median (Q1, Q3) |
By familiarizing ourselves with the data, we have been able to identify some potential problems that may need to be addressed before the data are used in an analysis.
The problems we identified in the non_adherence
dataset
were:
NA
variable is a completely empty column, and can
be removed.Age_35
variable has Under 35
and
over 35
, the Sex
variable has F
and Male
, and the Occupation
variable has
inconsistent capitalization.)District
, Health unit
,
WHO status at ART initiaiton
, regimen.1
, and
Nr_of_pills_day
.And as you have seen, the actual code needed to do this data exploration is very little; other R developers have done the difficult work for us by building amazing packages to quickly scan datasets and identify issues.
From the next lesson, we will begin to take on these identified issues one by one, starting with the problem of inconsistent, messy variable names.
See you in the next lesson!
inspect_cat()
acute
has 2 levels: Not acute
and yes
. This should be standardized.sex
has 3 levels: Female
,
Male
, and M
. The M
should be
changed to Male
.inspect_num()
cd4category
should be a factor
variabaletbl_summary()
Characteristic | N = 2011 |
---|---|
sex | |
Female | 75 (37%) |
M | 63 (31%) |
Male | 63 (31%) |
clage | |
>50 | 33 (16%) |
18-29 | 56 (28%) |
30-49 | 112 (56%) |
origine | |
Others | 29 (14%) |
Sub Saharan Africa | 66 (33%) |
Western World | 106 (53%) |
adquired | |
Heterosexual | 114 (57%) |
IV drug use | 9 (4.5%) |
MSM | 68 (34%) |
Other | 10 (5.0%) |
chronic | 40 (20%) |
cd4 | 293 (147, 452) |
cd4_category | |
1 | 66 (33%) |
2 | 54 (27%) |
3 | 42 (21%) |
4 | 39 (19%) |
acute | |
Not acute | 171 (85%) |
yes | 30 (15%) |
latepresent | 106 (53%) |
reasonsdx | |
AIDS defining illness | 21 (10%) |
Disease for which the prevalence of HIV is >1% | 59 (29%) |
Epi risk | 21 (10%) |
Epidemiological risk | 21 (10%) |
Introduction of immunosuppressive treatment | 1 (0.5%) |
Patient initiated | 28 (14%) |
Pregnancy | 14 (7.0%) |
Suspicion of acute HIV infection | 36 (18%) |
testlocal | |
Anonymous consultation | 26 (20%) |
CHUV/PMU outpatient | 41 (31%) |
Primary care physician | 64 (49%) |
Unknown | 70 |
numberconsult | 1.00 (0.00, 3.00) |
consultcat | |
>5 previous consultations | 29 (14%) |
1-5 previous consultations | 82 (41%) |
No previous consultation | 90 (45%) |
cattimenegtestdx | |
HIV testing the previous year | 28 (14%) |
Last HIV test >1 before diagnosis | 54 (27%) |
No previous HIV testing | 119 (59%) |
missed_ops | 0.00 (0.00, 3.00) |
Unknown | 2 |
NaN. | 0 (NA%) |
Unknown | 201 |
1 n (%); Median (Q1, Q3) |
reasonsdx
, there are
the categories Epidemiological risk
and
Epi risk
which should be a single categorySome material in this lesson was adapted from the following sources:
Batra, Neale, et al. The Epidemiologist R Handbook. 2021.Cleaning data and core functions. https://epirhandbook.com/en/new_pages/cleaning.html
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr/ (website), https://github.com/ropensci/skimr/.
The following team members contributed to this lesson: