1 Introduction

Data cleaning is the process of transforming raw, “messy” data into reliable data that can be properly analyzed. This entails identifying inaccurate, incomplete, or improbable data points and resolving data inconsistencies or errors, as well as renaming variable names to make them more clear and simple to manipulate.

Data cleaning tasks can often be tedious and time-consuming. A common joke among data analysts goes “80% of data science is cleaning the data and the remaining 20% is complaining about cleaning the data.” But data cleaning is an essential step of the data analysis process. A little bit of cleaning at the start of the data analysis process will go a long way to improving the quality of your analyses and the ease with which these analyses can be done. And a range of packages and functions in R can significantly simplify the data cleaning process.

In this lesson, we will begin to look at a typical data cleaning pipeline in R. The cleaning steps covered here are unlikely to be exactly what is needed for your own datasets, but they will certainly be a good starting point.

Let’s get started!

2 Learning objectives

You can diagnose dataset issues that warrant data cleaning through functions such as:
- visdat::vis_dat()
- inspectdf::inspect_cat()
- inspectdf::inspect_num()
- gtsummary::tbl_summary()

3 Packages

The packages loaded below will be required for this lesson:

if(!require("pacman")) install.packages("pacman")
pacman::p_load(visdat, 
               inspectdf,
               gtsummary,
               tidyverse)

4 Dataset

The primary dataset comes from a study conducted in Zambezia, Mozambique.
It examines individual factors associated with the time until non-adherence to HIV treatment.

non_adherence <- read_csv(here("data/non_adherence_moz.csv"))

## New names:
## Rows: 1413 Columns: 15
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): Sex, Age_35, Education, Occupation, Civil...status dbl (9): patient_id,
## District, Health unit, Age at ART initiation, WHO statu... lgl (1): NA
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `Occupation...9` -> `Occupation`

5 Visualizing Missing Data with `visdat::vis_dat()`

The vis_dat() function from the visdat package allows us to visualize data types and missing values.
Each row of the dataset is represented by a line on the plot.
Let’s create a mock dataset containing 8 patients and their COVID-19 diagnosis and recovery information.

covid_pat <- tribble(
  ~Patient_ID, ~Age, ~Gender,  ~Covid_Strain, ~Exposure,    ~Recovery_Days,
  1,           25,   "Male",   "Alpha",        NA,          10,
  2,           32,   "Female", "Delta",       "Hospital",   15,
  3,           45,   "Male",   "Beta",        "Travel",      7,
  4,           19,   "Female", "Omicron",      NA,          21,
  5,           38,   "Male",   "Alpha",       "Unknown",    14,
  6,           55,   "Female",  NA,           "Community",  19,
  7,           28,   "Female", "Omicron",      NA,           8,
  8,           NA,   "Female", "Omicron",     "Travel",     26
)
covid_pat

## # A tibble: 8 × 6
##   Patient_ID   Age Gender Covid_Strain Exposure  Recovery_Days
##        <dbl> <dbl> <chr>  <chr>        <chr>             <dbl>
## 1          1    25 Male   Alpha        <NA>                 10
## 2          2    32 Female Delta        Hospital             15
## 3          3    45 Male   Beta         Travel                7
## 4          4    19 Female Omicron      <NA>                 21
## 5          5    38 Male   Alpha        Unknown              14
## 6          6    55 Female <NA>         Community            19
## 7          7    28 Female Omicron      <NA>                  8
## 8          8    NA Female Omicron      Travel               26

Let’s use the vis_dat() function to visualize data types and missing values.

vis_dat(covid_pat)

Different colors are used to illustrate string variables (pink), numeric variables (blue), and missing values (gray).
Now, let’s look at our real-world dataset, which is much larger and messy.

As a reminder, here is our non_adherence dataset.

non_adherence

vis_dat(non_adherence)

The NA column is completely empty.
Several variables have a lot of missing values.
Some variable names are unclear/uncleaned.

(NOTE: Answers are at the bottom of the page. Try to answer the questions yourself before checking.)

The following dataset was adapted from a study that investigated missed opportunities for HIV testing among patients newly presenting for HIV care at a Swiss university hospital. The full dataset can be found here.

missed_ops <- read_csv(here("data/HIV_missed_ops.csv"))

Use the vis_dat() function to get a visual representation of the data. What potential issues can you spot based on the output?

missed_ops <- read_csv(here("data/HIV_missed_ops.csv"))

## Rows: 201 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): sex, clage, origine, adquired, chronic, acute, latepresent, reason...
## dbl  (4): cd4, cd4_category, numberconsult, missed_ops
## lgl  (1): NaN.
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

vis_dat(missed_ops)

6 Visualization of summary statistics with `inspectdf`

‣ Using the inspectdf::inspect_cat() and inspectdf::inspect_num() functions to get variable summaries in rich graphical form

‣ inspect_cat(): for a summary of categorical variables

inspectdf::inspect_cat(covid_pat)

## # A tibble: 3 × 5
##   col_name       cnt common  common_pcnt levels          
##   <chr>        <int> <chr>         <dbl> <named list>    
## 1 Covid_Strain     5 Omicron        37.5 <tibble [5 × 3]>
## 2 Exposure         5 <NA>           37.5 <tibble [5 × 3]>
## 3 Gender           2 Female         62.5 <tibble [2 × 3]>

‣ Important information is hidden in the levels column

‣ To see this, let’s use the show_plot() function with inspect_cat()

‣ Use show_plot() to visualize categorical distributions

inspect_cat(covid_pat) %>%
 show_plot()

‣ Now let’s run this on the non_adherence dataset:

inspect_cat(non_adherence) %>%
 show_plot()

‣ We can identify some data issues with categorical variables

Variable Age_35: Inconsistent use of capitalization
Variable sex: Inconsistent naming
Variable NA: Empty, requires cleaning

inspectdf for categorical variables

Complete the following code to obtain a visual summary of the categorical variables in the missed_ops dataset.

inspect_cat(missed_ops) %>% 
  show_plot()

How many potential data issues can you spot?

‣ 3

‣ For numeric variables, use inspect_num()

inspect_num(non_adherence) %>%
  show_plot()

‣ Common issue: Categorical variables encoded as numeric

Use inspect_num to create a histograms of your numeric variables in the missed_ops dataset. Are the numeric variable types correct?

inspect_num(missed_ops) %>%
  show_plot()

7 Exploring categorical variables with `gtsummary::tbl_summary()`

‣ tbl_summary() from gtsummary is ideal for obtaining frequencies and percentages of our categorical variables

‣ We will see a tibble format and a snapshot of the output for our non_adherence dataset

## NOTE: YOU DO NOT NEED THE AS_TIBBLE() FUNCTION USED IN THE VIDEO
gtsummary::tbl_summary(non_adherence)

Characteristic	N = 1,413¹
patient_id	7,355 (4,705, 10,098)
District
1	925 (65%)
2	488 (35%)
Health unit
1	678 (48%)
2	247 (17%)
3	488 (35%)
Sex
F	1,084 (77%)
Male	329 (23%)
Age_35
over 35	437 (31%)
Under 35	976 (69%)
Age at ART initiation	30 (25, 37)
Education
None	128 (20%)
Primary	335 (53%)
Secondary	153 (24%)
Technical	17 (2.7%)
University	4 (0.6%)
Unknown	776
Occupation
Accountant	1 (<0.1%)
Administrator	1 (<0.1%)
Agriculture technician	3 (0.2%)
Artist	1 (<0.1%)
bartender	1 (<0.1%)
Basic service agent	2 (0.2%)
Boat captain	1 (<0.1%)
Business	3 (0.2%)
Commercial	18 (1.5%)
cook	3 (0.2%)
Correctional Agent	1 (<0.1%)
counselor	1 (<0.1%)
Domestic	832 (68%)
driver	5 (0.4%)
Electrician	5 (0.4%)
Farmer	132 (11%)
Fireman	1 (<0.1%)
fisherman	2 (0.2%)
Journalist	1 (<0.1%)
Judge	1 (<0.1%)
Lab technician	1 (<0.1%)
Locksmith	2 (0.2%)
Mechanic	3 (0.2%)
Military	3 (0.2%)
Mill operator	2 (0.2%)
miner	1 (<0.1%)
Office worker	1 (<0.1%)
Police	3 (0.2%)
professor	11 (0.9%)
Professor	35 (2.9%)
Receptionist	1 (<0.1%)
Retired	2 (0.2%)
rock cutter	9 (0.7%)
seamstress	1 (<0.1%)
Secretary	1 (<0.1%)
Security	13 (1.1%)
self employed	26 (2.1%)
Servant	2 (0.2%)
shop owner	1 (<0.1%)
Student	68 (5.6%)
Superior technician	1 (<0.1%)
Taxi driver	1 (<0.1%)
teacher	1 (<0.1%)
Teacher	4 (0.3%)
tracter driver	1 (<0.1%)
Traditional healer	2 (0.2%)
Truck driver	1 (<0.1%)
Water technician	1 (<0.1%)
Wood worker	4 (0.3%)
Worker	3 (0.2%)
Unknown	193
Civil...status
Divorced	1 (<0.1%)
Single	215 (21%)
Stable Union	712 (71%)
Widowed	73 (7.3%)
Unknown	412
WHO status at ART initiaiton
1	757 (55%)
2	187 (14%)
3	364 (27%)
4	60 (4.4%)
Unknown	45
BMI_Initiation_Art	20.1 (18.1, 22.2)
Unknown	588
CD4_Initiation_Art	343 (233, 538)
Unknown	674
regimen.1
1	1 (<0.1%)
2	59 (4.2%)
3	531 (38%)
4	42 (3.0%)
5	10 (0.7%)
6	758 (54%)
7	12 (0.8%)
Nr_of_pills_day
1	758 (54%)
2	585 (41%)
3	70 (5.0%)
NA	0 (NA%)
Unknown	1,413
¹ Median (Q1, Q3); n (%)

‣ The summary includes frequencies and percentages for categorical variables, and the median and interquartile range for numeric variables

Use tbl_summary() to output a summary of your missed_ops dataset. Can you identify any additional data issues?

gtsummary::tbl_summary(missed_ops)

Characteristic	N = 201¹
sex
Female	75 (37%)
M	63 (31%)
Male	63 (31%)
clage
>50	33 (16%)
18-29	56 (28%)
30-49	112 (56%)
origine
Others	29 (14%)
Sub Saharan Africa	66 (33%)
Western World	106 (53%)
adquired
Heterosexual	114 (57%)
IV drug use	9 (4.5%)
MSM	68 (34%)
Other	10 (5.0%)
chronic	40 (20%)
cd4	293 (147, 452)
cd4_category
1	66 (33%)
2	54 (27%)
3	42 (21%)
4	39 (19%)
acute
Not acute	171 (85%)
yes	30 (15%)
latepresent	106 (53%)
reasonsdx
AIDS defining illness	21 (10%)
Disease for which the prevalence of HIV is >1%	59 (29%)
Epi risk	21 (10%)
Epidemiological risk	21 (10%)
Introduction of immunosuppressive treatment	1 (0.5%)
Patient initiated	28 (14%)
Pregnancy	14 (7.0%)
Suspicion of acute HIV infection	36 (18%)
testlocal
Anonymous consultation	26 (20%)
CHUV/PMU outpatient	41 (31%)
Primary care physician	64 (49%)
Unknown	70
numberconsult	1.00 (0.00, 3.00)
consultcat
>5 previous consultations	29 (14%)
1-5 previous consultations	82 (41%)
No previous consultation	90 (45%)
cattimenegtestdx
HIV testing the previous year	28 (14%)
Last HIV test >1 before diagnosis	54 (27%)
No previous HIV testing	119 (59%)
missed_ops	0.00 (0.00, 3.00)
Unknown	2
NaN.	0 (NA%)
Unknown	201
¹ n (%); Median (Q1, Q3)

8 Wrap up!

By familiarizing ourselves with the data, we have been able to identify some potential problems that may need to be addressed before the data are used in an analysis.

The problems we identified in the non_adherence dataset were:

The NA variable is a completely empty column, and can be removed.
Variables names are not clean and must be standardized.
There are inconsistencies in the way the levels have been coded (e.g., the Age_35 variable has Under 35 and over 35, the Sex variable has F and Male, and the Occupation variable has inconsistent capitalization.)
Many numeric variables should be coded as factors: District, Health unit, WHO status at ART initiaiton, regimen.1, and Nr_of_pills_day.

And as you have seen, the actual code needed to do this data exploration is very little; other R developers have done the difficult work for us by building amazing packages to quickly scan datasets and identify issues.

From the next lesson, we will begin to take on these identified issues one by one, starting with the problem of inconsistent, messy variable names.

See you in the next lesson!

Answer Key

Q: Spotting data issues with `vis_dat()`

vis_dat(missed_ops)

The column NaN is completely empty

Q: Spotting data issues with `inspect_cat()`

inspect_cat(missed_ops) %>%
  show_plot()

The variable acute has 2 levels: Not acute and yes. This should be standardized.
The varibale sex has 3 levels: Female, Male, and M. The M should be changed to Male.

Q: Variable types with `inspect_num()`

inspect_num(missed_ops) %>%
  show_plot()

The variable cd4category should be a factor variabale

Q: Spotting data issues with `tbl_summary()`

tbl_summary(missed_ops)

Characteristic	N = 201¹
sex
Female	75 (37%)
M	63 (31%)
Male	63 (31%)
clage
>50	33 (16%)
18-29	56 (28%)
30-49	112 (56%)
origine
Others	29 (14%)
Sub Saharan Africa	66 (33%)
Western World	106 (53%)
adquired
Heterosexual	114 (57%)
IV drug use	9 (4.5%)
MSM	68 (34%)
Other	10 (5.0%)
chronic	40 (20%)
cd4	293 (147, 452)
cd4_category
1	66 (33%)
2	54 (27%)
3	42 (21%)
4	39 (19%)
acute
Not acute	171 (85%)
yes	30 (15%)
latepresent	106 (53%)
reasonsdx
AIDS defining illness	21 (10%)
Disease for which the prevalence of HIV is >1%	59 (29%)
Epi risk	21 (10%)
Epidemiological risk	21 (10%)
Introduction of immunosuppressive treatment	1 (0.5%)
Patient initiated	28 (14%)
Pregnancy	14 (7.0%)
Suspicion of acute HIV infection	36 (18%)
testlocal
Anonymous consultation	26 (20%)
CHUV/PMU outpatient	41 (31%)
Primary care physician	64 (49%)
Unknown	70
numberconsult	1.00 (0.00, 3.00)
consultcat
>5 previous consultations	29 (14%)
1-5 previous consultations	82 (41%)
No previous consultation	90 (45%)
cattimenegtestdx
HIV testing the previous year	28 (14%)
Last HIV test >1 before diagnosis	54 (27%)
No previous HIV testing	119 (59%)
missed_ops	0.00 (0.00, 3.00)
Unknown	2
NaN.	0 (NA%)
Unknown	201
¹ n (%); Median (Q1, Q3)

Example issue: for the variable reasonsdx, there are the categories Epidemiological risk and Epi risk which should be a single category

References

Some material in this lesson was adapted from the following sources:

Batra, Neale, et al. The Epidemiologist R Handbook. 2021.Cleaning data and core functions. https://epirhandbook.com/en/new_pages/cleaning.html
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr/ (website), https://github.com/ropensci/skimr/.

Contributors

The following team members contributed to this lesson:

AMANDA MCKINLEY
R Developer and Instructor, the GRAPH Network

KENE DAVID NWOSU
Data analyst, the GRAPH Network
Passionate about world improvement

LAURE VANCAUWENBERGHE
Data analyst, the GRAPH Network
A firm believer in science for good, striving to ally programming, health and education

Data Cleaning 1: Data Diagnostics

Camille Beatrice Valera

Kene David Nwosu

Amanda McKinley

Laure Vancauwenberghe

2024-11-21

1 Introduction

2 Learning objectives

3 Packages

4 Dataset

5 Visualizing Missing Data with `visdat::vis_dat()`

6 Visualization of summary statistics with `inspectdf`

7 Exploring categorical variables with `gtsummary::tbl_summary()`

8 Wrap up!

Answer Key

Q: Spotting data issues with `vis_dat()`

Q: Spotting data issues with `inspect_cat()`

Q: Variable types with `inspect_num()`

Q: Spotting data issues with `tbl_summary()`

References

Contributors

Data Cleaning 1: Data Diagnostics

Camille Beatrice Valera

Kene David Nwosu

Amanda McKinley

Laure Vancauwenberghe

2024-11-21

1 Introduction

2 Learning objectives

3 Packages

4 Dataset

5 Visualizing Missing Data with visdat::vis_dat()

6 Visualization of summary statistics with inspectdf

7 Exploring categorical variables with gtsummary::tbl_summary()

8 Wrap up!

Answer Key

Q: Spotting data issues with vis_dat()

Q: Spotting data issues with inspect_cat()

Q: Variable types with inspect_num()

Q: Spotting data issues with tbl_summary()

References

Contributors

5 Visualizing Missing Data with `visdat::vis_dat()`

6 Visualization of summary statistics with `inspectdf`

7 Exploring categorical variables with `gtsummary::tbl_summary()`

Q: Spotting data issues with `vis_dat()`

Q: Spotting data issues with `inspect_cat()`

Q: Variable types with `inspect_num()`

Q: Spotting data issues with `tbl_summary()`