Chapter 4 Clean Data
4.1 Replace a value with NA
Use dplyr::na_if() if you have a value coded in your data (e.g., 999) that you want to convert to NA.
example_data = dplyr::tribble(~name, ~bday_month,
"Ryan", 10,
"Z", 3,
"Jen", 999,
"Tristin", 999,
"Cassidy", 6)
example_data
## # A tibble: 5 x 2
## name bday_month
## <chr> <dbl>
## 1 Ryan 10
## 2 Z 3
## 3 Jen 999
## 4 Tristin 999
## 5 Cassidy 6
example_data$bday_month = na_if(example_data$bday_month, 999) # example doing one column at a time
example_data
## # A tibble: 5 x 2
## name bday_month
## <chr> <dbl>
## 1 Ryan 10
## 2 Z 3
## 3 Jen NA
## 4 Tristin NA
## 5 Cassidy 6
example_data %>% # can also pass the data to mutate and do it the tidyverse way
mutate(bday_month = na_if(bday_month, 999))
## # A tibble: 5 x 2
## name bday_month
## <chr> <dbl>
## 1 Ryan 10
## 2 Z 3
## 3 Jen NA
## 4 Tristin NA
## 5 Cassidy 6
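When the same code stands for "missing" in several columns, na_if() can be combined with dplyr::across() so every column is handled in one step. This is a minimal sketch. The survey_data tibble and its bday_day column are made up for illustration and are not part of the example above.

```r
library(dplyr)

# Hypothetical data: 999 marks a missing value in both numeric columns
survey_data <- tribble(~name, ~bday_month, ~bday_day,
                       "Ryan", 10, 999,
                       "Jen", 999, 14,
                       "Cassidy", 6, 999)

# Apply na_if() to each listed column
cleaned <- survey_data %>%
  mutate(across(c(bday_month, bday_day), ~ na_if(.x, 999)))

cleaned
```

Values that are not equal to 999 are left untouched, so this is safe to run on columns that only sometimes use the code.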
4.2 Replace NA’s with a value
tidyr::replace_na() is very useful if you have some NA’s in your data and you want to fill them in with some value.
example_data = tibble::tribble(~name, ~fav_color, ~fav_food,
"Ryan", "green", "Mexican",
"Cassidy", "blue", NA,
"Z", NA, NA,
"Tristin", "purple", NA,
"Tarika", NA, NA,
"Jen", NA, "Italian")
example_data
## # A tibble: 6 x 3
## name fav_color fav_food
## <chr> <chr> <chr>
## 1 Ryan green Mexican
## 2 Cassidy blue <NA>
## 3 Z <NA> <NA>
## 4 Tristin purple <NA>
## 5 Tarika <NA> <NA>
## 6 Jen <NA> Italian
# replace NA's in one col
tidyr::replace_na(example_data$fav_food, "MISSING")
## [1] "Mexican" "MISSING" "MISSING" "MISSING" "MISSING" "Italian"
# replace in multiple columns
example_data %>% mutate(across(c(fav_color, fav_food), replace_na, "MISSING"))
## # A tibble: 6 x 3
## name fav_color fav_food
## <chr> <chr> <chr>
## 1 Ryan green Mexican
## 2 Cassidy blue MISSING
## 3 Z MISSING MISSING
## 4 Tristin purple MISSING
## 5 Tarika MISSING MISSING
## 6 Jen MISSING Italian
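replace_na() also accepts a whole data frame plus a named list, which is handy when each column needs a different fill value. The sketch below assumes a small made-up tibble (colors_food) and placeholder fill strings chosen only for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical data with NAs in two character columns
colors_food <- tribble(~name, ~fav_color, ~fav_food,
                       "Ryan", "green", "Mexican",
                       "Z", NA, NA,
                       "Jen", NA, "Italian")

# One fill value per column, given as a named list
filled <- replace_na(colors_food,
                     list(fav_color = "no color given",
                          fav_food = "no food given"))

filled
```

Columns that are not named in the list are left as they are.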
4.3 Identify columns or rows with Missing values
is.na() is the base R way to identify, in a TRUE/FALSE manner, whether or not there are missing values in a vector.
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
## [1] FALSE FALSE FALSE TRUE
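The same idea extends from a single vector to a whole data frame. As a rough sketch using only base R, is.na() on a data frame returns a TRUE/FALSE matrix, which can be summed by column, and complete.cases() flags rows with no missing values. The scores data frame here is invented for the example.

```r
# Hypothetical data frame with a few missing values
scores <- data.frame(id = 1:4,
                     math = c(90, NA, 75, 88),
                     art = c(NA, NA, 80, 92))

# Number of missing values in each column
missing_per_column <- colSums(is.na(scores))

# Names of the columns that contain at least one NA
columns_with_na <- names(scores)[missing_per_column > 0]

# Row numbers that contain at least one NA
rows_with_na <- which(!complete.cases(scores))

missing_per_column
columns_with_na
rows_with_na
```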
4.4 Find the percentage of a variable that is missing
It is sometimes necessary to check this before conducting an analysis. This requires my package, legaldmlab.
?legaldmlab::count_missing
mtcars %>%
select(hp:drat) %>%
legaldmlab::count_missing()
## # A tibble: 2 x 3
## variable missing_count percent_missing
## <chr> <int> <chr>
## 1 hp 0 0.0%
## 2 drat 0 0.0%
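If legaldmlab is not installed, a similar percentage can be worked out in base R. This is only a rough equivalent and not how count_missing() is implemented. It relies on the fact that the mean of a TRUE/FALSE column is the share of TRUE values. The visits data frame is made up so that the result is not all zeros.

```r
# Hypothetical data frame: 1 of 4 ages and 3 of 4 weights are missing
visits <- data.frame(age = c(34, NA, 51, 29),
                     weight = c(NA, NA, 80, NA))

# Share of NA values per column, expressed as a percentage
percent_missing <- colMeans(is.na(visits)) * 100

percent_missing
```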
4.6 Dropping Missing values from the data set
Use tidyr::drop_na() to remove rows with missing values.
example_data = dplyr::tribble(~name, ~bday_month, ~car,
"Ryan", 10, "kia",
"Z", NA, "toyota",
"Jen", NA, NA,
"Tristin", 999, NA,
"Cassidy", 6, "honda")
knitr::kable(example_data)
| name | bday_month | car |
|---|---|---|
| Ryan | 10 | kia |
| Z | NA | toyota |
| Jen | NA | NA |
| Tristin | 999 | NA |
| Cassidy | 6 | honda |
example_data %>% drop_na() # with nothing specified, it drops ALL rows that have >= 1 missing value
## # A tibble: 2 x 3
## name bday_month car
## <chr> <dbl> <chr>
## 1 Ryan 10 kia
## 2 Cassidy 6 honda
example_data %>% drop_na(car) # drops only rows with values missing in the specified column
## # A tibble: 3 x 3
## name bday_month car
## <chr> <dbl> <chr>
## 1 Ryan 10 kia
## 2 Z NA toyota
## 3 Cassidy 6 honda
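Note that drop_na() only looks for real NA values, so a coded value such as Tristin's 999 is not treated as missing. One possible way to handle that, sketched here under the assumption that 999 always means "missing" in bday_month, is to run na_if() first and then drop. The cars_data name is used only to keep this example self-contained.

```r
library(dplyr)
library(tidyr)

# Same values as the example above, rebuilt under a separate name
cars_data <- tribble(~name, ~bday_month, ~car,
                     "Ryan", 10, "kia",
                     "Z", NA, "toyota",
                     "Jen", NA, NA,
                     "Tristin", 999, NA,
                     "Cassidy", 6, "honda")

# Convert the 999 code to NA, then drop rows missing a birthday month
cleaned <- cars_data %>%
  mutate(bday_month = na_if(bday_month, 999)) %>%
  drop_na(bday_month)

cleaned
```

Only Ryan and Cassidy remain, because Z, Jen, and Tristin all end up with NA in bday_month.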