Chapter 3 Wrangle Data
This chapter contains useful tips on wrangling (i.e., manipulating) data. If you need to know to do to things like create new variables, split one variable into multiple variables, pivot a data set from wide to long, etc., look no further.
If you want a pretty good intro tutorial to the dplyr
package, click here
3.1 Joining or Splitting
Joining and splitting data is pretty straightforward….
3.1.1 Whole Data Sets
The code below is from this excellent tutorial
set.seed(2018)
=data.frame(customer_id=c(1:10),
df1product=sample(c('toaster','TV','Dishwasher'),10,replace = TRUE))
=data.frame(customer_id=c(sample(df1$customer_id, 5)),state=sample(c('New York','California'),5,replace = TRUE))
df2
=tibble::as_tibble(df1)
df1=tibble::as_tibble(df2)
df2
# df1 =left table
# df2= right table
Inner join - retains only rows with values that appear in both tables, and matches by keys.
If you’re joining two Qualtrics surveys together, this is most likely the one you want to use (e.g. matching by participant name, and only keeping rows in the joined data set for participants that have responses logged in both survey 1 and survey 2
%>% inner_join(df2,by='customer_id') df1
## # A tibble: 5 x 3
## customer_id product state
## <int> <chr> <chr>
## 1 1 Dishwasher New York
## 2 3 Dishwasher New York
## 3 6 toaster New York
## 4 8 Dishwasher New York
## 5 9 Dishwasher New York
Left join - returns everything in the left, and rows with matching keys in the right
%>% left_join(df2,by='customer_id') df1
## # A tibble: 10 x 3
## customer_id product state
## <int> <chr> <chr>
## 1 1 Dishwasher New York
## 2 2 Dishwasher <NA>
## 3 3 Dishwasher New York
## 4 4 toaster <NA>
## 5 5 TV <NA>
## 6 6 toaster New York
## 7 7 toaster <NA>
## 8 8 Dishwasher New York
## 9 9 Dishwasher New York
## 10 10 TV <NA>
Right join - returns everything in the right, and rows with matching keys in the left
%>% right_join(df2,by='customer_id') df1
## # A tibble: 5 x 3
## customer_id product state
## <int> <chr> <chr>
## 1 1 Dishwasher New York
## 2 3 Dishwasher New York
## 3 6 toaster New York
## 4 8 Dishwasher New York
## 5 9 Dishwasher New York
# note: example if the customer id column was named something different in the second df
#df1 %>% left_join(df2,by=c('customer_id'='name2'))
Full join - retain all rows from both tables, and join matching keys in both right and left
%>% full_join(df2,by='customer_id') df1
## # A tibble: 10 x 3
## customer_id product state
## <int> <chr> <chr>
## 1 1 Dishwasher New York
## 2 2 Dishwasher <NA>
## 3 3 Dishwasher New York
## 4 4 toaster <NA>
## 5 5 TV <NA>
## 6 6 toaster New York
## 7 7 toaster <NA>
## 8 8 Dishwasher New York
## 9 9 Dishwasher New York
## 10 10 TV <NA>
Anti join - returns all rows in the left that do not have matching keys in the right
%>% anti_join(df2,by='customer_id') df1
## # A tibble: 5 x 2
## customer_id product
## <int> <chr>
## 1 2 Dishwasher
## 2 4 toaster
## 3 5 TV
## 4 7 toaster
## 5 10 TV
3.2 Selecting/extracting specific variables with select()
Sometimes when working with a data set, you want to work with a few specific variables. For instance, maybe you want to view a graph of only reverse-coded variables (which start with the prefix “r”); or maybe you want to create a subset of your data that has a few specific variables removed. For this you can use dplyr::select()
and its associated helper commands
select()
can be thought of as “extract”; it tells R to identify and “extract” a specific variable (or variables)
=mtcars
cars
# select one column
%>% select(mpg)
cars
# select multiple columns, if they are all next to one another
%>% select(mpg:hp)
cars
# select multiple columns by name (when not next to one another) by defining them in a vector
%>% select(c(mpg, hp, wt))
cars
# select only variables that start with a certain prefix/character/pattern/etc.
%>% select(starts_with("d"))
cars
# ...or columns that end with a certain prefix/etc.
%>% select(ends_with("t"))
cars
# ...or contains a certain pattern or string
%>% select(contains("se"))
cars
# select ALL OF the variables in a data set that match those of a pre-defined vector
# first define the names in a vector
=c("hp", "drat", "gear", "carb")
vars
#now use helper
%>% select(all_of(vars))
cars
# select ANY OF the variables in a pre-defined vector
=c("hp", "drat", "watermelon", "grilled_cheese") # only the first two will be in the data
vars_2
%>% select(any_of(vars_2)) # only (and all of) the variables actually PRESENT in the data are pulled
cars
# select only variables of a certain class or type
%>% select(where(is.numeric))
cars %>% select(where(is.character)) cars
Other examples can be seen on THIS LINK for a simple but detailed guide.
3.3 Advanced Filtering techniques
All info here and the code taken from this link; credit goes to Suzan Baert.
3.3.1 Filter based on partial match
Combine filter
with str_detect
to search for pattern matches in a column
3.3.2 Filtering based on multiple conditions
You can filter for more than one value at once, or combine operators to do searches with conditions….
Goal | Code |
---|---|
Return only rows where both conditions are met | filter(condition1, condition2) |
Return all rows where condition 1 is true but condition 2 is not | filter(condition1, !condition2) |
Return rows where condition 1 and/or condition 2 is met | filter(condition1 | condition2) |
Return all rows where only one of the conditions is met, and not when both are met | filter(xor(condition1, condition2) |
The sample code will return all rows with a bodywt above 100 and either have a sleep_total
above 15 or are not part of the Carnivora order.
%>%
msleep select(name, order, sleep_total:bodywt) %>%
filter(bodywt > 100, (sleep_total > 15 | order != "Carnivora"))
3.3.3 Filtering across multiple columns
From Susan Baert’s blog page:
You have three options for how to do this:
filter_all()
, which filters columns based on your further instructionsfilter_if()
, which requires a boolean to indicate which columns to filter on. If that is true, the filter instructions will be followed.filter_at()
, which requires you to specify columns inside avars
argument for which the filtering will be done.
Retain all rows with the pattern match “Ca” inside. Useful for when you want to search for a key pattern (or number) across multiple columns.
%>%
msleep select(name:order, sleep_total, -vore) %>%
filter_all(any_vars(str_detect(., pattern = "Ca")))
%>%
msleep select(name, sleep_total:bodywt) %>%
filter_all(any_vars(. < 0.1))
# You can also switch any_vars to all_vars to filter across the whole data frame
%>%
msleep select(name, sleep_total:bodywt, -awake) %>%
filter_all(all_vars(. > 1))
What if you want to filter specific column types in your data frame, like just date columns, to find a specific date across multiple date columns? Or filter across multiple number columns to find every instance of a specific number? filter_if
is better than filter_all
here because the latter would return a filter search across the whole data frame. By using filter_if
, we can get more specific about which columns to search through.
%>%
msleep select(name:order, sleep_total:sleep_rem) %>%
filter_if(is.character, any_vars(is.na(.)))
The last command, filter_at
, does things differently. It doesn’t filter all columns at once; and you do not have to specify the column type or which columns. This one allows you to indicate which columns to search through like you would within any select statement.
Example: Search through the columns that start with “sleep”, for values in all of those columns above .5.
%>%
msleep select(name, sleep_total:sleep_rem, brainwt:bodywt) %>%
filter_at(vars(contains("sleep")), all_vars(.>5))
3.4 If-then and Case-when
3.4.1 If-then
The premise of an if/then or if/else statement is simple: “If condition 1 is satisfied, perform x operation; if not, then do y”
%>% mutate(power_level=ifelse(mtcars$hp<350, "Low", "High")) %>% head() mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb power_level
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Low
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Low
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Low
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Low
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Low
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Low
This line of code effectively says: if the length in Sepal.Length is >5, set new variable = to “short”; else, set it to “long”
3.4.2 Case-when
When you have 3+ conditions, it’s easier to use case-when. This is a more simple and straightforward approach than nesting multiple if-else commands
= case_when(
My_vector~ value1,
Condition1 ~ value2,
Condition2 ~ value3
Condition3 TRUE ~ valueForEverythingElse #catch all for things that don't meet the above conditions
)
Example:
%>% mutate(size= case_when(cyl==4 ~ "small",
mtcars ==6 ~ "medium",
cyl==8 ~ "large")) %>%
cylselect(c(cyl,size)) %>% head()
## cyl size
## Mazda RX4 6 medium
## Mazda RX4 Wag 6 medium
## Datsun 710 4 small
## Hornet 4 Drive 6 medium
## Hornet Sportabout 8 large
## Valiant 6 medium
3.5 Conditional replacement of values
The following code is useful if you want to replace a value in one column, and the replacement is conditional upon the value in another column.
%>%
mpg mutate(across(.cols = c(displ, cty, hwy),
.fns = ~case_when(cyl == 4L ~ as.numeric(NA),
TRUE ~ as.numeric(.x))))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 audi a4 NA 1999 4 auto(l5) f NA NA p comp~
## 2 audi a4 NA 1999 4 manual(~ f NA NA p comp~
## 3 audi a4 NA 2008 4 manual(~ f NA NA p comp~
## 4 audi a4 NA 2008 4 auto(av) f NA NA p comp~
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manual(~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto(av) f 18 27 p comp~
## 8 audi a4 quattro NA 1999 4 manual(~ 4 NA NA p comp~
## 9 audi a4 quattro NA 1999 4 auto(l5) 4 NA NA p comp~
## 10 audi a4 quattro NA 2008 4 manual(~ 4 NA NA p comp~
## # ... with 224 more rows
%>%
test mutate(across(.cols = c(rank),
.fns = ~case_when(is.na(participant_score) ~ as.numeric(NA),
TRUE ~ as.numeric(.x))))
3.6 Merging variables
Sometimes you’ll have multiple variables and you want to collapse them into a single variable. The pmin()
command is useful for this.
example_data=tribble(~A,~B,~C,
1,NA,NA,
2,NA,NA,
3,NA,NA,
NA,4,NA,
NA,5,NA,
NA,6,NA,
NA,NA,7,
NA,NA,8,
NA,NA,9)
example_data %>%
mutate(accept_reject =
pmin(A,B,C,na.rm = TRUE))
3.7 Apply a function to multiple variables at once
You can either specify each column individually, like above, or tell R to identify columns for you based on their type or their name. This requires adding in one additional verb–either contains() or where() depending on what you want to do.
Two simple examples:
# turn multiple variables into factors
=dplyr::tribble(~color, ~car,
ex_data"red", "corvette",
"blue", "chevelle",
"green", "camaro",
"red", "corvette",
"green", "chevelle",
"yellow", "gto")
::glimpse(ex_data) dplyr
## Rows: 6
## Columns: 2
## $ color <chr> "red", "blue", "green", "red", "green", "yellow"
## $ car <chr> "corvette", "chevelle", "camaro", "corvette", "chevelle", "gto"
%>% mutate(across(c(color, car),factor)) ex_data
## # A tibble: 6 x 2
## color car
## <fct> <fct>
## 1 red corvette
## 2 blue chevelle
## 3 green camaro
## 4 red corvette
## 5 green chevelle
## 6 yellow gto
# round multiple columns to 1 decimal place
%>% mutate(across(c(disp:qsec),round,1)) %>% head() mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.9 2.6 16.5 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.9 2.9 17.0 0 1 4 4
## Datsun 710 22.8 4 108 93 3.9 2.3 18.6 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.1 3.2 19.4 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.1 3.4 17.0 0 0 3 2
## Valiant 18.1 6 225 105 2.8 3.5 20.2 1 0 3 1
3.8 Pivoting (i.e., transposing) data
3.8.1 Condense multiple rows into a single column (pivot wide to long)
Rearranging data like this can make it easier to work with and analyze. Example below from my gradebook for stats (exported from Canvas), with fake names.
The command structure is as follows:
pivot_longer( # Transpose LENGTHWISE by....
cols = everything(), # Taking ALL variable names...
names_to="variable", # ...and dumping them into this new variable/column
values_to="missing_count") #...and placing their values in this other new column
NOTE!!! Pivoting data from wide to long like this expands the number of rows to make a matrix so that (for example, each student now has as a row for each assignment). Therefore, you can only pivot longways (or wide) ONCE, otherwise you will make duplicates.
If you need to pivot multiple columns, just include all of the columns in one single pivot; do not use two separate, back to back pivot commands.
Example:
=tibble::tribble(
gradebook~Student, ~Homework.1, ~Homework.2, ~Homework.3, ~Homework.4, ~Homework.5, ~Quiz.1, ~Quiz.2, ~Quiz.3, ~Quiz.4, ~Final,
"Bob", 19L, 0L, 13, 16, 0L, 21, 7L, 15, 17.5, 33,
"Jane", 17L, 19L, 16, 16.5, 25L, 21.5, 19L, 14.75, 9.5, 39.5,
"John", 19L, 19L, 14.5, 19.5, 25L, 21, 21L, 18.5, 17, 46.5
)
head(gradebook)
## # A tibble: 3 x 11
## Student Homework.1 Homework.2 Homework.3 Homework.4 Homework.5 Quiz.1 Quiz.2
## <chr> <int> <int> <dbl> <dbl> <int> <dbl> <int>
## 1 Bob 19 0 13 16 0 21 7
## 2 Jane 17 19 16 16.5 25 21.5 19
## 3 John 19 19 14.5 19.5 25 21 21
## # ... with 3 more variables: Quiz.3 <dbl>, Quiz.4 <dbl>, Final <dbl>
=gradebook %>%
gradebookpivot_longer( # Transpose lengthwise by:
cols = Homework.1:Final, # Taking these variables
names_to="Assignment", # ...and dumping them into this new variable, storing them lengthwise
values_to="Points") #...then place their values in this new column
%>% head() gradebook
## # A tibble: 6 x 3
## Student Assignment Points
## <chr> <chr> <dbl>
## 1 Bob Homework.1 19
## 2 Bob Homework.2 0
## 3 Bob Homework.3 13
## 4 Bob Homework.4 16
## 5 Bob Homework.5 0
## 6 Bob Quiz.1 21
3.9 Turn row names into a column/variable
Use the rownames()
command to turn row names into a variable
=rownames_to_column(mtcars, var = "car")
cars
as_tibble(cars) %>% slice(1:6)
## # A tibble: 6 x 12
## car mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 Hornet Sportab~ 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
3.10 How to edit/change column names
TWO WAYS TO DO THIS: Use colnames()
(for base R) or rename()
(for tidyverse)
colnames()
pulls up all the column/variable names as a vector. If you want to actually change them, you’ll need to combine this command with something like the sub() or gsub() commands (for base R). I’m going to skip this because…its base R.
To access and change the names faster via tidyverse, run use rename()
rm(list=ls()) # clear R's memory
%>% rename("hurr"="Sepal.Length",
iris "durr"="Sepal.Width",
"abcdefgh"="Species") %>%
head()
## hurr durr Petal.Length Petal.Width abcdefgh
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
If you need to do some really fancy conditional renaming (e.g., changing all variables that start with “r” to start with “rf” instead, to make it more clear that the prefix actually stands for “risk factor” rather than “reverse coded”), you’ll need to use rename_with()
.
This command has two parts to it: the data set, and the function you wish to apply to it (which you put after the ~
)
rename_with(iris, ~ gsub(pattern = ".", replacement = "_", .x, fixed = TRUE)) %>%
head()
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The gsub() function from Base R identifies matching patterns in the data and substitutes them with what you want instead. Think of it like R’s version of Find/Replace from Microsoft Word.
The above line of code thus does the following: 1. First, it checks the column names of the supplied data set (iris
) for a specific pattern (specified in pattern=
) 2. Then it replaces that pattern with your input in replacement=
The great thing about rename_with() is that the .fn (or ~
for short) can take ANY function as input. For example, if you want to add an element to the column names rather than replace something, (e.g., a prefix or suffix), you can change the function to:
rename_with( iris, ~ paste0(.x, "_text")) %>%
head()
## Sepal.Length_text Sepal.Width_text Petal.Length_text Petal.Width_text
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
## Species_text
## 1 setosa
## 2 setosa
## 3 setosa
## 4 setosa
## 5 setosa
## 6 setosa
The above line adds a suffix. You can also add a prefix in the exact same way, just by switching the order of the string and the pattern in the paste0 command.
Alternative method to the above This is a second way to do the above. It may appear more simple, but it’s also probably not as theoretically consistent with how the packages were made…..it uses the stringr
package to rename the column names, and stringr
is typically used for editing vectors of strings in a data set.
…so it works, but it’s a little unconventional because you call and edit the column names like you would a variable in your data set.
colnames(iris)=str_replace(colnames(iris), pattern = ".", replacement = "_")
In short: rename()
and rename_with()
are for renaming variables, as their names imply. The str_
verbs from the stringr package are for editing string-based variabels in your data set. Either works though with a little ingenuity.
3.11 Re-order columns in a data set
Use relocate()
to change column positions. If you need to move multiple columns at once, this command uses the same syntax as select()
.
# notice the column order mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
%>% relocate(hp:wt, .after= am) %>% head() mtcars
## mpg cyl disp qsec vs am hp drat wt gear carb
## Mazda RX4 21.0 6 160 16.46 0 1 110 3.90 2.620 4 4
## Mazda RX4 Wag 21.0 6 160 17.02 0 1 110 3.90 2.875 4 4
## Datsun 710 22.8 4 108 18.61 1 1 93 3.85 2.320 4 1
## Hornet 4 Drive 21.4 6 258 19.44 1 0 110 3.08 3.215 3 1
## Hornet Sportabout 18.7 8 360 17.02 0 0 175 3.15 3.440 3 2
## Valiant 18.1 6 225 20.22 1 0 105 2.76 3.460 3 1
3.12 Date and time variables
Formatting a column of dates can be extremely helpful if you need to work with time data, but also an extreme pain in the ass if it’s not stored correctly. This tutorial will be divided into two parts to cover both scenarios that you could encounter. It requires things to be done in two stages, and very precisely.
3.12.1 Date-time objects
If you’re lucky enough to have a vector of date-times, like what Qualtrics gives you, this will be brainless. Just do the following:
=tibble::tribble(~datetime,
example_datetime_data"2010-08-03 00:50:50",
"2010-08-04 01:40:50",
"2010-08-07 21:50:50")
head(example_datetime_data) # stored as character string
## # A tibble: 3 x 1
## datetime
## <chr>
## 1 2010-08-03 00:50:50
## 2 2010-08-04 01:40:50
## 3 2010-08-07 21:50:50
# Tidyverse
::as_date(example_datetime_data$datetime) lubridate
## [1] "2010-08-03" "2010-08-04" "2010-08-07"
3.12.2 Date-only objects
If you’re unlucky enough to have only dates, and said dates are written in the traditional x/x/xxxx format, this will be an annoyance that has to be done in two stages.
First, assuming your data is already imported and is being stored as a vector of character strings, you have to tell R to adjust the formatting of dates. You cannot change it from a character-based object into a Date or DateTime one until it recognizes the correct formatting.
=tibble::tribble(~X1, ~X2,
example_date_data"8/4/2021", -49.87,
"8/4/2021", -13.85,
"8/3/2021", -7.45,
"8/3/2021", -172.71)
# Correct formatting
$X1=format(as.POSIXct(example_date_data$X1,format='%m/%d/%Y'),format='%Y-%m-%d')
example_date_data
head(as_tibble(example_date_data))
## # A tibble: 4 x 2
## X1 X2
## <chr> <dbl>
## 1 2021-08-04 -49.9
## 2 2021-08-04 -13.8
## 3 2021-08-03 -7.45
## 4 2021-08-03 -173.
In the code above, note that there are two format
commands: The first one tells R how the date data is currently being stored, while the second at the end tells it how you want it to be stored. In this case, we are changing it from the way we would usually hand write a date (e.g., 10/26/1993) to a format commonly recognized and used in Excel and stats software (1993-10-26). If your column also has times in it, you also need to include that too!
Second, you can now correct the object’s structure. You can do this with base R’s as.Date()
or tidyverse’s date()
verbs.
# Tidyverse
$X1= lubridate::date(example_date_data$X1)
example_date_data
# Base R version
$X1=as.Date(example_date_data$X1) example_date_data
Notice how the object is now stored as the correct type in the table above.
NOTE! This entire process has been included in the tidy_date()
command in my package, legaldmlab
.
3.13 Reverse-code a variable
To reverse-score a variable, you should use car::recode()
Can be done a few different ways, depending on how many variables you’re looking to recode:
# Recode just one variable
$column=recode(df$column,"1 = 7 ; 2 = 6 ; 3 = 5 ; 5 = 3 ; 6 = 2 ; 7 = 1")
df
# Recode a select bunch of variables
=df %>% mutate(across(c(family_close : family_feelings), recode, "1 = 7 ; 2 = 6 ; 3 = 5 ; 5 = 3 ; 6 = 2 ; 7 = 1")) df
3.14 Dummy coding (the very fast and easy way)
Use dplyr’s pivot_wider
in conjunction with mutate
to very quickly and automatically dummy code a column with any number of unique values. The middle part of the code below is what you need…just copy and paste it, and tweak the specifics
library(tidyverse)
|>
mtcars mutate(car=rownames(mtcars)) |>
::mutate(n=1) |>
dplyr::pivot_wider(names_from = cyl,
tidyrvalues_from = n,
names_prefix = "number_cyl",
values_fill = list(n=0)) |>
select(car, starts_with("number_")) |> head() #truncate output for easier reading
## # A tibble: 6 x 4
## car number_cyl6 number_cyl4 number_cyl8
## <chr> <dbl> <dbl> <dbl>
## 1 Mazda RX4 1 0 0
## 2 Mazda RX4 Wag 1 0 0
## 3 Datsun 710 0 1 0
## 4 Hornet 4 Drive 1 0 0
## 5 Hornet Sportabout 0 0 1
## 6 Valiant 1 0 0
3.15 Create a relative ranking among several variables
If you want to create a variable that is an ordinal ranking of other variables, first you need to make sure your data is long-wise. Then, depending on the type of ranking system you want, you’ll might need a different ranking command….
The
min_rank
command from dplyr works in a manner similar to base R’srank
command. It ranks things like you see in sporting events. For example, if there is a clear winner in a game but 3 people tie for second place, the ranks would look like this: 1,2,2,2,4,5. Notice that the positions are independent from the counts.Using the same example from above, if you want the ranks to have no gaps (i.e. 1,2,2,2,3,4), you need to use dplyr’s
dense_rank
command.In either case, the ranks are generated from lowest to highest, so if you want to flip them around you’ll need to include
desc()
in the command.
=tibble::tribble(~name, ~score,
dat"bob", 0,
"bob", 5,
"bob", 50,
"bob", 50,
"bob", 50,
"bob", NA,
"alice", 70,
"alice", 80,
"alice", 90,
"alice", 20,
"alice", 20,
"alice", 1)
%>% mutate(ranked = dense_rank(desc(score))) dat
## # A tibble: 12 x 3
## name score ranked
## <chr> <dbl> <int>
## 1 bob 0 8
## 2 bob 5 6
## 3 bob 50 4
## 4 bob 50 4
## 5 bob 50 4
## 6 bob NA NA
## 7 alice 70 3
## 8 alice 80 2
## 9 alice 90 1
## 10 alice 20 5
## 11 alice 20 5
## 12 alice 1 7
3.17 Wrangling Lists
3.17.1 Nesting
Imagine the concept of Russian Dolls, applied to data sets. You can manage data sets more effectively my collapsing them into a single tiny, mini data frame, and stuffing that inside of another one. This is done via “nesting”…
Effectively, you smush/collapse everything down so it fits inside one column. You can unnest to expand this data back out later when you need it, and keep it collapsed when you don’t. This works because a vector/column in a data frame is a list of a defined length; and a data frame is thus simply a collection of lists that are all the same length.
You can store anything in a data frame. You can keep the df connected to the model, which makes it very easy to manage a whole slew of related models
You can use functional programming (i.e., iterative functions) to map functions or combinations of functions in new ways. Moreover and more importantly, when you use
purrr
to map functions onto multiple models or objects simultaneously, you’re doing it to all of them at once with a single command, and the objects are kept together while you do it. This limits the mistakes you can make (e.g., copying and pasting code and forgetting to tweak something important; applying a function to the wrong object or set of objects by accident), and also reduces unnecessary code in your script.Converting data into tidy data sets gives you a whole new way (and easier way) to manage lots of information
head(mtcars |> nest(crap=vs:carb))
## # A tibble: 6 x 8
## mpg cyl disp hp drat wt qsec crap
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
## 1 21 6 160 110 3.9 2.62 16.5 <tibble [1 x 4]>
## 2 21 6 160 110 3.9 2.88 17.0 <tibble [1 x 4]>
## 3 22.8 4 108 93 3.85 2.32 18.6 <tibble [1 x 4]>
## 4 21.4 6 258 110 3.08 3.22 19.4 <tibble [1 x 4]>
## 5 18.7 8 360 175 3.15 3.44 17.0 <tibble [1 x 4]>
## 6 18.1 6 225 105 2.76 3.46 20.2 <tibble [1 x 4]>