Chapter 3 Wrangle Data

This chapter contains useful tips on wrangling (i.e., manipulating) data. If you need to know to do to things like create new variables, split one variable into multiple variables, pivot a data set from wide to long, etc., look no further.

If you want a pretty good intro tutorial to the dplyr package, click here

3.1 Joining or Splitting

Joining and splitting data is pretty straightforward….

3.1.1 Whole Data Sets

The code below is from this excellent tutorial

set.seed(2018)

df1=data.frame(customer_id=c(1:10),
               product=sample(c('toaster','TV','Dishwasher'),10,replace = TRUE))


df2=data.frame(customer_id=c(sample(df1$customer_id, 5)),state=sample(c('New York','California'),5,replace = TRUE))

df1=tibble::as_tibble(df1)
df2=tibble::as_tibble(df2)

# df1 =left table
# df2= right table

Inner join - retains only rows with values that appear in both tables, and matches by keys.

If you’re joining two Qualtrics surveys together, this is most likely the one you want to use (e.g. matching by participant name, and only keeping rows in the joined data set for participants that have responses logged in both survey 1 and survey 2

df1 %>% inner_join(df2,by='customer_id')
## # A tibble: 5 x 3
##   customer_id product    state   
##         <int> <chr>      <chr>   
## 1           1 Dishwasher New York
## 2           3 Dishwasher New York
## 3           6 toaster    New York
## 4           8 Dishwasher New York
## 5           9 Dishwasher New York

Left join - returns everything in the left, and rows with matching keys in the right

df1 %>% left_join(df2,by='customer_id')
## # A tibble: 10 x 3
##    customer_id product    state   
##          <int> <chr>      <chr>   
##  1           1 Dishwasher New York
##  2           2 Dishwasher <NA>    
##  3           3 Dishwasher New York
##  4           4 toaster    <NA>    
##  5           5 TV         <NA>    
##  6           6 toaster    New York
##  7           7 toaster    <NA>    
##  8           8 Dishwasher New York
##  9           9 Dishwasher New York
## 10          10 TV         <NA>

Right join - returns everything in the right, and rows with matching keys in the left

df1 %>% right_join(df2,by='customer_id')
## # A tibble: 5 x 3
##   customer_id product    state   
##         <int> <chr>      <chr>   
## 1           1 Dishwasher New York
## 2           3 Dishwasher New York
## 3           6 toaster    New York
## 4           8 Dishwasher New York
## 5           9 Dishwasher New York
# note: example if the customer id column was named something different in the second df
    #df1 %>% left_join(df2,by=c('customer_id'='name2'))

Full join - retain all rows from both tables, and join matching keys in both right and left

df1 %>% full_join(df2,by='customer_id')
## # A tibble: 10 x 3
##    customer_id product    state   
##          <int> <chr>      <chr>   
##  1           1 Dishwasher New York
##  2           2 Dishwasher <NA>    
##  3           3 Dishwasher New York
##  4           4 toaster    <NA>    
##  5           5 TV         <NA>    
##  6           6 toaster    New York
##  7           7 toaster    <NA>    
##  8           8 Dishwasher New York
##  9           9 Dishwasher New York
## 10          10 TV         <NA>

Anti join - returns all rows in the left that do not have matching keys in the right

df1 %>% anti_join(df2,by='customer_id')
## # A tibble: 5 x 2
##   customer_id product   
##         <int> <chr>     
## 1           2 Dishwasher
## 2           4 toaster   
## 3           5 TV        
## 4           7 toaster   
## 5          10 TV

3.1.2 Individual Columns/Variables

Splitting or joining columns is much easier than doing it to whole data sets. You can use dplyr::separate() to accomplish the former, and dplyr::unite() for the latter.

print("hello")
## [1] "hello"

3.2 Selecting/extracting specific variables with select()

Sometimes when working with a data set, you want to work with a few specific variables. For instance, maybe you want to view a graph of only reverse-coded variables (which start with the prefix “r”); or maybe you want to create a subset of your data that has a few specific variables removed. For this you can use dplyr::select() and its associated helper commands

select() can be thought of as “extract”; it tells R to identify and “extract” a specific variable (or variables)

cars=mtcars

# select one column
cars %>% select(mpg)

# select multiple columns, if they are all next to one another
cars %>% select(mpg:hp)

# select multiple columns by name (when not next to one another) by defining them in a vector
cars %>% select(c(mpg, hp, wt))

# select only variables that start with a certain prefix/character/pattern/etc.
cars %>% select(starts_with("d"))

# ...or columns that end with a certain prefix/etc.
cars %>% select(ends_with("t"))

# ...or contains a certain pattern or string
cars %>% select(contains("se"))

# select ALL OF the variables in a data set that match those of a pre-defined vector
  
  # first define the names in a vector
  vars=c("hp", "drat", "gear", "carb")
  
  #now use helper
  cars %>% select(all_of(vars))
  
# select ANY OF the variables in a pre-defined vector
  
  vars_2=c("hp", "drat", "watermelon", "grilled_cheese") # only the first two will be in the data
  
  cars %>% select(any_of(vars_2)) # only (and all of) the variables actually PRESENT in the data are pulled
  
# select only variables of a certain class or type
  cars %>% select(where(is.numeric))
  cars %>% select(where(is.character))

Other examples can be seen on THIS LINK for a simple but detailed guide.

3.3 Advanced Filtering techniques

All info here and the code taken from this link; credit goes to Suzan Baert.

3.3.1 Filter based on partial match

Combine filter with str_detect to search for pattern matches in a column

3.3.2 Filtering based on multiple conditions

You can filter for more than one value at once, or combine operators to do searches with conditions….

Goal Code
Return only rows where both conditions are met filter(condition1, condition2)
Return all rows where condition 1 is true but condition 2 is not filter(condition1, !condition2)
Return rows where condition 1 and/or condition 2 is met filter(condition1 | condition2)
Return all rows where only one of the conditions is met, and not when both are met filter(xor(condition1, condition2)

The sample code will return all rows with a bodywt above 100 and either have a sleep_total above 15 or are not part of the Carnivora order.

msleep %>% 
  select(name, order, sleep_total:bodywt) %>% 
  filter(bodywt > 100, (sleep_total > 15 | order != "Carnivora"))

3.3.3 Filtering across multiple columns

From Susan Baert’s blog page:

You have three options for how to do this:

  • filter_all(), which filters columns based on your further instructions

  • filter_if(), which requires a boolean to indicate which columns to filter on. If that is true, the filter instructions will be followed.

  • filter_at(), which requires you to specify columns inside a vars argument for which the filtering will be done.

Retain all rows with the pattern match “Ca” inside. Useful for when you want to search for a key pattern (or number) across multiple columns.

msleep %>% 
  select(name:order, sleep_total, -vore) %>% 
  filter_all(any_vars(str_detect(., pattern = "Ca")))

msleep %>%  
  select(name, sleep_total:bodywt) %>% 
  filter_all(any_vars(. < 0.1))

# You can also switch any_vars to all_vars to filter across the whole data frame
msleep %>%  
  select(name, sleep_total:bodywt, -awake) %>% 
  filter_all(all_vars(. > 1))

What if you want to filter specific column types in your data frame, like just date columns, to find a specific date across multiple date columns? Or filter across multiple number columns to find every instance of a specific number? filter_if is better than filter_all here because the latter would return a filter search across the whole data frame. By using filter_if, we can get more specific about which columns to search through.

msleep %>% 
  select(name:order, sleep_total:sleep_rem) %>% 
  filter_if(is.character, any_vars(is.na(.)))

The last command, filter_at, does things differently. It doesn’t filter all columns at once; and you do not have to specify the column type or which columns. This one allows you to indicate which columns to search through like you would within any select statement.

Example: Search through the columns that start with “sleep”, for values in all of those columns above .5.

msleep %>% 
  select(name, sleep_total:sleep_rem, brainwt:bodywt) %>% 
  filter_at(vars(contains("sleep")), all_vars(.>5))

3.4 If-then and Case-when

3.4.1 If-then

The premise of an if/then or if/else statement is simple: “If condition 1 is satisfied, perform x operation; if not, then do y”

mtcars %>% mutate(power_level=ifelse(mtcars$hp<350, "Low", "High")) %>% head()
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb power_level
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         Low
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         Low
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         Low
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1         Low
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2         Low
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1         Low

This line of code effectively says: if the length in Sepal.Length is >5, set new variable = to “short”; else, set it to “long”

3.4.2 Case-when

When you have 3+ conditions, it’s easier to use case-when. This is a more simple and straightforward approach than nesting multiple if-else commands

My_vector= case_when(
    Condition1 ~ value1,
    Condition2 ~ value2,
    Condition3 ~ value3
    TRUE ~ valueForEverythingElse #catch all for things that don't meet the above conditions
    )

Example:

mtcars %>% mutate(size= case_when(cyl==4 ~ "small",
                                  cyl==6 ~ "medium",
                                  cyl==8 ~ "large")) %>% 
  select(c(cyl,size)) %>% head()
##                   cyl   size
## Mazda RX4           6 medium
## Mazda RX4 Wag       6 medium
## Datsun 710          4  small
## Hornet 4 Drive      6 medium
## Hornet Sportabout   8  large
## Valiant             6 medium

3.5 Conditional replacement of values

The following code is useful if you want to replace a value in one column, and the replacement is conditional upon the value in another column.

mpg %>% 
  mutate(across(.cols = c(displ, cty, hwy),
                .fns = ~case_when(cyl == 4L ~ as.numeric(NA),
                                  TRUE ~ as.numeric(.x))))
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans    drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>    <chr> <dbl> <dbl> <chr> <chr>
##  1 audi         a4          NA    1999     4 auto(l5) f        NA    NA p     comp~
##  2 audi         a4          NA    1999     4 manual(~ f        NA    NA p     comp~
##  3 audi         a4          NA    2008     4 manual(~ f        NA    NA p     comp~
##  4 audi         a4          NA    2008     4 auto(av) f        NA    NA p     comp~
##  5 audi         a4           2.8  1999     6 auto(l5) f        16    26 p     comp~
##  6 audi         a4           2.8  1999     6 manual(~ f        18    26 p     comp~
##  7 audi         a4           3.1  2008     6 auto(av) f        18    27 p     comp~
##  8 audi         a4 quattro  NA    1999     4 manual(~ 4        NA    NA p     comp~
##  9 audi         a4 quattro  NA    1999     4 auto(l5) 4        NA    NA p     comp~
## 10 audi         a4 quattro  NA    2008     4 manual(~ 4        NA    NA p     comp~
## # ... with 224 more rows
test %>% 
  mutate(across(.cols = c(rank),
                .fns = ~case_when(is.na(participant_score) ~ as.numeric(NA),
                                  TRUE ~ as.numeric(.x))))

3.6 Merging variables

Sometimes you’ll have multiple variables and you want to collapse them into a single variable. The pmin() command is useful for this.

example_data=tribble(~A,~B,~C,
                     1,NA,NA,
                     2,NA,NA,
                     3,NA,NA,
                     NA,4,NA,
                     NA,5,NA,
                     NA,6,NA,
                     NA,NA,7,
                     NA,NA,8,
                     NA,NA,9)

example_data %>%
  mutate(accept_reject = 
           pmin(A,B,C,na.rm = TRUE))

3.7 Apply a function to multiple variables at once

You can either specify each column individually, like above, or tell R to identify columns for you based on their type or their name. This requires adding in one additional verb–either contains() or where() depending on what you want to do.

Two simple examples:

# turn multiple variables into factors
ex_data=dplyr::tribble(~color, ~car,
                       "red", "corvette",
                       "blue", "chevelle",
                       "green", "camaro",
                       "red", "corvette",
                       "green", "chevelle",
                       "yellow", "gto")

dplyr::glimpse(ex_data)
## Rows: 6
## Columns: 2
## $ color <chr> "red", "blue", "green", "red", "green", "yellow"
## $ car   <chr> "corvette", "chevelle", "camaro", "corvette", "chevelle", "gto"
ex_data %>% mutate(across(c(color, car),factor))
## # A tibble: 6 x 2
##   color  car     
##   <fct>  <fct>   
## 1 red    corvette
## 2 blue   chevelle
## 3 green  camaro  
## 4 red    corvette
## 5 green  chevelle
## 6 yellow gto
# round multiple columns to 1 decimal place
mtcars %>% mutate(across(c(disp:qsec),round,1)) %>% head()
##                    mpg cyl disp  hp drat  wt qsec vs am gear carb
## Mazda RX4         21.0   6  160 110  3.9 2.6 16.5  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110  3.9 2.9 17.0  0  1    4    4
## Datsun 710        22.8   4  108  93  3.9 2.3 18.6  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110  3.1 3.2 19.4  1  0    3    1
## Hornet Sportabout 18.7   8  360 175  3.1 3.4 17.0  0  0    3    2
## Valiant           18.1   6  225 105  2.8 3.5 20.2  1  0    3    1

3.8 Pivoting (i.e., transposing) data

3.8.1 Condense multiple rows into a single column (pivot wide to long)

Rearranging data like this can make it easier to work with and analyze. Example below from my gradebook for stats (exported from Canvas), with fake names.

The command structure is as follows:

 pivot_longer( # Transpose LENGTHWISE by....
    cols = everything(), # Taking ALL variable names...
    names_to="variable", # ...and dumping them into this new variable/column
    values_to="missing_count") #...and placing their values in this other new column

NOTE!!! Pivoting data from wide to long like this expands the number of rows to make a matrix so that (for example, each student now has as a row for each assignment). Therefore, you can only pivot longways (or wide) ONCE, otherwise you will make duplicates.

If you need to pivot multiple columns, just include all of the columns in one single pivot; do not use two separate, back to back pivot commands.

Example:

gradebook=tibble::tribble(
  ~Student, ~Homework.1, ~Homework.2, ~Homework.3, ~Homework.4, ~Homework.5, ~Quiz.1, ~Quiz.2, ~Quiz.3, ~Quiz.4, ~Final,
     "Bob",         19L,          0L,          13,          16,          0L,      21,      7L,      15,    17.5,     33,
    "Jane",         17L,         19L,          16,        16.5,         25L,    21.5,     19L,   14.75,     9.5,   39.5,
    "John",         19L,         19L,        14.5,        19.5,         25L,      21,     21L,    18.5,      17,   46.5
  )

head(gradebook)
## # A tibble: 3 x 11
##   Student Homework.1 Homework.2 Homework.3 Homework.4 Homework.5 Quiz.1 Quiz.2
##   <chr>        <int>      <int>      <dbl>      <dbl>      <int>  <dbl>  <int>
## 1 Bob             19          0       13         16            0   21        7
## 2 Jane            17         19       16         16.5         25   21.5     19
## 3 John            19         19       14.5       19.5         25   21       21
## # ... with 3 more variables: Quiz.3 <dbl>, Quiz.4 <dbl>, Final <dbl>
gradebook=gradebook %>% 
   pivot_longer( # Transpose lengthwise by:
    cols = Homework.1:Final, # Taking these variables
    names_to="Assignment", # ...and dumping them into this new variable, storing them lengthwise
    values_to="Points") #...then place their values in this new column

gradebook %>% head()
## # A tibble: 6 x 3
##   Student Assignment Points
##   <chr>   <chr>       <dbl>
## 1 Bob     Homework.1     19
## 2 Bob     Homework.2      0
## 3 Bob     Homework.3     13
## 4 Bob     Homework.4     16
## 5 Bob     Homework.5      0
## 6 Bob     Quiz.1         21

3.9 Turn row names into a column/variable

Use the rownames() command to turn row names into a variable

cars=rownames_to_column(mtcars, var = "car")

as_tibble(cars) %>% slice(1:6)
## # A tibble: 6 x 12
##   car               mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>           <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4        21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda RX4 Wag    21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsun 710       22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4 Hornet 4 Drive   21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5 Hornet Sportab~  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6 Valiant          18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

3.10 How to edit/change column names

TWO WAYS TO DO THIS: Use colnames() (for base R) or rename() (for tidyverse)

colnames() pulls up all the column/variable names as a vector. If you want to actually change them, you’ll need to combine this command with something like the sub() or gsub() commands (for base R). I’m going to skip this because…its base R.

To access and change the names faster via tidyverse, run use rename()

rm(list=ls()) # clear R's memory

iris %>% rename("hurr"="Sepal.Length",
                "durr"="Sepal.Width",
                "abcdefgh"="Species") %>% 
  head()
##   hurr durr Petal.Length Petal.Width abcdefgh
## 1  5.1  3.5          1.4         0.2   setosa
## 2  4.9  3.0          1.4         0.2   setosa
## 3  4.7  3.2          1.3         0.2   setosa
## 4  4.6  3.1          1.5         0.2   setosa
## 5  5.0  3.6          1.4         0.2   setosa
## 6  5.4  3.9          1.7         0.4   setosa

If you need to do some really fancy conditional renaming (e.g., changing all variables that start with “r” to start with “rf” instead, to make it more clear that the prefix actually stands for “risk factor” rather than “reverse coded”), you’ll need to use rename_with().

This command has two parts to it: the data set, and the function you wish to apply to it (which you put after the ~)

rename_with(iris, ~ gsub(pattern = ".", replacement = "_", .x, fixed = TRUE)) %>% 
  head()
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The gsub() function from Base R identifies matching patterns in the data and substitutes them with what you want instead. Think of it like R’s version of Find/Replace from Microsoft Word.

The above line of code thus does the following: 1. First, it checks the column names of the supplied data set (iris) for a specific pattern (specified in pattern= ) 2. Then it replaces that pattern with your input in replacement=

The great thing about rename_with() is that the .fn (or ~ for short) can take ANY function as input. For example, if you want to add an element to the column names rather than replace something, (e.g., a prefix or suffix), you can change the function to:

rename_with( iris, ~ paste0(.x,  "_text")) %>% 
  head()
##   Sepal.Length_text Sepal.Width_text Petal.Length_text Petal.Width_text
## 1               5.1              3.5               1.4              0.2
## 2               4.9              3.0               1.4              0.2
## 3               4.7              3.2               1.3              0.2
## 4               4.6              3.1               1.5              0.2
## 5               5.0              3.6               1.4              0.2
## 6               5.4              3.9               1.7              0.4
##   Species_text
## 1       setosa
## 2       setosa
## 3       setosa
## 4       setosa
## 5       setosa
## 6       setosa

The above line adds a suffix. You can also add a prefix in the exact same way, just by switching the order of the string and the pattern in the paste0 command.

Alternative method to the above This is a second way to do the above. It may appear more simple, but it’s also probably not as theoretically consistent with how the packages were made…..it uses the stringr package to rename the column names, and stringr is typically used for editing vectors of strings in a data set.

…so it works, but it’s a little unconventional because you call and edit the column names like you would a variable in your data set.

colnames(iris)=str_replace(colnames(iris), pattern = ".", replacement = "_")

In short: rename() and rename_with() are for renaming variables, as their names imply. The str_ verbs from the stringr package are for editing string-based variabels in your data set. Either works though with a little ingenuity.

3.11 Re-order columns in a data set

Use relocate() to change column positions. If you need to move multiple columns at once, this command uses the same syntax as select().

mtcars # notice the column order
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
mtcars %>% relocate(hp:wt, .after= am) %>% head()
##                    mpg cyl disp  qsec vs am  hp drat    wt gear carb
## Mazda RX4         21.0   6  160 16.46  0  1 110 3.90 2.620    4    4
## Mazda RX4 Wag     21.0   6  160 17.02  0  1 110 3.90 2.875    4    4
## Datsun 710        22.8   4  108 18.61  1  1  93 3.85 2.320    4    1
## Hornet 4 Drive    21.4   6  258 19.44  1  0 110 3.08 3.215    3    1
## Hornet Sportabout 18.7   8  360 17.02  0  0 175 3.15 3.440    3    2
## Valiant           18.1   6  225 20.22  1  0 105 2.76 3.460    3    1

3.12 Date and time variables

Formatting a column of dates can be extremely helpful if you need to work with time data, but also an extreme pain in the ass if it’s not stored correctly. This tutorial will be divided into two parts to cover both scenarios that you could encounter. It requires things to be done in two stages, and very precisely.

3.12.1 Date-time objects

If you’re lucky enough to have a vector of date-times, like what Qualtrics gives you, this will be brainless. Just do the following:

example_datetime_data=tibble::tribble(~datetime,
                                      "2010-08-03 00:50:50",
                                      "2010-08-04 01:40:50",
                                      "2010-08-07 21:50:50")

head(example_datetime_data) # stored as character string
## # A tibble: 3 x 1
##   datetime           
##   <chr>              
## 1 2010-08-03 00:50:50
## 2 2010-08-04 01:40:50
## 3 2010-08-07 21:50:50
# Tidyverse
lubridate::as_date(example_datetime_data$datetime)
## [1] "2010-08-03" "2010-08-04" "2010-08-07"

3.12.2 Date-only objects

If you’re unlucky enough to have only dates, and said dates are written in the traditional x/x/xxxx format, this will be an annoyance that has to be done in two stages.

First, assuming your data is already imported and is being stored as a vector of character strings, you have to tell R to adjust the formatting of dates. You cannot change it from a character-based object into a Date or DateTime one until it recognizes the correct formatting.

example_date_data=tibble::tribble(~X1,     ~X2,
                                  "8/4/2021",  -49.87,
                                  "8/4/2021",  -13.85,
                                  "8/3/2021",   -7.45,
                                  "8/3/2021", -172.71)

# Correct formatting
example_date_data$X1=format(as.POSIXct(example_date_data$X1,format='%m/%d/%Y'),format='%Y-%m-%d')

head(as_tibble(example_date_data))
## # A tibble: 4 x 2
##   X1              X2
##   <chr>        <dbl>
## 1 2021-08-04  -49.9 
## 2 2021-08-04  -13.8 
## 3 2021-08-03   -7.45
## 4 2021-08-03 -173.

In the code above, note that there are two format commands: The first one tells R how the date data is currently being stored, while the second at the end tells it how you want it to be stored. In this case, we are changing it from the way we would usually hand write a date (e.g., 10/26/1993) to a format commonly recognized and used in Excel and stats software (1993-10-26). If your column also has times in it, you also need to include that too!

Second, you can now correct the object’s structure. You can do this with base R’s as.Date() or tidyverse’s date() verbs.

# Tidyverse
example_date_data$X1= lubridate::date(example_date_data$X1) 

# Base R version
example_date_data$X1=as.Date(example_date_data$X1)

Notice how the object is now stored as the correct type in the table above.

NOTE! This entire process has been included in the tidy_date() command in my package, legaldmlab.

3.12.3 Find the difference between two dates/times

difftime(part_1$end_date[1], part_2$end_date[1], units="days")

3.13 Reverse-code a variable

To reverse-score a variable, you should use car::recode() Can be done a few different ways, depending on how many variables you’re looking to recode:

# Recode just one variable
df$column=recode(df$column,"1 = 7 ; 2 = 6 ; 3 = 5 ; 5 = 3 ; 6 = 2 ; 7 = 1")

# Recode a select bunch of variables
df=df %>% mutate(across(c(family_close : family_feelings), recode, "1 = 7 ; 2 = 6 ; 3 = 5 ; 5 = 3 ; 6 = 2 ; 7 = 1"))

3.14 Dummy coding (the very fast and easy way)

Use dplyr’s pivot_wider in conjunction with mutate to very quickly and automatically dummy code a column with any number of unique values. The middle part of the code below is what you need…just copy and paste it, and tweak the specifics

library(tidyverse)

mtcars |> 
  mutate(car=rownames(mtcars)) |> 
  
  dplyr::mutate(n=1) |> 
  tidyr::pivot_wider(names_from = cyl,
              values_from = n,
              names_prefix = "number_cyl",
              values_fill = list(n=0)) |> 
  
  select(car, starts_with("number_")) |> head() #truncate output for easier reading
## # A tibble: 6 x 4
##   car               number_cyl6 number_cyl4 number_cyl8
##   <chr>                   <dbl>       <dbl>       <dbl>
## 1 Mazda RX4                   1           0           0
## 2 Mazda RX4 Wag               1           0           0
## 3 Datsun 710                  0           1           0
## 4 Hornet 4 Drive              1           0           0
## 5 Hornet Sportabout           0           0           1
## 6 Valiant                     1           0           0

3.15 Create a relative ranking among several variables

If you want to create a variable that is an ordinal ranking of other variables, first you need to make sure your data is long-wise. Then, depending on the type of ranking system you want, you’ll might need a different ranking command….

  • The min_rank command from dplyr works in a manner similar to base R’s rank command. It ranks things like you see in sporting events. For example, if there is a clear winner in a game but 3 people tie for second place, the ranks would look like this: 1,2,2,2,4,5. Notice that the positions are independent from the counts.

  • Using the same example from above, if you want the ranks to have no gaps (i.e. 1,2,2,2,3,4), you need to use dplyr’s dense_rank command.

  • In either case, the ranks are generated from lowest to highest, so if you want to flip them around you’ll need to include desc() in the command.

dat=tibble::tribble(~name, ~score,
                    "bob", 0,
                    "bob", 5,
                    "bob", 50,
                    "bob", 50,
                    "bob", 50,
                    "bob", NA,
                    "alice", 70,
                    "alice", 80,
                    "alice", 90,
                    "alice", 20,
                    "alice", 20,
                    "alice", 1)

dat %>% mutate(ranked = dense_rank(desc(score)))
## # A tibble: 12 x 3
##    name  score ranked
##    <chr> <dbl>  <int>
##  1 bob       0      8
##  2 bob       5      6
##  3 bob      50      4
##  4 bob      50      4
##  5 bob      50      4
##  6 bob      NA     NA
##  7 alice    70      3
##  8 alice    80      2
##  9 alice    90      1
## 10 alice    20      5
## 11 alice    20      5
## 12 alice     1      7

3.16 Manipulating the working environment and many things at once

3.16.1 Stuff the WHOLE working environment into a list

files=mget(ls())

3.16.2 Extract everything from a list into the environment

list2env(cog_data, globalenv())

3.16.3 Delete everything in the entire environment, except for one item

rm(list=setdiff(ls(), "cog_data")) # delete everything in the local environment except the final data set

3.17 Wrangling Lists

3.17.1 Nesting

Imagine the concept of Russian Dolls, applied to data sets. You can manage data sets more effectively my collapsing them into a single tiny, mini data frame, and stuffing that inside of another one. This is done via “nesting”…

Effectively, you smush/collapse everything down so it fits inside one column. You can unnest to expand this data back out later when you need it, and keep it collapsed when you don’t. This works because a vector/column in a data frame is a list of a defined length; and a data frame is thus simply a collection of lists that are all the same length.

  • You can store anything in a data frame. You can keep the df connected to the model, which makes it very easy to manage a whole slew of related models

  • You can use functional programming (i.e., iterative functions) to map functions or combinations of functions in new ways. Moreover and more importantly, when you use purrr to map functions onto multiple models or objects simultaneously, you’re doing it to all of them at once with a single command, and the objects are kept together while you do it. This limits the mistakes you can make (e.g., copying and pasting code and forgetting to tweak something important; applying a function to the wrong object or set of objects by accident), and also reduces unnecessary code in your script.

  • Converting data into tidy data sets gives you a whole new way (and easier way) to manage lots of information

head(mtcars |> nest(crap=vs:carb))
## # A tibble: 6 x 8
##     mpg   cyl  disp    hp  drat    wt  qsec crap            
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>          
## 1  21       6   160   110  3.9   2.62  16.5 <tibble [1 x 4]>
## 2  21       6   160   110  3.9   2.88  17.0 <tibble [1 x 4]>
## 3  22.8     4   108    93  3.85  2.32  18.6 <tibble [1 x 4]>
## 4  21.4     6   258   110  3.08  3.22  19.4 <tibble [1 x 4]>
## 5  18.7     8   360   175  3.15  3.44  17.0 <tibble [1 x 4]>
## 6  18.1     6   225   105  2.76  3.46  20.2 <tibble [1 x 4]>

3.17.2 Combining/collapsing list levels (Reducing)

demo_vars=files |> 
  map(import_spss) |> 
  reduce(left_join, by="catieid")