Chapter 9 Misc. Stuff

9.1 Scrape web pages for data tables

Note. See Chapter 10’s example purrr walk through for a guide on how to scrape multiple web tables simultaneously

Simple example.

library(rvest)

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(tidyverse)

html=read_html('https://shop.tcgplayer.com/price-guide/pokemon/base-set') %>% 
  html_table(fill = TRUE)

html

## [[1]]
## # A tibble: 101 x 6
##    PRODUCT   Rarity    Number  `Market Price` `Listed Median` ``   
##    <chr>     <chr>     <chr>   <chr>          <chr>           <chr>
##  1 Abra      Common    043/102 $0.39          $0.46           View 
##  2 Alakazam  Holo Rare 001/102 $40.64         —               View 
##  3 Arcanine  Uncommon  023/102 $2.94          $2.41           View 
##  4 Beedrill  Rare      017/102 $3.12          $3.20           View 
##  5 Bill      Common    091/102 $0.25          $0.30           View 
##  6 Blastoise Holo Rare 002/102 $119.59        —               View 
##  7 Bulbasaur Common    044/102 $1.39          $2.21           View 
##  8 Caterpie  Common    045/102 $0.70          $0.75           View 
##  9 Chansey   Holo Rare 003/102 $25.35         —               View 
## 10 Charizard Holo Rare 004/102 $348.99        —               View 
## # ... with 91 more rows

# Saved as a list by default. Now extract your table from said list
html=as_tibble(html[[1]] %>% # find out which number it is in the list
                 select('PRODUCT','Rarity','Number','Market Price')) # if needed, specify which columns you want too

html

## # A tibble: 101 x 4
##    PRODUCT   Rarity    Number  `Market Price`
##    <chr>     <chr>     <chr>   <chr>         
##  1 Abra      Common    043/102 $0.39         
##  2 Alakazam  Holo Rare 001/102 $40.64        
##  3 Arcanine  Uncommon  023/102 $2.94         
##  4 Beedrill  Rare      017/102 $3.12         
##  5 Bill      Common    091/102 $0.25         
##  6 Blastoise Holo Rare 002/102 $119.59       
##  7 Bulbasaur Common    044/102 $1.39         
##  8 Caterpie  Common    045/102 $0.70         
##  9 Chansey   Holo Rare 003/102 $25.35        
## 10 Charizard Holo Rare 004/102 $348.99       
## # ... with 91 more rows

# remove $ symbol in Price column to make it easier to work with
html$`Market Price`=str_remove(html$`Market Price`, pattern = "\\$")
  
html=html %>%  mutate(`Market Price`=as.numeric(`Market Price`)) # convert from string to numeric

# view finished table
head(html)

## # A tibble: 6 x 4
##   PRODUCT   Rarity    Number  `Market Price`
##   <chr>     <chr>     <chr>            <dbl>
## 1 Abra      Common    043/102           0.39
## 2 Alakazam  Holo Rare 001/102          40.6 
## 3 Arcanine  Uncommon  023/102           2.94
## 4 Beedrill  Rare      017/102           3.12
## 5 Bill      Common    091/102           0.25
## 6 Blastoise Holo Rare 002/102         120.

Slightly more complicated example

Reading a table into R takes a few steps.

Step 1 is to copy and paste the URL into the read_html() verb like below:

pacman::p_load(rvest, tidyverse)

exonerations_table=read_html("https://www.law.umich.edu/special/exoneration/Pages/detaillist.aspx") %>% 
  html_nodes("table.ms-listviewtable") %>% 
  html_table(fill=TRUE, header = TRUE)

Sometimes if the web page is extremely basic and pretty much the only thing on it is a table, you can stop there. Most of the time though, there will be tons of other stuff on the website and you need to get more specific so R can find the table. This is the html_nodes() part of the above command; in there you specify the exact part of the web page where the table is located/what object file it is.

To find this you will need to use the Developer mode in your browser. See this screenshot for an example…

knitr::include_graphics(here::here("pics", "scrape.png"))

In Firefox you open this by going to Settings > More Tools > Web Developer Tools (or CNTRL + Shift + I).

Begin by looking through the console in the center bottom for names that look like they would be related to your table. A good place to start might be “

“, which contains the main body of the web page. Click on a name to expand it and see all the elements on the page contained there.

Ultimately what you’re looking for is what you see above: an element that, when selected, highlights ONLY the area of the web page you’re looking for. To get at this you will need to keep expanding, highlighting, and clicking repeatedly….it can take some digging.

Keep drilling down through page elements until you find the one that highlights the table and just the table. When you find this, look for the .ms file in that name; you should also see this in the smaller console box on the right. That is the file you’ll need. Write that name in the html_node command and read it into R.

That’s stage 1. From here you now need to clean up the table.

exonerations_table=as.data.frame(exonerations_table) # convert into a df

Your table might be different, but this one’s names were messed up when read in, so lets fix those first and then fix the rows and columns.

# save the names to a vector
table_names=exonerations_table$Last.Name[1:20]

# Trim out the garbage rows and columns
exonerations_table=exonerations_table %>% 
  select(Last.Name:Tags.1) %>% 
  slice(22:n())

# over-write incorrect col names with the vector of correct ones we saved above
colnames(exonerations_table)=table_names

# clean up names
exonerations_table=exonerations_table %>% janitor::clean_names()

# verify structure of columns is correct
# glimpse(exonerations_table)

Yikes, a lot of stuff is stored incorrectly, and as a result there’s some missing values that need to be addressed and other data that needs to be corrected.

exonerations_table=as_tibble(exonerations_table) %>% # convert to tibble
  mutate(across(c(dna,mwid:ild), na_if,"")) %>% # turn missing values into NA's
  mutate(across(c(dna,mwid:ild), replace_na, "derp")) %>% # replace NA's with a string (required for the next lines to work)
  mutate(dna=ifelse(dna=="DNA",1,0), # change these variables from text to numeric to better facilitate analysis
         mwid=ifelse(mwid=="MWID",1,0),
         fc=ifelse(fc=="FC",1,0),
         p_fa=ifelse(p_fa=="P/FA",1,0),
         f_mfe=ifelse(f_mfe=="F/MFE",1,0)) %>% 
  mutate(across(c(st, crime, dna:f_mfe),factor)) # correct form by converting to factors

And that’s it! Check out final result!

head(exonerations_table)

## # A tibble: 6 x 20
##   last_name first_name     age race     st    county_of_crime tags   om_tags  crime sentence convicted exonerated dna   mwid  fc    p_fa  f_mfe om    ild  
##   <chr>     <chr>        <int> <chr>    <fct> <chr>           <chr>  <chr>    <fct> <chr>        <int>      <int> <fct> <fct> <fct> <fct> <fct> <chr> <chr>
## 1 Abbott    Cinque          19 Black    IL    Cook            CIU, ~ "OF, WH~ Drug~ Probati~      2008       2022 0     0     0     1     0     OM    derp 
## 2 Abdal     Warith Habib    43 Black    NY    Erie            IO, SA "OF, WH~ Sexu~ 20 to L~      1983       1999 1     1     0     0     1     OM    derp 
## 3 Abernathy Christopher     17 White    IL    Cook            CIU, ~ "OF, WH~ Murd~ Life wi~      1987       2015 1     0     1     1     0     OM    derp 
## 4 Abney     Quentin         32 Black    NY    New York        CV     ""       Robb~ 20 to L~      2006       2012 0     1     0     0     0     derp  derp 
## 5 Abrego    Eruby           20 Hispanic IL    Cook            CDC, ~ "OF, WH~ Murd~ 90 years      2004       2022 0     1     1     1     0     OM    derp 
## 6 Acero     Longino         35 Hispanic CA    Santa Clara     NC, P  ""       Sex ~ 2 years~      1994       2006 0     0     0     0     0     derp  ILD  
## # ... with 1 more variable: count_3299 <lgl>

Check out this page for a quick overview.

9.2 Read SPSS files into R

Use foreign::read.spss

spss_version=foreign::read.spss(here::here("JLWOP", "Data and Models", "JLWOP_RYAN.sav"), to.data.frame = TRUE)

Might also want to add as_tibble() on the end.

9.3 Turn numbers into percentages

Use scales::percent(), which converts normal numbers into percentages and includes the percent sign (%) afterwards

simple_table=tribble(~n_people, ~votes_in_favor,
                     25, 14)

simple_table=simple_table %>% mutate(percent_voted_for=scales::percent(votes_in_favor/n_people, accuracy = 0.1, scale = 100))

simple_table

## # A tibble: 1 x 3
##   n_people votes_in_favor percent_voted_for
##      <dbl>          <dbl> <chr>            
## 1       25             14 56.0%

Scale is what to multiple the original number by (e.g., convert 0.05 to 5% by x100) Accuracy controls how many places out the decimal goes

9.4 Find all possible combindations of items in a vector

y <- c(2,4,6,8)

combn(c(2,4,6,8),2) # find all possible combinations of these numbers, drawn two at a time

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    2    2    4    4    6
## [2,]    4    6    8    6    8    8

9.5 Download files from the internet

9.6 Print multiple things in one statement

Use cat() from base R

cat("The p-value dropped below 0.05 for the first time as sample size", 100)

## The p-value dropped below 0.05 for the first time as sample size 100