Chapter 9 Misc. Stuff
9.1 Scrape web pages for data tables
Note. See Chapter 10’s example purrr
walk through for a guide on how to scrape multiple web tables simultaneously
Simple example.
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(tidyverse)
=read_html('https://shop.tcgplayer.com/price-guide/pokemon/base-set') %>%
htmlhtml_table(fill = TRUE)
html
## [[1]]
## # A tibble: 101 x 6
## PRODUCT Rarity Number `Market Price` `Listed Median` ``
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Abra Common 043/102 $0.39 $0.46 View
## 2 Alakazam Holo Rare 001/102 $40.64 — View
## 3 Arcanine Uncommon 023/102 $2.94 $2.41 View
## 4 Beedrill Rare 017/102 $3.12 $3.20 View
## 5 Bill Common 091/102 $0.25 $0.30 View
## 6 Blastoise Holo Rare 002/102 $119.59 — View
## 7 Bulbasaur Common 044/102 $1.39 $2.21 View
## 8 Caterpie Common 045/102 $0.70 $0.75 View
## 9 Chansey Holo Rare 003/102 $25.35 — View
## 10 Charizard Holo Rare 004/102 $348.99 — View
## # ... with 91 more rows
# Saved as a list by default. Now extract your table from said list
=as_tibble(html[[1]] %>% # find out which number it is in the list
htmlselect('PRODUCT','Rarity','Number','Market Price')) # if needed, specify which columns you want too
html
## # A tibble: 101 x 4
## PRODUCT Rarity Number `Market Price`
## <chr> <chr> <chr> <chr>
## 1 Abra Common 043/102 $0.39
## 2 Alakazam Holo Rare 001/102 $40.64
## 3 Arcanine Uncommon 023/102 $2.94
## 4 Beedrill Rare 017/102 $3.12
## 5 Bill Common 091/102 $0.25
## 6 Blastoise Holo Rare 002/102 $119.59
## 7 Bulbasaur Common 044/102 $1.39
## 8 Caterpie Common 045/102 $0.70
## 9 Chansey Holo Rare 003/102 $25.35
## 10 Charizard Holo Rare 004/102 $348.99
## # ... with 91 more rows
# remove $ symbol in Price column to make it easier to work with
$`Market Price`=str_remove(html$`Market Price`, pattern = "\\$")
html
=html %>% mutate(`Market Price`=as.numeric(`Market Price`)) # convert from string to numeric
html
# view finished table
head(html)
## # A tibble: 6 x 4
## PRODUCT Rarity Number `Market Price`
## <chr> <chr> <chr> <dbl>
## 1 Abra Common 043/102 0.39
## 2 Alakazam Holo Rare 001/102 40.6
## 3 Arcanine Uncommon 023/102 2.94
## 4 Beedrill Rare 017/102 3.12
## 5 Bill Common 091/102 0.25
## 6 Blastoise Holo Rare 002/102 120.
Slightly more complicated example
Reading a table into R takes a few steps.
Step 1 is to copy and paste the URL into the read_html()
verb like below:
::p_load(rvest, tidyverse)
pacman
=read_html("https://www.law.umich.edu/special/exoneration/Pages/detaillist.aspx") %>%
exonerations_tablehtml_nodes("table.ms-listviewtable") %>%
html_table(fill=TRUE, header = TRUE)
Sometimes if the web page is extremely basic and pretty much the only thing on it is a table, you can stop there. Most of the time though, there will be tons of other stuff on the website and you need to get more specific so R can find the table. This is the html_nodes()
part of the above command; in there you specify the exact part of the web page where the table is located/what object file it is.
To find this you will need to use the Developer mode in your browser. See this screenshot for an example…
::include_graphics(here::here("pics", "scrape.png")) knitr
In Firefox you open this by going to Settings > More Tools > Web Developer Tools (or CNTRL + Shift + I).
Begin by looking through the console in the center bottom for names that look like they would be related to your table. A good place to start might be ““, which contains the main body of the web page. Click on a name to expand it and see all the elements on the page contained there.
Ultimately what you’re looking for is what you see above: an element that, when selected, highlights ONLY the area of the web page you’re looking for. To get at this you will need to keep expanding, highlighting, and clicking repeatedly….it can take some digging.
Keep drilling down through page elements until you find the one that highlights the table and just the table. When you find this, look for the .ms file in that name; you should also see this in the smaller console box on the right. That is the file you’ll need. Write that name in the html_node
command and read it into R.
That’s stage 1. From here you now need to clean up the table.
=as.data.frame(exonerations_table) # convert into a df exonerations_table
Your table might be different, but this one’s names were messed up when read in, so lets fix those first and then fix the rows and columns.
# save the names to a vector
=exonerations_table$Last.Name[1:20]
table_names
# Trim out the garbage rows and columns
=exonerations_table %>%
exonerations_tableselect(Last.Name:Tags.1) %>%
slice(22:n())
# over-write incorrect col names with the vector of correct ones we saved above
colnames(exonerations_table)=table_names
# clean up names
=exonerations_table %>% janitor::clean_names()
exonerations_table
# verify structure of columns is correct
# glimpse(exonerations_table)
Yikes, a lot of stuff is stored incorrectly, and as a result there’s some missing values that need to be addressed and other data that needs to be corrected.
=as_tibble(exonerations_table) %>% # convert to tibble
exonerations_tablemutate(across(c(dna,mwid:ild), na_if,"")) %>% # turn missing values into NA's
mutate(across(c(dna,mwid:ild), replace_na, "derp")) %>% # replace NA's with a string (required for the next lines to work)
mutate(dna=ifelse(dna=="DNA",1,0), # change these variables from text to numeric to better facilitate analysis
mwid=ifelse(mwid=="MWID",1,0),
fc=ifelse(fc=="FC",1,0),
p_fa=ifelse(p_fa=="P/FA",1,0),
f_mfe=ifelse(f_mfe=="F/MFE",1,0)) %>%
mutate(across(c(st, crime, dna:f_mfe),factor)) # correct form by converting to factors
And that’s it! Check out final result!
head(exonerations_table)
## # A tibble: 6 x 20
## last_name first_name age race st county_of_crime tags om_tags crime sentence convicted exonerated dna mwid fc p_fa f_mfe om ild
## <chr> <chr> <int> <chr> <fct> <chr> <chr> <chr> <fct> <chr> <int> <int> <fct> <fct> <fct> <fct> <fct> <chr> <chr>
## 1 Abbott Cinque 19 Black IL Cook CIU, ~ "OF, WH~ Drug~ Probati~ 2008 2022 0 0 0 1 0 OM derp
## 2 Abdal Warith Habib 43 Black NY Erie IO, SA "OF, WH~ Sexu~ 20 to L~ 1983 1999 1 1 0 0 1 OM derp
## 3 Abernathy Christopher 17 White IL Cook CIU, ~ "OF, WH~ Murd~ Life wi~ 1987 2015 1 0 1 1 0 OM derp
## 4 Abney Quentin 32 Black NY New York CV "" Robb~ 20 to L~ 2006 2012 0 1 0 0 0 derp derp
## 5 Abrego Eruby 20 Hispanic IL Cook CDC, ~ "OF, WH~ Murd~ 90 years 2004 2022 0 1 1 1 0 OM derp
## 6 Acero Longino 35 Hispanic CA Santa Clara NC, P "" Sex ~ 2 years~ 1994 2006 0 0 0 0 0 derp ILD
## # ... with 1 more variable: count_3299 <lgl>
Check out this page for a quick overview.
9.2 Read SPSS files into R
Use foreign::read.spss
=foreign::read.spss(here::here("JLWOP", "Data and Models", "JLWOP_RYAN.sav"), to.data.frame = TRUE) spss_version
Might also want to add as_tibble()
on the end.
9.3 Turn numbers into percentages
Use scales::percent()
, which converts normal numbers into percentages and includes the percent sign (%) afterwards
=tribble(~n_people, ~votes_in_favor,
simple_table25, 14)
=simple_table %>% mutate(percent_voted_for=scales::percent(votes_in_favor/n_people, accuracy = 0.1, scale = 100))
simple_table
simple_table
## # A tibble: 1 x 3
## n_people votes_in_favor percent_voted_for
## <dbl> <dbl> <chr>
## 1 25 14 56.0%
Scale is what to multiple the original number by (e.g., convert 0.05 to 5% by x100) Accuracy controls how many places out the decimal goes