Web Scraping with R

Lecture notes
Author
Affiliation

UiT The Arctic University of Norway

Published

February 13, 2023

Web Scraping: what? why?

  • Increasing amount of data is available on the web, provided in an unstructured format. You may copy and paste, but it is time consuming and prone to errors. So, instead, we can use R packages to scrape the data.

  • Web scraping is the process of extracting this information or data automatically and transform it into a structured dataset.

  • Two different scenarios:

    • Screen scraping: extract data from source code of website, which is called html parser.

    • Web APIs(application programming interface): some website offers a set of structured http requests that return JSON or XML files. So, we can use these web API’s interfaces to request and download data in our R environment.

Web Scraping using rvest R package

  • When scrapping web pages, we analyse the “source code (specifically the html code)” behind the web page. In most browsers, you can easily view the source code by right-clicking on the page and selecting “inspect” or view page source” or similar.

  • Elements in webpage are marked with so-called “html tags”.For instance, we often are interested to scrape html tables. The table we are interested in is very often inside html “table tags”.

  • Do a text search in the source code for “<table>”. There is only one place in the document and marks the beginning of the table. If you search again with “</table>” you will find where the table ends. In order to extract the content of the table, we must search for these tags, after we have identified the text between the “table” tags.

  • In R, there are several packages that can do web scraping. In this tutorial, we’ll have a closer look at one of them, rvest package.

  • If you search “web scraping rvest” online, you will find many resources. For instance, in youtube,you could watch this video. More resources here and there

  • The examples below illustrate how to scrape html table from various website using rvest package in R.

# we first start loading relevant packages and libraries to scrap data and perform the analysis  
library(tidyverse)
library(rvest)  # for web scraping 
library(janitor) # for data cleaning 
library(lubridate) #to work with date
library(quantmod) 

Example 1

  • In this example, we scrape the world happiness 2020 report from the wikipidia webpage. To scrape a table from this webpage, you could right-click on the page and choose to see the source code.
# URl of the data 
url <- "https://en.wikipedia.org/wiki/World_Happiness_Report#2021_report"

url %>% read_html() %>% 
  html_element("table.wikitable") %>%  
  html_table() %>% head(5)
# A tibble: 5 × 2
  `Overall rank` `Country or region`
           <int> <chr>              
1              1 Finland            
2              2 Denmark            
3              3 Iceland            
4              4 Switzerland        
5              5 Netherlands        
  • The above codes just pick only the first table on the page, which is not our interest here.
  • Try the following code:
url %>%
   read_html() %>% 
   html_nodes("table") %>%  
   html_table() %>% .[[13]] %>% head(5)
# A tibble: 5 × 9
  `Overall rank` Country…¹ Score GDP p…² Socia…³ Healt…⁴ Freed…⁵ Gener…⁶ Perce…⁷
           <int> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1              1 Finland    7.81    1.28    1.5    0.961   0.662   0.16    0.478
2              2 Denmark    7.65    1.33    1.50   0.979   0.665   0.243   0.495
3              3 Switzerl…  7.56    1.39    1.47   1.04    0.629   0.269   0.408
4              4 Iceland    7.50    1.33    1.55   1.00    0.662   0.362   0.145
5              5 Norway     7.49    1.42    1.50   1.01    0.67    0.288   0.434
# … with abbreviated variable names ¹​`Country or region`, ²​`GDP per capita`,
#   ³​`Social support`, ⁴​`Healthy life expectancy`,
#   ⁵​`Freedom to make life choices`, ⁶​Generosity, ⁷​`Perceptions of corruption`
#Assigning and renaming 
 table <- url %>%
   read_html() %>% 
   html_nodes("table") %>%  
   html_table() %>% .[[13]]

 names(table) <- c("rank", "country","score", "GDPP","social_support","LEXP","Freedom","Generosity","Pcorruption")
 
 head(table,5)
# A tibble: 5 × 9
   rank country     score  GDPP social_support  LEXP Freedom Generosity Pcorru…¹
  <int> <chr>       <dbl> <dbl>          <dbl> <dbl>   <dbl>      <dbl>    <dbl>
1     1 Finland      7.81  1.28           1.5  0.961   0.662      0.16     0.478
2     2 Denmark      7.65  1.33           1.50 0.979   0.665      0.243    0.495
3     3 Switzerland  7.56  1.39           1.47 1.04    0.629      0.269    0.408
4     4 Iceland      7.50  1.33           1.55 1.00    0.662      0.362    0.145
5     5 Norway       7.49  1.42           1.50 1.01    0.67       0.288    0.434
# … with abbreviated variable name ¹​Pcorruption
# Some plot using the data
# plot only upper 20 happiest country 
table %>% head(20) %>% 
  ggplot(aes(x= score, fct_reorder(country, score))) +
  geom_point()+ xlab("Happiness score") + ylab("country") + ggtitle('Top 20 happiest country')

# only lowest 20 countries 
table %>% tail(20) %>% 
  ggplot(aes(x= score, fct_reorder(country, score))) +
  geom_point()+xlab("Happiness score") + ylab("country") + ggtitle('Lower 20 happiest country')

Example 2

  • In this example, we scrap the interest rate barometer of the Norwegian Family Economy from the webpage. Right-click on the page to see the source code.
# URl of the data 
url <- "https://www.norskfamilie.no/barometre/rentebarometer/"

table <- url %>%
  read_html() %>% 
  html_nodes("table") %>% 
  html_table() 
head(table[[1]])
# A tibble: 6 × 9
     `` Bank                  ``    Navn  Nomin…¹ Sikke…² Etabl…³ Termin Effek…⁴
  <int> <chr>                 <lgl> <chr> <chr>   <chr>   <chr>    <int> <chr>  
1     1 Statens pensjonskasse NA    Boli… 2,96    0       0           50 3,06   
2     2 Nordnet Bank (Nordne… NA    Boli… 3,39    0       0            0 3,44   
3     3 Høland og Setskog Sp… NA    Grøn… 3,54    0       0           75 3,69   
4     4 NORDEA BANK ABP, FIL… NA    ECON… 3,59    0       0           45 3,71   
5     5 Nordnet Bank (Nordne… NA    Boli… 3,74    0       0            0 3,80   
6     6 SpareBank 1 Nordmøre  NA    Grøn… 3,65    1 200   0           75 3,80   
# … with abbreviated variable names ¹​Nominell, ²​Sikkerhetsgebyr,
#   ³​Etableringsgebyr, ⁴​Effektiv

Example 3

  • Not all website are easy to scrap.

  • In this example we scrape a table from the web-page.

  • The table contains data of taxpayers average income & wealth per birth year in Norway, from 1917 to 2003.

# URL of the data 
url <- "https://www.nettavisen.no/skattetall/penger/skattelistene-2022/stor-oversikt-dette-er-snittlonnen-for-ditt-arskull-pa-ditt-postnummer/s/5-95-806984"

url %>% read_html() %>% 
  html_elements("table") %>%  
  html_table()
list()
  • Empty list!. So how can get the table?
  • Hover over the html page, we can see that the table is
    embedded in an iframe tag.
  • In this tag, we can get the hidden datawrapper link of the data.
  • Get the datawrapper link (this link starts with src), and open it in new window.
  • Then press F12. That will open DevTools of the table.
  • Then find the .csv file from this page by pressing the keyboard Ctrl + R.
##Read data from the csv file of the data and look at the structure  
raw_dataset <- read_csv("https://datawrapper.dwcdn.net/sBuxT/3/dataset.csv")
str(raw_dataset)
spc_tbl_ [977 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ birthyr             : num [1:977] 1917 1918 1919 1920 1921 ...
 $ n_taxpayers         : num [1:977] 53 76 162 257 420 ...
 $ income              : num [1:977] 110184 118620 139005 165488 169149 ...
 $ fortune             : num [1:977] 495961 637675 886737 793308 915451 ...
 $ n_millionincome     : num [1:977] 1 0 1 1 2 1 2 3 7 15 ...
 $ n_millionfortune    : num [1:977] 11 19 41 71 128 ...
 $ Andel millioninntekt: chr [1:977] "1.89%" "0.00%" "0.62%" "0.39%" ...
 $ Andel millionformue : chr [1:977] "20.75%" "25.00%" "25.31%" "27.63%" ...
 - attr(*, "spec")=
  .. cols(
  ..   birthyr = col_double(),
  ..   n_taxpayers = col_number(),
  ..   income = col_number(),
  ..   fortune = col_number(),
  ..   n_millionincome = col_number(),
  ..   n_millionfortune = col_number(),
  ..   `Andel millioninntekt` = col_character(),
  ..   `Andel millionformue` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
  • The data frame contains eight variables.

  • Some cleaning of the data and defining the new variables.

#clean the data frame using the fun called "clean_names" 
#from the "janitor" package. 
dataset <-
  raw_dataset %>%
  clean_names()

#Filtering to make sure we do not have text at the end  
dataset <-
  dataset %>%
  filter(birthyr <= 2003)

# Define the new variables
dataset <-
  dataset %>% 
  mutate(share_mill_inc = 100*n_millionincome/n_taxpayers,
         share_mill_wealth = 100*n_millionfortune/n_taxpayers)
#head(dataset)
#tail(dataset)
  • Ploting birth year versus share of taxpayers with million wealth.
dataset %>% 
  ggplot(aes(x=birthyr, y=share_mill_wealth)) +
  geom_point() +
  xlab("Birth year") +
  ylab("Share of taxpayers with million wealth") +
  theme_bw()

  • It is obvious that the plot above is non-linear.

Example 4:

  • In this example, we will scrape our SOK-1005 course time plan for this semester. Here is Sok-1005 time plan list for this semester.
#URL
url <- "https://timeplan.uit.no/emne_timeplan.php?sem=23v&module%5B%5D=SOK-1005-1&week=1-20&View=list"
page <- read_html(url)

table <- html_nodes(page, 'table') # one table per week
table <- html_table(table, fill=TRUE) # force them into a list

table[[1]]
# A tibble: 3 × 6
  `Uke 3`           ``          ``         ``         ``          ``         
  <chr>             <chr>       <chr>      <chr>      <chr>       <chr>      
1 Dato              Tid         Rom        Emnekode   Beskrivelse Lærer      
2 Tirsdag17.01.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
3 Torsdag19.01.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
table[[6]]
# A tibble: 2 × 6
  `Uke 8`          ``          ``         ``         ``          ``         
  <chr>            <chr>       <chr>      <chr>      <chr>       <chr>      
1 Dato             Tid         Rom        Emnekode   Beskrivelse Lærer      
2 Mandag20.02.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
#Stack the tables into a list 
library(rlist)
dframe <- list.stack(table, use.names = FALSE ) 
head(dframe,5)
              Uke 3          V1          V2         V3          V4          V5
1              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
2 Tirsdag17.01.2023 10:15-12:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
3 Torsdag19.01.2023 10:15-12:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
4              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
5 Tirsdag24.01.2023 14:15-16:00 FARM F1.101 SOK-1005-1 Forelesning D.G. Kidane
# define first row as variable name
colnames(dframe) <- dframe[1,]
dframe
                Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
1               Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
2  Tirsdag17.01.2023 10:15-12:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
3  Torsdag19.01.2023 10:15-12:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
4               Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
5  Tirsdag24.01.2023 14:15-16:00 FARM F1.101 SOK-1005-1 Forelesning D.G. Kidane
6               Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
7   Mandag30.01.2023 10:15-12:00  HHT 02.217 SOK-1005-1     Seminar            
8               Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
9  Tirsdag07.02.2023 12:15-14:00  HHT 02.217 SOK-1005-1     Seminar            
10              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
11 Tirsdag14.02.2023 14:15-16:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
12              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
13  Mandag20.02.2023 10:15-12:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
14              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
15 Tirsdag28.02.2023 14:15-16:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
16              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
17 Tirsdag07.03.2023 14:15-16:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
18              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
19 Tirsdag14.03.2023 12:15-14:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
20              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
21 Tirsdag21.03.2023 10:15-12:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
22              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
23  Fredag31.03.2023 14:15-16:00  HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
24              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
25 Torsdag13.04.2023 10:15-12:00 FARM F1.101 SOK-1005-1 Forelesning  A. Jenssen
26              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
27 Torsdag20.04.2023 10:15-12:00  HHT 01.202 SOK-1005-1 Forelesning  A. Jenssen
28              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
29 Torsdag27.04.2023 10:15-12:00  HHT 01.202 SOK-1005-1 Forelesning  A. Jenssen
30              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
31 Torsdag04.05.2023 12:15-14:00  HHT 01.202 SOK-1005-1 Forelesning  A. Jenssen
32              Dato         Tid         Rom   Emnekode Beskrivelse       Lærer
33 Torsdag11.05.2023 10:15-12:00  HHT 01.202 SOK-1005-1 Forelesning  A. Jenssen
# remove the rows with Dato in it
dframe <- dframe %>% filter(!Dato=="Dato")

str(dframe)
'data.frame':   17 obs. of  6 variables:
 $ Dato       : chr  "Tirsdag17.01.2023" "Torsdag19.01.2023" "Tirsdag24.01.2023" "Mandag30.01.2023" ...
 $ Tid        : chr  "10:15-12:00" "10:15-12:00" "14:15-16:00" "10:15-12:00" ...
 $ Rom        : chr  "HHT 01.202" "HHT 01.202" "FARM F1.101" "HHT 02.217" ...
 $ Emnekode   : chr  "SOK-1005-1" "SOK-1005-1" "SOK-1005-1" "SOK-1005-1" ...
 $ Beskrivelse: chr  "Forelesning" "Forelesning" "Forelesning" "Seminar" ...
 $ Lærer      : chr  "D.G. Kidane" "D.G. Kidane" "D.G. Kidane" "" ...
# Separate the Dato into two columns:
dframe <- dframe %>% separate(Dato, 
                              into = c("Dag", "Dato"), 
                              sep = "(?<=[A-Za-z])(?=[0-9])")                                                   #sep, https://stackoverflow.com/questions/9756360/split-character-data-into-numbers-and-letters
# code into date format
dframe$Dato <- as.Date(dframe$Dato, format="%d.%m.%Y")
# generate a week variable
dframe$Uke <- strftime(dframe$Dato, format = "%V")
# select
dframe <- dframe %>% select(Dag,Dato,Uke,Tid,Rom)
dframe
       Dag       Dato Uke         Tid         Rom
1  Tirsdag 2023-01-17  03 10:15-12:00  HHT 01.202
2  Torsdag 2023-01-19  03 10:15-12:00  HHT 01.202
3  Tirsdag 2023-01-24  04 14:15-16:00 FARM F1.101
4   Mandag 2023-01-30  05 10:15-12:00  HHT 02.217
5  Tirsdag 2023-02-07  06 12:15-14:00  HHT 02.217
6  Tirsdag 2023-02-14  07 14:15-16:00  HHT 01.202
7   Mandag 2023-02-20  08 10:15-12:00  HHT 01.202
8  Tirsdag 2023-02-28  09 14:15-16:00  HHT 01.202
9  Tirsdag 2023-03-07  10 14:15-16:00  HHT 01.202
10 Tirsdag 2023-03-14  11 12:15-14:00  HHT 01.202
11 Tirsdag 2023-03-21  12 10:15-12:00  HHT 01.202
12  Fredag 2023-03-31  13 14:15-16:00  HHT 01.202
13 Torsdag 2023-04-13  15 10:15-12:00 FARM F1.101
14 Torsdag 2023-04-20  16 10:15-12:00  HHT 01.202
15 Torsdag 2023-04-27  17 10:15-12:00  HHT 01.202
16 Torsdag 2023-05-04  18 12:15-14:00  HHT 01.202
17 Torsdag 2023-05-11  19 10:15-12:00  HHT 01.202
str(dframe)
'data.frame':   17 obs. of  5 variables:
 $ Dag : chr  "Tirsdag" "Torsdag" "Tirsdag" "Mandag" ...
 $ Dato: Date, format: "2023-01-17" "2023-01-19" ...
 $ Uke : chr  "03" "03" "04" "05" ...
 $ Tid : chr  "10:15-12:00" "10:15-12:00" "14:15-16:00" "10:15-12:00" ...
 $ Rom : chr  "HHT 01.202" "HHT 01.202" "FARM F1.101" "HHT 02.217" ...

Task

Build a procedure that can scrape “many” courses, e.g. the 3 courses you have this semester, from a list

Example 5

  • We can also directly use R packages/libraries to scrape data into our R session. For instance, we can use the loadSymbols function from the quantmod package to scrape the stock prices of Exxon Mobile Corporation (XOM), traded at NYSE.
  • The XOM stock prices can be obtained from The Yahoo! Fiance site.
# Import XOM data from yahoo finance
loadSymbols("XOM", src="yahoo" )
[1] "XOM"
xom_data <- XOM
head(xom_data)
           XOM.Open XOM.High XOM.Low XOM.Close XOM.Volume XOM.Adjusted
2007-01-03    76.26    76.27   73.51     74.11   30510700     42.43665
2007-01-04    73.85    73.90   72.21     72.72   31046600     41.64070
2007-01-05    72.45    73.38   72.22     73.24   24671500     41.93845
2007-01-08    73.88    74.25   71.85     72.65   25981900     41.60062
2007-01-09    71.85    72.84   71.44     72.09   27039900     41.27994
2007-01-10    71.81    73.11   70.64     70.99   30572000     40.65007
class(xom_data) # this is xts or zoo
[1] "xts" "zoo"
# convert it to a data frame 
xom.df <- as.data.frame(xom_data)  #data frame 
class(xom.df)
[1] "data.frame"
head(xom.df)
           XOM.Open XOM.High XOM.Low XOM.Close XOM.Volume XOM.Adjusted
2007-01-03    76.26    76.27   73.51     74.11   30510700     42.43665
2007-01-04    73.85    73.90   72.21     72.72   31046600     41.64070
2007-01-05    72.45    73.38   72.22     73.24   24671500     41.93845
2007-01-08    73.88    74.25   71.85     72.65   25981900     41.60062
2007-01-09    71.85    72.84   71.44     72.09   27039900     41.27994
2007-01-10    71.81    73.11   70.64     70.99   30572000     40.65007

Data Wrangling

  • The above data is daily data. We can do some data wrangling to convert the daily data into monthly data. Let us consider only the Xom.Adjusted price from January 4th of 2010 to December 30th of 2022.
# calculate the monthly arithmetic average of xom price data.
# Also calculate the weighted average of xom, using volume traded as a weight.
xom <- xom.df %>% 
  mutate(Date = ymd(rownames(xom.df))) %>% # create a date variable from the row names
  filter(Date >="2010-01-04",Date<"2023-01-01") %>%  # filter data only between the specified dates 
  mutate(year = year(Date), #create the variable called year from the date variable 
         month = month(Date), #create the variable month from the date variable 
         day = day(Date)) %>% #create the variables day from the date variable 
  group_by(year,month) %>% 
  # calculate arithmetic average of xom.adjusted
  summarise(xom_monthly_art= mean(XOM.Adjusted),
    #calculate weighted average using volume as a weight.
            xom_monthly_wt=sum(XOM.Adjusted*XOM.Volume)/sum(XOM.Volume)) %>% 
  mutate(date = make_date(year,month,1)) %>% as_tibble() %>%  # make the date variable, setting all day's as first day 
  select(date, xom_monthly_wt)

#output
head(xom,5)
# A tibble: 5 × 2
  date       xom_monthly_wt
  <date>              <dbl>
1 2010-01-01           41.2
2 2010-02-01           39.9
3 2010-03-01           40.7
4 2010-04-01           41.8
5 2010-05-01           38.6
# Plot  
xom %>% ggplot(aes(x = date, y = xom_monthly_wt))+geom_point()

Note

Not all website owners think it’s okay for us to scrap their websites. Hence, they may make their websites hard to scrape.Furthermore,there might be also ethical concerns to scrape a website. However, when someone posts data on a web page, they have made the data public and cannot decide how to read the data. What may be illegal is to disclose the data commercially.

Example 6

Here read Web Scraping with R studio and rvest

Task:

  1. Use code to download from FRED the daily Brent Crude Oil Price from January 4th 2010 to December 30, 2022.

  2. Perform some data wrangling and convert the daily price to month price.

  3. Plot the monthly price against date.

API’s

An application programming interface (API) is a computing interface which in our case defines the interaction between our R session and a server with data. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used. Each API has its specified structure, but an API can automate your data collection. The basic steps are as follows:

Use the “httr” and “jsonlite” packages. Make a “GET” request to the API to pull raw data into your environment. “Parse” that data from its raw form through JavaScript Object Notification (JSON) into a usable format. If needed, loop through each “page” and retrieve the full data set. This RStudio webinar is on the basics, and some of the pitfalls of calling web API’s within R.

As an example, we will connect to: https://reqres.in a real API where we can test and learn the procedure. This is the R code for this example.

Reading data from Statistics Norway/SSB (using API, JSON or static web-scraping) Some data providers, like Statistics Norway (SSB) offer different options for data downloads. The data used in this example is found here.

We will download tables, 95274 (by counties)

Myfigure

We could either scrape it as a static web-page (csv link), scrape it as a JSON object (json link) or use an API with SSB’s R package PxWebApiData. Links with examples from R are found here. We will use the rjstat package to read JSON data.

The data is in long format. The names of the four columns in english should read: “region”, “date”, “variable”, “value”.

Challenges

We are faced with a couple of challenges. If we try to read the static web-page (csv) note that SSB uses “,” as a decimal point instead of “.” while this is not an issue using the API or JSON format. SSB uses the norwegian letter “æ, ø, å” in some labels.

The date format is a bit awkward, it uses a four digit year, then M for “month” and the month number with two digits. August 2020 would then be “2020M08”. It is formatted as a “character”. We need to change the date variable into a “date format”.

We will try different options to recode the Norwegian labels into English. The recode function in dplyr and in the car package, and ifelse statements.

When joining the two dataframes we have the data in long format. This is the R code for this example.