# we first load the relevant packages and libraries to scrape data and perform the analysis
library(tidyverse)
library(rvest) # for web scraping
library(janitor) # for data cleaning
library(lubridate) # to work with dates
library(quantmod) # to download financial data
Web Scraping: what? why?
An increasing amount of data is available on the web, often provided in an unstructured format. You could copy and paste it, but that is time-consuming and prone to errors. Instead, we can use R packages to scrape the data.
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.
There are two different scenarios:
Screen scraping: extracting data from the source code of a website with an HTML parser.
Web APIs (application programming interfaces): some websites offer a set of structured HTTP requests that return JSON or XML files. We can use these interfaces to request and download data into our R environment.
Web Scraping using the rvest R package
When scraping web pages, we analyse the source code (specifically the HTML) behind the page. In most browsers, you can easily view the source code by right-clicking on the page and selecting “Inspect”, “View page source” or similar.
Elements in a web page are marked with so-called HTML tags. For instance, we are often interested in scraping HTML tables, and the table we are after is usually enclosed in HTML “table” tags.
Do a text search in the source code for “<table>”. It occurs in only one place in the document and marks the beginning of the table. If you search again for “</table>” you will find where the table ends. To extract the content of the table, we search for these tags and then pull out the text between them.
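For instance, you can mimic this search from R itself; a minimal sketch, using the Wikipedia page that Example 1 below scrapes:
# find where the first table starts and ends in the raw source
html_lines <- readLines("https://en.wikipedia.org/wiki/World_Happiness_Report")
grep("<table", html_lines)[1]   # first line on which a table opens
grep("</table>", html_lines)[1] # first line on which a table closes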
In R, there are several packages that can do web scraping. In this tutorial, we take a closer look at one of them, the rvest package.
If you search for “web scraping rvest” online, you will find many resources. For instance, on YouTube you could watch this video. More resources here and there.
The examples below illustrate how to scrape HTML tables from various websites using the rvest package in R.
Example 1
- In this example, we scrape the 2020 World Happiness Report table from the Wikipedia page. To scrape a table from this page, you could right-click on the page and view the source code.
# URL of the data
url <- "https://en.wikipedia.org/wiki/World_Happiness_Report#2021_report"

url %>% read_html() %>%
  html_element("table.wikitable") %>%
  html_table() %>% head(5)
# A tibble: 5 × 2
`Overall rank` `Country or region`
<int> <chr>
1 1 Finland
2 2 Denmark
3 3 Iceland
4 4 Switzerland
5 5 Netherlands
- The code above picks only the first table on the page, which is not the one we are interested in here.
- Try the following code:
url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table() %>% .[[13]] %>% head(5)
# A tibble: 5 × 9
`Overall rank` Country…¹ Score GDP p…² Socia…³ Healt…⁴ Freed…⁵ Gener…⁶ Perce…⁷
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Finland 7.81 1.28 1.5 0.961 0.662 0.16 0.478
2 2 Denmark 7.65 1.33 1.50 0.979 0.665 0.243 0.495
3 3 Switzerl… 7.56 1.39 1.47 1.04 0.629 0.269 0.408
4 4 Iceland 7.50 1.33 1.55 1.00 0.662 0.362 0.145
5 5 Norway 7.49 1.42 1.50 1.01 0.67 0.288 0.434
# … with abbreviated variable names ¹`Country or region`, ²`GDP per capita`,
# ³`Social support`, ⁴`Healthy life expectancy`,
# ⁵`Freedom to make life choices`, ⁶Generosity, ⁷`Perceptions of corruption`
# Assigning and renaming
table <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table() %>% .[[13]]
names(table) <- c("rank", "country","score", "GDPP","social_support","LEXP","Freedom","Generosity","Pcorruption")
head(table,5)
# A tibble: 5 × 9
rank country score GDPP social_support LEXP Freedom Generosity Pcorru…¹
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Finland 7.81 1.28 1.5 0.961 0.662 0.16 0.478
2 2 Denmark 7.65 1.33 1.50 0.979 0.665 0.243 0.495
3 3 Switzerland 7.56 1.39 1.47 1.04 0.629 0.269 0.408
4 4 Iceland 7.50 1.33 1.55 1.00 0.662 0.362 0.145
5 5 Norway 7.49 1.42 1.50 1.01 0.67 0.288 0.434
# … with abbreviated variable name ¹Pcorruption
# Some plots using the data
# plot only the 20 happiest countries
table %>% head(20) %>%
  ggplot(aes(x = score, y = fct_reorder(country, score))) +
  geom_point() + xlab("Happiness score") + ylab("country") + ggtitle("Top 20 happiest countries")

# plot only the 20 lowest-ranked countries
table %>% tail(20) %>%
  ggplot(aes(x = score, y = fct_reorder(country, score))) +
  geom_point() + xlab("Happiness score") + ylab("country") + ggtitle("The 20 least happy countries")
Example 2
- In this example, we scrape the interest rate barometer of the Norwegian Family Economy from its web page. Right-click on the page to see the source code.
# URL of the data
url <- "https://www.norskfamilie.no/barometre/rentebarometer/"

table <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()
head(table[[1]])
# A tibble: 6 × 9
`` Bank `` Navn Nomin…¹ Sikke…² Etabl…³ Termin Effek…⁴
<int> <chr> <lgl> <chr> <chr> <chr> <chr> <int> <chr>
1 1 Statens pensjonskasse NA Boli… 2,96 0 0 50 3,06
2 2 Nordnet Bank (Nordne… NA Boli… 3,39 0 0 0 3,44
3 3 Høland og Setskog Sp… NA Grøn… 3,54 0 0 75 3,69
4 4 NORDEA BANK ABP, FIL… NA ECON… 3,59 0 0 45 3,71
5 5 Nordnet Bank (Nordne… NA Boli… 3,74 0 0 0 3,80
6 6 SpareBank 1 Nordmøre NA Grøn… 3,65 1 200 0 75 3,80
# … with abbreviated variable names ¹Nominell, ²Sikkerhetsgebyr,
# ³Etableringsgebyr, ⁴Effektiv
Example 3
Not all websites are easy to scrape.
In this example we scrape a table from a news web page.
The table contains data on taxpayers’ average income and wealth per birth year in Norway, for the cohorts born 1917 to 2003.
# URL of the data
url <- "https://www.nettavisen.no/skattetall/penger/skattelistene-2022/stor-oversikt-dette-er-snittlonnen-for-ditt-arskull-pa-ditt-postnummer/s/5-95-806984"

url %>% read_html() %>%
  html_elements("table") %>%
  html_table()
list()
- An empty list! So how can we get the table?
- Inspecting the HTML of the page, we see that the table is embedded in an iframe tag.
- Inside this tag, we can find the hidden Datawrapper link to the data.
- Get the Datawrapper link (it follows src), and open it in a new window.
- Then press F12 to open the browser’s DevTools.
- Then press Ctrl + R to reload; the .csv file shows up among the network requests.
# Read data from the csv file and look at the structure
raw_dataset <- read_csv("https://datawrapper.dwcdn.net/sBuxT/3/dataset.csv")
str(raw_dataset)
spc_tbl_ [977 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ birthyr : num [1:977] 1917 1918 1919 1920 1921 ...
$ n_taxpayers : num [1:977] 53 76 162 257 420 ...
$ income : num [1:977] 110184 118620 139005 165488 169149 ...
$ fortune : num [1:977] 495961 637675 886737 793308 915451 ...
$ n_millionincome : num [1:977] 1 0 1 1 2 1 2 3 7 15 ...
$ n_millionfortune : num [1:977] 11 19 41 71 128 ...
$ Andel millioninntekt: chr [1:977] "1.89%" "0.00%" "0.62%" "0.39%" ...
$ Andel millionformue : chr [1:977] "20.75%" "25.00%" "25.31%" "27.63%" ...
- attr(*, "spec")=
.. cols(
.. birthyr = col_double(),
.. n_taxpayers = col_number(),
.. income = col_number(),
.. fortune = col_number(),
.. n_millionincome = col_number(),
.. n_millionfortune = col_number(),
.. `Andel millioninntekt` = col_character(),
.. `Andel millionformue` = col_character()
.. )
- attr(*, "problems")=<externalptr>
The data frame contains eight variables.
Next, we do some cleaning of the data and define new variables.
# clean the data frame using the function "clean_names"
# from the "janitor" package
dataset <- raw_dataset %>%
  clean_names()

# filter to make sure we do not have text at the end
dataset <- dataset %>%
  filter(birthyr <= 2003)

# define the new variables
dataset <- dataset %>%
  mutate(share_mill_inc = 100*n_millionincome/n_taxpayers,
         share_mill_wealth = 100*n_millionfortune/n_taxpayers)

#head(dataset)
#tail(dataset)
- Plotting birth year versus the share of taxpayers with wealth of at least a million.
dataset %>%
  ggplot(aes(x = birthyr, y = share_mill_wealth)) +
  geom_point() +
  xlab("Birth year") +
  ylab("Share of taxpayers with million wealth") +
  theme_bw()
- The relationship in the plot above is clearly non-linear.
Example 4
- In this example, we scrape the SOK-1005 course time plan for this semester. Here is the SOK-1005 time plan list for this semester.
# URL
url <- "https://timeplan.uit.no/emne_timeplan.php?sem=23v&module%5B%5D=SOK-1005-1&week=1-20&View=list"
page <- read_html(url)

table <- html_nodes(page, 'table')      # one table per week
table <- html_table(table, fill = TRUE) # force them into a list

table[[1]]
# A tibble: 3 × 6
`Uke 3` `` `` `` `` ``
<chr> <chr> <chr> <chr> <chr> <chr>
1 Dato Tid Rom Emnekode Beskrivelse Lærer
2 Tirsdag17.01.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
3 Torsdag19.01.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
table[[6]]
# A tibble: 2 × 6
`Uke 8` `` `` `` `` ``
<chr> <chr> <chr> <chr> <chr> <chr>
1 Dato Tid Rom Emnekode Beskrivelse Lærer
2 Mandag20.02.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
# Stack the tables into one data frame
library(rlist)
dframe <- list.stack(table, use.names = FALSE)
head(dframe, 5)
Uke 3 V1 V2 V3 V4 V5
1 Dato Tid Rom Emnekode Beskrivelse Lærer
2 Tirsdag17.01.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
3 Torsdag19.01.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
4 Dato Tid Rom Emnekode Beskrivelse Lærer
5 Tirsdag24.01.2023 14:15-16:00 FARM F1.101 SOK-1005-1 Forelesning D.G. Kidane
# define first row as variable name
colnames(dframe) <- dframe[1,]
dframe
Dato Tid Rom Emnekode Beskrivelse Lærer
1 Dato Tid Rom Emnekode Beskrivelse Lærer
2 Tirsdag17.01.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
3 Torsdag19.01.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
4 Dato Tid Rom Emnekode Beskrivelse Lærer
5 Tirsdag24.01.2023 14:15-16:00 FARM F1.101 SOK-1005-1 Forelesning D.G. Kidane
6 Dato Tid Rom Emnekode Beskrivelse Lærer
7 Mandag30.01.2023 10:15-12:00 HHT 02.217 SOK-1005-1 Seminar
8 Dato Tid Rom Emnekode Beskrivelse Lærer
9 Tirsdag07.02.2023 12:15-14:00 HHT 02.217 SOK-1005-1 Seminar
10 Dato Tid Rom Emnekode Beskrivelse Lærer
11 Tirsdag14.02.2023 14:15-16:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
12 Dato Tid Rom Emnekode Beskrivelse Lærer
13 Mandag20.02.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
14 Dato Tid Rom Emnekode Beskrivelse Lærer
15 Tirsdag28.02.2023 14:15-16:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
16 Dato Tid Rom Emnekode Beskrivelse Lærer
17 Tirsdag07.03.2023 14:15-16:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
18 Dato Tid Rom Emnekode Beskrivelse Lærer
19 Tirsdag14.03.2023 12:15-14:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
20 Dato Tid Rom Emnekode Beskrivelse Lærer
21 Tirsdag21.03.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
22 Dato Tid Rom Emnekode Beskrivelse Lærer
23 Fredag31.03.2023 14:15-16:00 HHT 01.202 SOK-1005-1 Forelesning D.G. Kidane
24 Dato Tid Rom Emnekode Beskrivelse Lærer
25 Torsdag13.04.2023 10:15-12:00 FARM F1.101 SOK-1005-1 Forelesning A. Jenssen
26 Dato Tid Rom Emnekode Beskrivelse Lærer
27 Torsdag20.04.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning A. Jenssen
28 Dato Tid Rom Emnekode Beskrivelse Lærer
29 Torsdag27.04.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning A. Jenssen
30 Dato Tid Rom Emnekode Beskrivelse Lærer
31 Torsdag04.05.2023 12:15-14:00 HHT 01.202 SOK-1005-1 Forelesning A. Jenssen
32 Dato Tid Rom Emnekode Beskrivelse Lærer
33 Torsdag11.05.2023 10:15-12:00 HHT 01.202 SOK-1005-1 Forelesning A. Jenssen
# remove the repeated header rows (those with "Dato" in them)
dframe <- dframe %>% filter(Dato != "Dato")
str(dframe)
'data.frame': 17 obs. of 6 variables:
$ Dato : chr "Tirsdag17.01.2023" "Torsdag19.01.2023" "Tirsdag24.01.2023" "Mandag30.01.2023" ...
$ Tid : chr "10:15-12:00" "10:15-12:00" "14:15-16:00" "10:15-12:00" ...
$ Rom : chr "HHT 01.202" "HHT 01.202" "FARM F1.101" "HHT 02.217" ...
$ Emnekode : chr "SOK-1005-1" "SOK-1005-1" "SOK-1005-1" "SOK-1005-1" ...
$ Beskrivelse: chr "Forelesning" "Forelesning" "Forelesning" "Seminar" ...
$ Lærer : chr "D.G. Kidane" "D.G. Kidane" "D.G. Kidane" "" ...
# Separate Dato into two columns:
dframe <- dframe %>% separate(Dato,
                              into = c("Dag", "Dato"),
                              sep = "(?<=[A-Za-z])(?=[0-9])") # for this sep regex, see https://stackoverflow.com/questions/9756360/split-character-data-into-numbers-and-letters

# convert Dato into date format
dframe$Dato <- as.Date(dframe$Dato, format = "%d.%m.%Y")

# generate a week variable
dframe$Uke <- strftime(dframe$Dato, format = "%V")

# select the columns we need
dframe <- dframe %>% select(Dag, Dato, Uke, Tid, Rom)
dframe
Dag Dato Uke Tid Rom
1 Tirsdag 2023-01-17 03 10:15-12:00 HHT 01.202
2 Torsdag 2023-01-19 03 10:15-12:00 HHT 01.202
3 Tirsdag 2023-01-24 04 14:15-16:00 FARM F1.101
4 Mandag 2023-01-30 05 10:15-12:00 HHT 02.217
5 Tirsdag 2023-02-07 06 12:15-14:00 HHT 02.217
6 Tirsdag 2023-02-14 07 14:15-16:00 HHT 01.202
7 Mandag 2023-02-20 08 10:15-12:00 HHT 01.202
8 Tirsdag 2023-02-28 09 14:15-16:00 HHT 01.202
9 Tirsdag 2023-03-07 10 14:15-16:00 HHT 01.202
10 Tirsdag 2023-03-14 11 12:15-14:00 HHT 01.202
11 Tirsdag 2023-03-21 12 10:15-12:00 HHT 01.202
12 Fredag 2023-03-31 13 14:15-16:00 HHT 01.202
13 Torsdag 2023-04-13 15 10:15-12:00 FARM F1.101
14 Torsdag 2023-04-20 16 10:15-12:00 HHT 01.202
15 Torsdag 2023-04-27 17 10:15-12:00 HHT 01.202
16 Torsdag 2023-05-04 18 12:15-14:00 HHT 01.202
17 Torsdag 2023-05-11 19 10:15-12:00 HHT 01.202
str(dframe)
'data.frame': 17 obs. of 5 variables:
$ Dag : chr "Tirsdag" "Torsdag" "Tirsdag" "Mandag" ...
$ Dato: Date, format: "2023-01-17" "2023-01-19" ...
$ Uke : chr "03" "03" "04" "05" ...
$ Tid : chr "10:15-12:00" "10:15-12:00" "14:15-16:00" "10:15-12:00" ...
$ Rom : chr "HHT 01.202" "HHT 01.202" "FARM F1.101" "HHT 02.217" ...
Task
Build a procedure that can scrape “many” courses, e.g. the three courses you are taking this semester, from a list. One possible approach is sketched below.
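A sketch, not a definitive solution: wrap the steps above in a function and map it over a vector of course codes. The codes SOK-1006 and SOK-1016 below are just examples, and the sketch assumes the other courses follow the same timeplan.uit.no URL pattern.
library(rvest)
library(purrr)
library(rlist)

scrape_course <- function(course) {
  # build the list-view URL for one course (assumed URL pattern)
  url <- paste0("https://timeplan.uit.no/emne_timeplan.php?sem=23v&module%5B%5D=",
                course, "-1&week=1-20&View=list")
  read_html(url) %>%
    html_nodes("table") %>%
    html_table(fill = TRUE) %>%
    list.stack(use.names = FALSE)  # stack the weekly tables, as above
}

courses <- c("SOK-1005", "SOK-1006", "SOK-1016")  # example course codes
plans <- map(courses, scrape_course)              # one stacked data frame per course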
Example 5
- We can also use R packages/libraries to download data directly into our R session. For instance, we can use the loadSymbols function from the quantmod package to get the stock prices of Exxon Mobil Corporation (XOM), traded on the NYSE.
- The XOM stock prices are obtained from the Yahoo! Finance site.
# Import XOM data from yahoo finance
loadSymbols("XOM", src="yahoo" )
[1] "XOM"
xom_data <- XOM
head(xom_data)
XOM.Open XOM.High XOM.Low XOM.Close XOM.Volume XOM.Adjusted
2007-01-03 76.26 76.27 73.51 74.11 30510700 42.43665
2007-01-04 73.85 73.90 72.21 72.72 31046600 41.64070
2007-01-05 72.45 73.38 72.22 73.24 24671500 41.93845
2007-01-08 73.88 74.25 71.85 72.65 25981900 41.60062
2007-01-09 71.85 72.84 71.44 72.09 27039900 41.27994
2007-01-10 71.81 73.11 70.64 70.99 30572000 40.65007
class(xom_data) # this is xts or zoo
[1] "xts" "zoo"
# convert it to a data frame
xom.df <- as.data.frame(xom_data)
class(xom.df)
[1] "data.frame"
head(xom.df)
XOM.Open XOM.High XOM.Low XOM.Close XOM.Volume XOM.Adjusted
2007-01-03 76.26 76.27 73.51 74.11 30510700 42.43665
2007-01-04 73.85 73.90 72.21 72.72 31046600 41.64070
2007-01-05 72.45 73.38 72.22 73.24 24671500 41.93845
2007-01-08 73.88 74.25 71.85 72.65 25981900 41.60062
2007-01-09 71.85 72.84 71.44 72.09 27039900 41.27994
2007-01-10 71.81 73.11 70.64 70.99 30572000 40.65007
Data Wrangling
- The data above are daily. We can do some data wrangling to convert the daily data into monthly data. Let us consider only the XOM.Adjusted price from January 4th 2010 to December 30th 2022.
# calculate the monthly arithmetic average of the XOM price data.
# Also calculate the weighted average, using volume traded as the weight.
xom <- xom.df %>%
  mutate(Date = ymd(rownames(xom.df))) %>%              # create a date variable from the row names
  filter(Date >= "2010-01-04", Date < "2023-01-01") %>% # keep only data between the specified dates
  mutate(year = year(Date),   # create the variable year from the date variable
         month = month(Date), # create the variable month from the date variable
         day = day(Date)) %>% # create the variable day from the date variable
  group_by(year, month) %>%
  # calculate the arithmetic average of XOM.Adjusted
  summarise(xom_monthly_art = mean(XOM.Adjusted),
            # calculate the weighted average using volume as the weight
            xom_monthly_wt = sum(XOM.Adjusted*XOM.Volume)/sum(XOM.Volume)) %>%
  mutate(date = make_date(year, month, 1)) %>% as_tibble() %>% # set each month's date to its first day
  select(date, xom_monthly_wt)

# output
head(xom, 5)
# A tibble: 5 × 2
date xom_monthly_wt
<date> <dbl>
1 2010-01-01 41.2
2 2010-02-01 39.9
3 2010-03-01 40.7
4 2010-04-01 41.8
5 2010-05-01 38.6
# Plot
xom %>% ggplot(aes(x = date, y = xom_monthly_wt)) + geom_point()
Not all website owners think it is okay for us to scrape their websites, and they may therefore make their sites hard to scrape. Furthermore, there may also be ethical concerns around scraping a website. However, when someone posts data on a web page, they have made the data public and cannot control how it is read. What may be illegal is to redistribute the data commercially.
Example 6
Here, read Web Scraping with RStudio and rvest.
Task:
Use code to download from FRED the daily Brent crude oil price from January 4th 2010 to December 30th 2022.
Perform some data wrangling and convert the daily prices to monthly prices.
Plot the monthly price against the date. One possible sketch follows below.
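A sketch of one possible solution, assuming the FRED series DCOILBRENTEU (Crude Oil Prices: Brent - Europe, daily) and reusing the wrangling steps from the XOM example above:
library(quantmod)
library(tidyverse)
library(lubridate)

getSymbols("DCOILBRENTEU", src = "FRED")  # daily Brent crude from FRED
brent.df <- as.data.frame(DCOILBRENTEU)

brent <- brent.df %>%
  mutate(Date = ymd(rownames(brent.df))) %>%            # date variable from the row names
  filter(Date >= "2010-01-04", Date < "2023-01-01") %>% # keep the requested window
  group_by(year = year(Date), month = month(Date)) %>%
  summarise(price = mean(DCOILBRENTEU, na.rm = TRUE), .groups = "drop") %>%
  mutate(date = make_date(year, month, 1))              # first day of each month

brent %>% ggplot(aes(x = date, y = price)) + geom_point()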
API’s
An application programming interface (API) is a computing interface which in our case defines the interaction between our R session and a server with data. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used. Each API has its specified structure, but an API can automate your data collection. The basic steps are as follows:
Use the “httr” and “jsonlite” packages. Make a “GET” request to the API to pull raw data into your environment. “Parse” that data from its raw form through JavaScript Object Notification (JSON) into a usable format. If needed, loop through each “page” and retrieve the full data set. This RStudio webinar is on the basics, and some of the pitfalls of calling web API’s within R.
As an example, we will connect to https://reqres.in, a real API where we can test and learn the procedure. This is the R code for this example; a minimal sketch of the steps is also given below.
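A minimal sketch of these steps against reqres.in (the /api/users endpoint and its total_pages field are part of that test service):
library(httr)
library(jsonlite)

res <- GET("https://reqres.in/api/users?page=2")                   # step 1: GET request
parsed <- fromJSON(content(res, as = "text", encoding = "UTF-8"))  # step 2: parse the JSON
users <- parsed$data    # the payload as a data frame
head(users)
parsed$total_pages      # used when looping over pages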
Reading data from Statistics Norway/SSB (using an API, JSON, or static web scraping)
Some data providers, like Statistics Norway (SSB), offer several options for data downloads. The data used in this example is found here.
We will download table 95274 (by counties).
We could either scrape it as a static web page (csv link), scrape it as a JSON object (json link), or use an API with SSB’s R package PxWebApiData. Links with examples in R are found here. We will use the rjstat package to read the JSON data; a sketch of the API route is given below.
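As a sketch of the API route, one can POST a JSON query to the table’s PxWeb endpoint and parse the JSON-stat response with rjstat. The endpoint pattern and the empty query (which asks for the whole table) are assumptions to adapt to your variable selection:
library(httr)
library(rjstat)

url <- "https://data.ssb.no/api/v0/en/table/95274"              # assumed PxWeb endpoint
query <- '{"query": [], "response": {"format": "json-stat2"}}'  # empty query: whole table (assumption)

res <- POST(url, body = query)
ssb <- fromJSONstat(content(res, as = "text", encoding = "UTF-8"), naming = "label")
str(ssb)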
The data is in long format. The names of the four columns in English should read: “region”, “date”, “variable”, “value”.
Challenges
We are faced with a couple of challenges. If we try to read the static web page (csv), note that SSB uses “,” as the decimal mark instead of “.”, while this is not an issue using the API or the JSON format. SSB also uses the Norwegian letters “æ, ø, å” in some labels. See the illustration below.
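A tiny illustration of the decimal-comma point, using an inline string in place of SSB’s csv file; read_csv2 is readr’s reader for semicolon-separated files with comma decimals:
library(readr)
txt <- "region;value\nTroms;12,5\n"  # toy stand-in for SSB's csv
read_csv2(txt)                       # parses value as the number 12.5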
The date format is a bit awkward: it uses a four-digit year, then M for “month”, and the month number with two digits. August 2020 is thus “2020M08”. It is formatted as a character, so we need to convert the date variable into a date format, as sketched below.
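A minimal sketch of that conversion (the character vector is illustrative):
library(lubridate)
months_chr <- c("2020M08", "2021M01")  # SSB's year-month strings
ym(gsub("M", "-", months_chr))         # Dates: "2020-08-01" "2021-01-01"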
We will try different options to recode the Norwegian labels into English: the recode function in dplyr and in the car package, and ifelse statements, as in the sketch below.
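A small sketch of the dplyr::recode and ifelse options (the labels are illustrative):
library(dplyr)
region <- c("Hele landet", "Troms og Finnmark")
recode(region, "Hele landet" = "Whole country")           # dplyr::recode
ifelse(region == "Hele landet", "Whole country", region)  # the ifelse route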
When joining the two data frames, we keep the data in long format. This is the R code for this example.