Web scraping with rvest and RSelenium
How to use rvest and RSelenium to get a list of billionaires
The idea is to show an example using both rvest and RSelenium in R to do some web scraping.
For rvest, you only need to install the library. For RSelenium, you need to install the library and also get a standalone Selenium server, using either the Java standalone or the Docker standalone (see the RSelenium basics vignette). A third option is to use the rsDriver function, but it has given me a lot of problems lately.
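As a minimal setup sketch (the Docker image is the same selenium/standalone-firefox one used later in this post):

```r
# One-time setup: install both packages from CRAN
install.packages(c("rvest", "RSelenium"))

# Then, outside of R, start a Selenium server, e.g. with Docker:
# docker run -d -p 4445:4444 selenium/standalone-firefox
```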
Difference between rvest and RSelenium
rvest helps you read an HTML page and extract elements from it. The interactions you can have with the webpage include filling in forms and handling sessions with passwords.
It is incredibly efficient and super handy, like almost all packages developed by Hadley Wickham.
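For instance, form handling in rvest looks roughly like this (a sketch against the rvest 0.3.x API of the time; the login form, field names, and URL are made up for illustration):

```r
library(rvest)

# A tiny in-memory page containing a login form (hypothetical fields)
page <- read_html('
  <form method="post" action="/login">
    <input type="text" name="user">
    <input type="password" name="password">
  </form>')

# Grab the form and fill in its fields
form <- html_form(page)[[1]]
filled <- set_values(form, user = "jane", password = "s3cret")

# Against a real site you would then do something like:
# session <- html_session("https://example.com/login")
# submit_form(session, filled)
```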
So what is the point of RSelenium? You cannot interact with JavaScript using rvest. RSelenium gives you a web browser that you can control from your code, so you can do everything you would do in a browser. It is primarily intended for testing websites for loading times and problems.
I usually use only rvest, and turn to RSelenium when I need to interact with JavaScript. The Forbes website, with its billionaires list, will give us an example: https://www.forbes.com/real-time-billionaires
Get the present list of billionaires
Let’s try to get the list of billionaires.
library(data.table)
library(rvest)
## Loading required package: xml2
library(RSelenium)
## Warning: package 'RSelenium' was built under R version 3.6.3
Using the element inspector, we see that the table is a dynamic table, and that the data of interest are in table rows <tr>
of class base ng-scope
. We can thus read the page and try to get the nodes corresponding to these rows:
url <- "https://www.forbes.com/real-time-billionaires"
page <- read_html(url)
page %>%
html_nodes(xpath = "//tr[@class='base ng-scope']")
## {xml_nodeset (0)}
This does not work. Let’s have a look at what the page contains:
page
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n<div id="consent_blackbar"></div>\n\t<div id="teconsent">\n\t\t<s ...
The page is blocked by the cookie consent banner
Here it will be easier with RSelenium
, because we have to click buttons. I use the Firefox standalone with Docker. I first run (in my computer's console) docker run -d -p 4445:4444 selenium/standalone-firefox
to expose the standalone Firefox on port 4445.
remDr <- remoteDriver(port = 4445L)
remDr$open()
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
##
## $browserName
## [1] "firefox"
##
## $browserVersion
## [1] "84.0.1"
##
## $`moz:accessibilityChecks`
## [1] FALSE
##
## $`moz:buildID`
## [1] "20201221152838"
##
## $`moz:geckodriverVersion`
## [1] "0.28.0"
##
## $`moz:headless`
## [1] FALSE
##
## $`moz:processID`
## [1] 776
##
## $`moz:profile`
## [1] "/tmp/rust_mozprofileNDHipn"
##
## $`moz:shutdownTimeout`
## [1] 60000
##
## $`moz:useNonSpecCompliantPointerOrigin`
## [1] FALSE
##
## $`moz:webdriverClick`
## [1] TRUE
##
## $pageLoadStrategy
## [1] "normal"
##
## $platformName
## [1] "linux"
##
## $platformVersion
## [1] "4.19.121-linuxkit"
##
## $rotatable
## [1] FALSE
##
## $setWindowRect
## [1] TRUE
##
## $strictFileInteractability
## [1] FALSE
##
## $timeouts
## $timeouts$implicit
## [1] 0
##
## $timeouts$pageLoad
## [1] 300000
##
## $timeouts$script
## [1] 30000
##
##
## $unhandledPromptBehavior
## [1] "dismiss and notify"
##
## $webdriver.remote.sessionid
## [1] "31281786-c20e-4dd3-a0fa-f0ba23a86cfb"
##
## $id
## [1] "31281786-c20e-4dd3-a0fa-f0ba23a86cfb"
url <- "https://www.forbes.com/real-time-billionaires"
remDr$navigate(url)
You can get a screenshot of what the browser sees, which is useful if you don’t have an actual browser window on your computer (you can watch the browser if you use the Java standalone or rsDriver, but not with Docker):
remDr$screenshot(display = T)
I then want to find the button, which has the class trustarc-agree-btn
. I use XPath a lot, because you can use functions like starts-with
or contains
, which easily get you the parts you want. See this great tutorial on W3Schools, and a list of functions on this page from Mozilla.
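To see what these functions do, here is a self-contained example on a tiny document mimicking the consent buttons:

```r
library(rvest)

# A small in-memory document with classes similar to the Forbes page
doc <- read_html('
  <button class="trustarc-agree-btn">Accept All</button>
  <button class="trustarc-choose-btn">Choose Cookies</button>
  <button class="other-btn">Close</button>')

# starts-with() matches both trustarc buttons
doc %>%
  html_nodes(xpath = "//button[starts-with(@class, 'trustarc')]") %>%
  html_text()
# "Accept All" "Choose Cookies"

# contains() can target a substring anywhere in the attribute
doc %>%
  html_nodes(xpath = "//button[contains(@class, 'agree')]") %>%
  html_text()
# "Accept All"
```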
webElems <- remDr$findElements(using = "xpath", "//button[starts-with(@class, 'trustarc')]")
We can check that we got the proper buttons by inspecting the text of each element:
unlist(lapply(webElems, function(x) {x$getElementText()}))
## [1] "Accept All" "Choose Cookies"
We found the two buttons, and we want to click the first one:
webElems[[1]]$clickElement()
Sys.sleep(10) # wait for page loading
Now the best way to get the data is to extract the HTML from the browser and scrape it with rvest. Let’s get the table from the browser:
plouf <- remDr$findElements(using = "css", value = ".fbs-table")
table <- read_html(plouf[[1]]$getElementAttribute("outerHTML")[[1]]) # get html
And use rvest
to extract the rows of the table:
table %>% html_nodes(xpath = "//tr[@class='base ng-scope']")
## {xml_nodeset (25)}
## [1] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [2] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [3] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [4] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [5] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [6] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [7] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [8] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [9] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [10] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [11] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [12] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [13] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [14] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [15] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [16] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [17] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [18] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [19] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [20] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## ...
Here we have only the first 25 billionaires. As the table is dynamic, you need to scroll down to get more results (try it in your browser). We can do that with our standalone Firefox: I find the table and scroll down once using the sendKeysToElement
function:
webElem <- remDr$findElement("css", ".scrolly-table")
webElem$sendKeysToElement(list(key = "end"))
plouf <- remDr$findElements(using = "css", value = ".fbs-table")
Sys.sleep(1)
table <- read_html(plouf[[1]]$getElementAttribute("outerHTML")[[1]]) # get html
table %>% html_nodes(xpath = "//tr[@class='base ng-scope']")
## {xml_nodeset (50)}
## [1] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [2] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [3] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [4] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [5] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [6] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [7] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [8] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [9] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [10] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [11] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [12] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [13] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [14] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [15] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [16] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [17] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [18] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [19] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [20] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## ...
I now have 75 rows instead of the initial 25, and I get 50 more rows each time I scroll down. There are around 2100 billionaires, so we should be fine scrolling down 50 times:
library(futile.logger) # provides flog.info
webElem <- remDr$findElement("css", ".scrolly-table")
for(i in 1:50){
  flog.info("scroll %d", i)
  webElem$sendKeysToElement(list(key = "end"))
  Sys.sleep(3) # wait for the new rows to load
}
I then get the whole html of the table, to get all the lines:
plouf <- remDr$findElements(using = "css", value = ".fbs-table")
table <- plouf[[1]]$getElementAttribute("outerHTML")[[1]] # get html
# get all lines with attributes
From these rows, I extract the different fields:
data <- table %>% html_nodes(xpath = "//tr[@class='base ng-scope']")
name <- data %>% html_nodes(xpath = "//td[@class='name']") %>% html_text()
rank <- data %>% html_nodes(xpath = "//td[@class='rank']") %>% html_text()
money <- data %>% html_nodes(xpath = "//td[@class='Net Worth']") %>% html_text()
age <- data %>% html_nodes(xpath = "//td[@class='age']") %>% html_text()
source <- data %>% html_nodes(xpath = "//td[@class='source']") %>% html_text()
country <- data %>% html_nodes(xpath = "//td[@class='Country/Territory']") %>% html_text()
and build a data frame:
forbes2020 <- data.frame(name = name,
rank = rank,
money = money,
age = age,
source = source,
country = country)
head(forbes2020)
## name rank money age source country
## 1 Jeff Bezos 1 $176.8 B 57 Amazon United States
## 2 Bernard Arnault & family 2 $158.4 B 72 LVMH France
## 3 Elon Musk 3 $152.1 B 49 Tesla, SpaceX United States
## 4 Bill Gates 4 $126.9 B 65 Microsoft United States
## 5 Mark Zuckerberg 5 $101.9 B 36 Facebook United States
## 6 Warren Buffett 6 $97.1 B 90 Berkshire Hathaway United States
write.csv(forbes2020,file = "forbes2020.csv",row.names = F,na = "")
Get the Forbes list from 2010
To get old data, you can use the wonderful Web Archive: https://web.archive.org. You will see that in 2010 there was absolutely no JavaScript, so it will be much easier. The data is in an HTML table, and rvest
has a function to directly read and convert HTML tables:
url <- "https://web.archive.org/web/20110104234404/http://www.forbes.com/lists/2010/10/billionaires-2010_The-Worlds-Billionaires_Rank.html"
page <- read_html( url ) #read
tempdf <- page%>%
html_nodes(xpath = "//table") %>% # get table
html_table(header = T) %>%
.[[1]]
tempdf
## Rank Name Citizenship Age Net Worth ($bil)
## 1 1 Carlos Slim Helu & family Mexico 70 53.5
## 2 2 William Gates III United States 54 53.0
## 3 3 Warren Buffett United States 79 47.0
## 4 4 Mukesh Ambani India 52 29.0
## 5 5 Lakshmi Mittal India 59 28.7
## 6 6 Lawrence Ellison United States 65 28.0
## 7 7 Bernard Arnault France 61 27.5
## 8 8 Eike Batista Brazil 53 27.0
## 9 9 Amancio Ortega Spain 74 25.0
## 10 10 Karl Albrecht Germany 90 23.5
## 11 11 Ingvar Kamprad & family Sweden 83 23.0
## 12 12 Christy Walton & family United States 55 22.5
## 13 13 Stefan Persson Sweden 62 22.4
## 14 14 Li Ka-shing Hong Kong 81 21.0
## 15 15 Jim Walton United States 62 20.7
## 16 16 Alice Walton United States 60 20.6
## 17 17 Liliane Bettencourt France 87 20.0
## 18 18 S. Robson Walton United States 66 19.8
## 19 19 Prince Alwaleed Bin Talal Alsaud Saudi Arabia 55 19.4
## 20 20 David Thomson & family Canada 52 19.0
## 21 21 Michael Otto & family Germany 66 18.7
## 22 22 Lee Shau Kee Hong Kong 82 18.5
## 23 23 Michael Bloomberg United States 68 18.0
## 24 24 Sergey Brin United States 36 17.5
## 25 24 Charles Koch United States 74 17.5
## Residence
## 1 Mexico
## 2 United States
## 3 United States
## 4 India
## 5 United Kingdom
## 6 United States
## 7 France
## 8 Brazil
## 9 Spain
## 10 Germany
## 11 Switzerland
## 12 United States
## 13 Sweden
## 14 Hong Kong
## 15 United States
## 16 United States
## 17 France
## 18 United States
## 19 Saudi Arabia
## 20 Canada
## 21 Germany
## 22 Hong Kong
## 23 United States
## 24 United States
## 25 United States
Here we get the first 25 billionaires. Pages 2 to 38 can be accessed directly from the URL: page 2 is https://web.archive.org/web/20110104234404/http://www.forbes.com/lists/2010/10/billionaires-2010_The-Worlds-Billionaires_Rank_2.html. So we can construct the list of URLs, extract the table from each, and bind all the tables together:
url_base <- "https://web.archive.org/web/20110104234404/http://www.forbes.com/lists/2010/10/billionaires-2010_The-Worlds-Billionaires_Rank"
url_num <- c("",paste0("_",2:38))
# list of all billionaires tables
urls <- paste0(url_base,url_num,".html")
table_list <- lapply(urls,function(url){
page <- read_html( url ) #read
tempdf <- page%>%
html_nodes(xpath = "//table") %>% # get table
html_table(header = T) %>%
.[[1]]
Sys.sleep(10) # don't overuse web.archive
tempdf
})
You can then bind them:
forbes2010 <- do.call(rbind,table_list)
# save
write.csv(forbes2010,file = "forbes2010.csv",row.names = F,na = "")
To compare the two lists, you need to do a bit of cleaning and match the billionaires across lists. My two complete lists are here:
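As a starting point, here is a minimal cleaning sketch. Parsing the money column follows the "$176.8 B" format seen above; the name-normalisation rule (dropping "& family") is my assumption, not something done in this post:

```r
# Parse "$176.8 B" into a numeric value in billions
clean_money <- function(x) as.numeric(gsub("[$B ]", "", x))

# Normalise names before matching the two lists (assumed rule)
clean_name <- function(x) trimws(gsub("&\\s*family", "", x))

clean_money("$176.8 B")                 # 176.8
clean_name("Bernard Arnault & family")  # "Bernard Arnault"
```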
I will write another post on the cohort data management. That’s all for this time.