Web scraping with rvest and RSelenium

How to use rvest and RSelenium to get a list of billionaires

The idea is to show an example of using both rvest and RSelenium in R to do some web scraping. For rvest, you only need to install the library. For RSelenium, you will need to install the library, and you will also need a standalone version of Firefox, using either the Java standalone or the Docker standalone; see the RSelenium basics vignette. A third solution is to use the rsDriver function, but it has given me a lot of problems lately.
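The setup can be sketched as follows (the docker command is run in a terminal, not in R; it is the same command used later in this post):

```r
# install both packages from CRAN
install.packages(c("rvest", "RSelenium"))

# then, in a terminal (not in R), pull and run the standalone Firefox;
# this maps the container's port 4444 to port 4445 on the host:
# docker run -d -p 4445:4444 selenium/standalone-firefox
```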

Difference between rvest and RSelenium

rvest helps you read an HTML page and extract elements from it. The interactions you can have with the webpage include filling in forms and handling sessions with passwords. It is incredibly efficient and super handy, like almost all packages developed by Hadley Wickham.
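A minimal, self-contained sketch of the rvest workflow, parsing an inline HTML string instead of a live page (the string and class name are made up for the illustration):

```r
library(rvest)

# a small html document, inline, so the example is self-contained
html <- '<html><body><ul>
  <li class="item">rvest</li>
  <li class="item">RSelenium</li>
</ul></body></html>'

page <- read_html(html)

# css selector: all <li> nodes of class "item", then their text
page %>%
  html_nodes("li.item") %>%
  html_text()
# returns c("rvest", "RSelenium")
```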

So what is the point of RSelenium? You cannot interact with JavaScript with rvest. RSelenium gives you a web browser that you can control from your code, so you can do everything you would do in a web browser. It is intended for testing websites for loading times and problems.

I usually use only rvest, and use RSelenium when I need to interact with JavaScript. The Forbes website, with its billionaires list, will give us an example: https://www.forbes.com/real-time-billionaires

Get the present list of billionaires

Let’s try to get the list of billionaires.

library(data.table)
library(rvest)
## Loading required package: xml2
library(RSelenium)
## Warning: package 'RSelenium' was built under R version 3.6.3

Using the element inspector, we see that the table is a dynamic table, and that the data of interest are in table rows <tr> of class base ng-scope. We can therefore read the page and try to get the nodes corresponding to these rows:

url <- "https://www.forbes.com/real-time-billionaires"
page <- read_html(url)
page %>%
  html_nodes(xpath = "//tr[@class='base ng-scope']")
## {xml_nodeset (0)}

This is not working. Let’s have a look at what the page contains:

page
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n<div id="consent_blackbar"></div>\n\t<div id="teconsent">\n\t\t<s ...

The page is blocked by the cookie consent banner.

Here, it will be easier with RSelenium, because we have to click buttons. I use the Firefox standalone with Docker: I first run (in my computer’s terminal) docker run -d -p 4445:4444 selenium/standalone-firefox to expose the standalone Firefox on port 4445.

remDr <- remoteDriver(port = 4445L)
remDr$open()
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
## 
## $browserName
## [1] "firefox"
## 
## $browserVersion
## [1] "84.0.1"
## 
## $`moz:accessibilityChecks`
## [1] FALSE
## 
## $`moz:buildID`
## [1] "20201221152838"
## 
## $`moz:geckodriverVersion`
## [1] "0.28.0"
## 
## $`moz:headless`
## [1] FALSE
## 
## $`moz:processID`
## [1] 776
## 
## $`moz:profile`
## [1] "/tmp/rust_mozprofileNDHipn"
## 
## $`moz:shutdownTimeout`
## [1] 60000
## 
## $`moz:useNonSpecCompliantPointerOrigin`
## [1] FALSE
## 
## $`moz:webdriverClick`
## [1] TRUE
## 
## $pageLoadStrategy
## [1] "normal"
## 
## $platformName
## [1] "linux"
## 
## $platformVersion
## [1] "4.19.121-linuxkit"
## 
## $rotatable
## [1] FALSE
## 
## $setWindowRect
## [1] TRUE
## 
## $strictFileInteractability
## [1] FALSE
## 
## $timeouts
## $timeouts$implicit
## [1] 0
## 
## $timeouts$pageLoad
## [1] 300000
## 
## $timeouts$script
## [1] 30000
## 
## 
## $unhandledPromptBehavior
## [1] "dismiss and notify"
## 
## $webdriver.remote.sessionid
## [1] "31281786-c20e-4dd3-a0fa-f0ba23a86cfb"
## 
## $id
## [1] "31281786-c20e-4dd3-a0fa-f0ba23a86cfb"
url <- "https://www.forbes.com/real-time-billionaires"
remDr$navigate(url)

You can get a screenshot of what the web browser sees, which is useful if you don’t have an actual browser window on your computer (you can see the browser if you use the Java standalone or rsDriver, but not with Docker):

remDr$screenshot(display = T)

I then want to find the button, which has the class trustarc-agree-btn. I use XPath a lot, because you can use functions like starts-with or contains, which help you easily get the parts you want. See this great tutorial on W3Schools, and a list of functions on this page from Mozilla.
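To illustrate these two functions outside the live page, here is a self-contained sketch on an inline HTML snippet that mimics the consent banner (the markup is made up for the illustration, only the class names mirror the Forbes page):

```r
library(rvest)

html <- '<div>
  <button class="trustarc-agree-btn">Accept All</button>
  <button class="trustarc-choose-btn">Choose Cookies</button>
  <button class="other-btn">Close</button>
</div>'
page <- read_html(html)

# starts-with: both trustarc buttons match
page %>%
  html_nodes(xpath = "//button[starts-with(@class, 'trustarc')]") %>%
  html_text()
# returns c("Accept All", "Choose Cookies")

# contains: only the agree button matches
page %>%
  html_nodes(xpath = "//button[contains(@class, 'agree')]") %>%
  html_text()
# returns "Accept All"
```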

webElems <- remDr$findElements(using = "xpath", "//button[starts-with(@class, 'trustarc')]")

We can check that we got the proper buttons by checking the text of the elements:

unlist(lapply(webElems, function(x) {x$getElementText()}))
## [1] "Accept All"     "Choose Cookies"

We found the two buttons, and we want to click the first one:

webElems[[1]]$clickElement()
Sys.sleep(10) # wait for page loading

Now, the best way to get the data is to extract the HTML from the browser and scrape it with rvest. Let’s get the table from the browser:

plouf <- remDr$findElements(using = "css", value = ".fbs-table")
table <- read_html(plouf[[1]]$getElementAttribute("outerHTML")[[1]]) # get html

And use rvest to extract the lines of the table:

table  %>% html_nodes(xpath = "//tr[@class='base ng-scope']")
## {xml_nodeset (25)}
##  [1] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [2] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [3] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [4] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [5] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [6] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [7] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [8] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [9] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [10] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [11] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [12] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [13] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [14] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [15] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [16] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [17] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [18] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [19] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [20] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## ...

We have here only the first 25 billionaires. As the table is a dynamic table, you need to scroll down to get more results (try it in your web browser). We can do that with our standalone Firefox: I find the table and scroll down once using the sendKeysToElement function:

webElem <- remDr$findElement("css", ".scrolly-table")
webElem$sendKeysToElement(list(key = "end"))
plouf <- remDr$findElements(using = "css", value = ".fbs-table")
Sys.sleep(1)
table <- read_html(plouf[[1]]$getElementAttribute("outerHTML")[[1]]) # get html
table  %>% html_nodes(xpath = "//tr[@class='base ng-scope']")
## {xml_nodeset (50)}
##  [1] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [2] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [3] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [4] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [5] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [6] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [7] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [8] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
##  [9] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [10] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [11] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [12] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [13] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [14] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [15] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [16] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [17] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [18] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [19] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## [20] <tr class="base ng-scope" ng-repeat-start="user in $data" ng-if="($index ...
## ...

The table now has 50 rows, instead of the 25 at the beginning: each scroll down loads an extra batch of rows. There are around 2100 billionaires, so we should be OK scrolling down 50 times:

library(futile.logger) # provides flog.info, used for logging below

webElem <- remDr$findElement("css", ".scrolly-table")
for (i in 1:50) {
  flog.info("scroll %d", i)
  webElem$sendKeysToElement(list(key = "end"))
  Sys.sleep(3) # let the new rows load
}

I then get the whole HTML of the table and parse it, to access all the rows:

plouf <- remDr$findElements(using = "css", value = ".fbs-table")
table <- read_html(plouf[[1]]$getElementAttribute("outerHTML")[[1]]) # parse the html

From these rows, I extract the different pieces of information:

data <- table %>% html_nodes(xpath = "//tr[@class='base ng-scope']")

name    <- data %>% html_nodes(xpath = "//td[@class='name']") %>% html_text()
rank    <- data %>% html_nodes(xpath = "//td[@class='rank']") %>% html_text()
money   <- data %>% html_nodes(xpath = "//td[@class='Net Worth']") %>% html_text()
age     <- data %>% html_nodes(xpath = "//td[@class='age']") %>% html_text()
source  <- data %>% html_nodes(xpath = "//td[@class='source']") %>% html_text()
country <- data %>% html_nodes(xpath = "//td[@class='Country/Territory']") %>% html_text()

and put everything in a data frame:

forbes2020 <- data.frame(name = name,
                   rank = rank,
                   money = money,
                   age = age,
                   source = source,
                   country = country)
head(forbes2020)
##                       name rank    money age             source       country
## 1               Jeff Bezos    1 $176.8 B  57             Amazon United States
## 2 Bernard Arnault & family    2 $158.4 B  72               LVMH        France
## 3                Elon Musk    3 $152.1 B  49      Tesla, SpaceX United States
## 4               Bill Gates    4 $126.9 B  65          Microsoft United States
## 5          Mark Zuckerberg    5 $101.9 B  36           Facebook United States
## 6           Warren Buffett    6  $97.1 B  90 Berkshire Hathaway United States
write.csv(forbes2020,file = "forbes2020.csv",row.names = F,na = "")

Get the Forbes list from 2010

To get old data, you can use the wonderful Web Archive: https://web.archive.org. And you will see that in 2010 there was absolutely no JavaScript, so it will be much easier. The data is in an HTML table, and rvest has a function to directly read and convert HTML tables:

url <- "https://web.archive.org/web/20110104234404/http://www.forbes.com/lists/2010/10/billionaires-2010_The-Worlds-Billionaires_Rank.html"
page <- read_html( url ) #read
tempdf <- page%>% 
  html_nodes(xpath = "//table") %>% # get table
  html_table(header = T) %>% 
  .[[1]]
tempdf
##    Rank                             Name   Citizenship Age Net Worth ($bil)
## 1     1        Carlos Slim Helu & family        Mexico  70             53.5
## 2     2                William Gates III United States  54             53.0
## 3     3                   Warren Buffett United States  79             47.0
## 4     4                    Mukesh Ambani         India  52             29.0
## 5     5                   Lakshmi Mittal         India  59             28.7
## 6     6                 Lawrence Ellison United States  65             28.0
## 7     7                  Bernard Arnault        France  61             27.5
## 8     8                     Eike Batista        Brazil  53             27.0
## 9     9                   Amancio Ortega         Spain  74             25.0
## 10   10                    Karl Albrecht       Germany  90             23.5
## 11   11          Ingvar Kamprad & family        Sweden  83             23.0
## 12   12          Christy Walton & family United States  55             22.5
## 13   13                   Stefan Persson        Sweden  62             22.4
## 14   14                      Li Ka-shing     Hong Kong  81             21.0
## 15   15                       Jim Walton United States  62             20.7
## 16   16                     Alice Walton United States  60             20.6
## 17   17              Liliane Bettencourt        France  87             20.0
## 18   18                 S. Robson Walton United States  66             19.8
## 19   19 Prince Alwaleed Bin Talal Alsaud  Saudi Arabia  55             19.4
## 20   20           David Thomson & family        Canada  52             19.0
## 21   21            Michael Otto & family       Germany  66             18.7
## 22   22                     Lee Shau Kee     Hong Kong  82             18.5
## 23   23                Michael Bloomberg United States  68             18.0
## 24   24                      Sergey Brin United States  36             17.5
## 25   24                     Charles Koch United States  74             17.5
##         Residence
## 1          Mexico
## 2   United States
## 3   United States
## 4           India
## 5  United Kingdom
## 6   United States
## 7          France
## 8          Brazil
## 9           Spain
## 10        Germany
## 11    Switzerland
## 12  United States
## 13         Sweden
## 14      Hong Kong
## 15  United States
## 16  United States
## 17         France
## 18  United States
## 19   Saudi Arabia
## 20         Canada
## 21        Germany
## 22      Hong Kong
## 23  United States
## 24  United States
## 25  United States

Here we get the first 25 billionaires. Pages 2 to 38 can be accessed directly from the URL: page 2 is https://web.archive.org/web/20110104234404/http://www.forbes.com/lists/2010/10/billionaires-2010_The-Worlds-Billionaires_Rank_2.html. So we can construct a list of URLs, extract the table from each, and bind all the tables together:

url_base <- "https://web.archive.org/web/20110104234404/http://www.forbes.com/lists/2010/10/billionaires-2010_The-Worlds-Billionaires_Rank"
url_num <- c("",paste0("_",2:38))
# list of all billionaires tables
urls <-  paste0(url_base,url_num,".html")
table_list <- lapply(urls,function(url){
  page <- read_html( url ) #read
  tempdf <- page%>% 
    html_nodes(xpath = "//table") %>% # get table
    html_table(header = T) %>% 
    .[[1]]
  Sys.sleep(10) # don't overuse web.archive
  tempdf
})

You can then bind them:

forbes2010 <- do.call(rbind,table_list)
# save
write.csv(forbes2010,file = "forbes2010.csv",row.names = F,na = "")

To compare the two lists, you need to do a bit of cleaning and match the billionaires between the two editions. My two complete lists are available for download.
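As a starting point, a possible cleaning step could look like the sketch below. It assumes the two csv files written above; the column names (money, name in 2020; Name in 2010) match the tables built in this post, but the name matching is deliberately naive (e.g. “William Gates III” in 2010 vs “Bill Gates” in 2020 will not match and would need manual work):

```r
library(data.table)

forbes2020 <- fread("forbes2020.csv")
forbes2010 <- fread("forbes2010.csv")

# clean the 2020 net worth column: "$176.8 B" -> 176.8
forbes2020[, money_bil := as.numeric(gsub("[$B ]", "", money))]

# drop the "& family" suffixes before matching names
forbes2020[, clean_name := gsub(" & family", "", name)]
forbes2010[, clean_name := gsub(" & family", "", Name)]

# naive join on the cleaned name; unmatched billionaires are dropped
both <- merge(forbes2020, forbes2010,
              by = "clean_name", suffixes = c("_2020", "_2010"))
```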

I will do another post for the cohort data management. That’s all for this time.

Denis Mongin
Physicist, Data scientist
