Read html tables with cells spanning on multiple rows

A post to detail a function I created to answer a question on stackoverflow. My proposition overcomes a limitation of the read_html function of rvest package. The idea behind is a bit tricky, and the solution uses rvest itself. I detail it here, because it is a good example of rvest use.

The problem

rvest is an incredible package, but the read_html function cannot read properly tables with cells spanning on multiple rows. There are plenty in wikipedia, and the example was this one:

Here is what the read_html function gets:

library(rvest)
## Le chargement a nécessité le package : xml2
library(flextable)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
parks <- url %>%
  read_html() %>%
  html_nodes(xpath='/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
  html_table(fill=TRUE) %>%  # fill=FALSE yields the same results
  .[[1]]
flextable(parks) %>% theme_box()

Division

Team

MLB Affiliation

City

Stadium

Capacity

East

Arizona League Angels

Los Angeles Angels

Tempe

Tempe Diablo Stadium

9,785

East

Arizona League Athletics Gold

Oakland Athletics

Mesa

Fitch Park

10,000

East

Arizona League Athletics Gold

Oakland Athletics

Mesa

Fitch Park

Arizona League Athletics Green

East

Arizona League Cubs 1

Oakland Athletics

Chicago Cubs

Sloan Park

15,000

East

Arizona League Cubs 1

Oakland Athletics

Chicago Cubs

Oakland Athletics

Arizona League Cubs 2

East

Arizona League Diamondbacks

Arizona Diamondbacks

Scottsdale

Salt River Fields at Talking Stick

11,000

East

Arizona League Giants Black

Arizona Diamondbacks

San Francisco Giants

Scottsdale Stadium

12,000

East

Arizona League Giants Black

Arizona Diamondbacks

San Francisco Giants

Arizona Diamondbacks

Arizona League Giants Orange

Central

Arizona League Brewers Gold

Milwaukee Brewers

Phoenix

American Family Fields of Phoenix

8,000

Central

Arizona League Dodgers Lasorda

Los Angeles Dodgers

Phoenix

Camelback Ranch

12,000

Central

Arizona League Indians Blue

Cleveland Indians

Goodyear

Goodyear Ballpark

10,000

Central

Arizona League Padres 2

San Diego Padres

Peoria

Peoria Sports Complex

12,882

Central

Arizona League Reds

Cincinnati Reds

Goodyear

Goodyear Ballpark

10,000

Central

Arizona League White Sox

Chicago White Sox

Phoenix

Camelback Ranch

12,000

West

Arizona League Brewers Blue

Milwaukee Brewers

Phoenix

American Family Fields of Phoenix

8,000

West

Arizona League Dodgers Mota

Los Angeles Dodgers

Phoenix

Camelback Ranch

12,000

West

Arizona League Indians Red

Cleveland Indians

Goodyear

Goodyear Ballpark

10,000

West

Arizona League Mariners

Seattle Mariners

Peoria

Peoria Sports Complex

12,882

West

Arizona League Padres 1

Seattle Mariners

Peoria

Peoria Sports Complex

San Diego Padres

West

Arizona League Rangers

Texas Rangers

Surprise

Surprise Stadium

10,500

West

Arizona League Royals

Texas Rangers

Surprise

Surprise Stadium

Kansas City Royals

You can see several errors: line 4 and 5, the coty should be Mesa, line 5, the stadium should be Sloan Park, last column as info of other columns and so on.

The solution I propose uses the package rvest itself. Let’s inspect the table by extracting the lines attributes of the table:

library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"

# get the lines of the table
lines <- url %>%
  read_html() %>%
  html_nodes(xpath='/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
  html_nodes(xpath = 'tbody/tr')
lines
## {xml_nodeset (22)}
##  [1] <tr>\n<th>Division\n</th>\n<th>Team\n</th>\n<th>MLB Affiliation\n</ ...
##  [2] <tr>\n<th rowspan="8">East\n</th>\n<td>\n<b><a href="/wiki/Arizona_ ...
##  [3] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Athletics" title="Ariz ...
##  [4] <tr><td>\n<b><a href="/wiki/Arizona_League_Athletics" title="Arizon ...
##  [5] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Cubs" title="Arizona L ...
##  [6] <tr><td>\n<b><a href="/wiki/Arizona_League_Cubs" title="Arizona Lea ...
##  [7] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Diamondbacks" title="A ...
##  [8] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Giants" title="Arizona ...
##  [9] <tr><td>\n<b><a href="/wiki/Arizona_League_Giants" title="Arizona L ...
## [10] <tr>\n<th rowspan="6">Central\n</th>\n<td>\n<b><a href="/wiki/Arizo ...
## [11] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Dodgers" title="Arizon ...
## [12] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Indians" title="Arizon ...
## [13] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Padres" title="Arizona ...
## [14] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Reds" title="Arizona L ...
## [15] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_White_Sox" title="Ariz ...
## [16] <tr>\n<th rowspan="7">West\n</th>\n<td>\n<b><a href="/wiki/Arizona_ ...
## [17] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Dodgers" title="Arizon ...
## [18] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Indians" title="Arizon ...
## [19] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Mariners" title="Arizo ...
## [20] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Padres" title="Arizona ...
## ...

The table has 20 lines, and the first line contains the column names. Its length gives the html table’s number of columns

#define the empty table
ncol <-  lines %>%
  .[[1]] %>%
  html_children()%>%
  length()

The number of rows is given by the length of the tr xml nodes minus 1, because the first line are the titles

nrow <- length(lines)-1

Let’s create our output data.frame which will contain our scrapped table:

table <- as.data.frame(matrix(nrow = nrow,ncol = ncol))

To get the content of a line, we get each line children’s text and remove the \n line returns. Here for the column names:

lines[[1]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\n","",.)
## [1] "Division"        "Team"            "MLB Affiliation" "City"           
## [5] "Stadium"         "Capacity"

I can thus set the names of my scrapped table:

names(table) <- lines[[1]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\n","",.)

Let us have a look at the first and second line of the html table to understand how the cells spanning multiple lines are coded

lines[[2]]%>%
    html_children()
## {xml_nodeset (6)}
## [1] <th rowspan="8">East\n</th>
## [2] <td>\n<b><a href="/wiki/Arizona_League_Angels" title="Arizona League ...
## [3] <td>\n<a href="/wiki/Los_Angeles_Angels" title="Los Angeles Angels"> ...
## [4] <td>\n<a href="/wiki/Tempe,_Arizona" title="Tempe, Arizona">Tempe</a ...
## [5] <td>\n<a href="/wiki/Tempe_Diablo_Stadium" title="Tempe Diablo Stadi ...
## [6] <td align="right">9,785\n</td>

Here we have an attribute rowspan telling us that the first column (Division) spans on height rows. Let us have a look at the second line:

lines[[3]]%>%
    html_children()
## {xml_nodeset (5)}
## [1] <td>\n<b><a href="/wiki/Arizona_League_Athletics" title="Arizona Lea ...
## [2] <td rowspan="2">\n<a href="/wiki/Oakland_Athletics" title="Oakland A ...
## [3] <td rowspan="4">\n<a href="/wiki/Mesa,_Arizona" title="Mesa, Arizona ...
## [4] <td rowspan="2">\n<a href="/wiki/Fitch_Park" title="Fitch Park">Fitc ...
## [5] <td rowspan="2" align="right">10,000\n</td>

The second line of the table only have 5 columns, because the first columns is defined on the previous line and spanning on multiple rows. So the second line of the first table must be filled when reading the first line. The idea is thus to fill all the rows a column spanns on when we have the info.

Lets have an example for the first two lines:

The first line

The content of the first line is:

linecontent <- lines[[2]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\n","",.)
linecontent
## [1] "East"                  "Arizona League Angels" "Los Angeles Angels"   
## [4] "Tempe"                 "Tempe Diablo Stadium"  "9,785"

and the number of line each column spans on is:

repetition <- lines[[2]]%>%
  html_children()%>%
  html_attr("rowspan")%>%
  ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
  as.numeric
repetition
## [1] 8 1 1 1 1 1

We see that the first column spans on 8 rows. All the idea is to fill already all the lines this column will spann on in our output table:

for(j in 1:length(repetition)){
  span <- repetition[j]
  table[(1):(span),j] <- rep(linecontent[j],span)
}
table %>% flextable()%>% theme_box()

Division

Team

MLB Affiliation

City

Stadium

Capacity

East

Arizona League Angels

Los Angeles Angels

Tempe

Tempe Diablo Stadium

9,785

East

East

East

East

East

East

East

The second line

Now let us have a look at the second line

linecontent <- lines[[3]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\n","",.)
linecontent
## [1] "Arizona League Athletics Gold" "Oakland Athletics"            
## [3] "Mesa"                          "Fitch Park"                   
## [5] "10,000"

I have only 5 values, because the first column spans on several rows. I can proceed exactly as for the first row, but I must select the column with missing values in my output table, in order to have a vector of the same length as the info given by the html object:

colselect <- is.na(table[2,])
colselect
##   Division Team MLB Affiliation City Stadium Capacity
## 2    FALSE TRUE            TRUE TRUE    TRUE     TRUE

We get the number of line repetition for the concerned columnc

repetition <- lines[[3]]%>%
    html_children()%>%
    html_attr("rowspan")%>%
    ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
    as.numeric
repetition
## [1] 1 2 4 2 2

and fill the empty columns, without forgetting to repeat them on several line if needed:

for(j in 1:length(repetition)){
  span <- repetition[j]
  table[(2):(1+span),colselect][,j] <- rep(linecontent[j],span)
}
table %>% flextable() %>% theme_box()

Division

Team

MLB Affiliation

City

Stadium

Capacity

East

Arizona League Angels

Los Angeles Angels

Tempe

Tempe Diablo Stadium

9,785

East

Arizona League Athletics Gold

Oakland Athletics

Mesa

Fitch Park

10,000

East

Oakland Athletics

Mesa

Fitch Park

10,000

East

Mesa

East

Mesa

East

East

East

Line three of the html table gives only one value

lines[[4]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\n","",.)
## [1] "Arizona League Athletics Green"

and it is perfect, because I only have one missing value in my output table.

The solution

We can repeat the process on each lines:

table <- as.data.frame(matrix(nrow = nrow,ncol = ncol))
names(table) <- lines[[1]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\n","",.)
# fill the table
for(i in 1:nrow){
  # get content of the line
  linecontent <- lines[[i+1]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\n","",.)

# get the line repetition of each columns
  repetition <- lines[[i+1]]%>%
    html_children()%>%
    html_attr("rowspan")%>%
    ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
    as.numeric

  #select only free columns
  colselect <- is.na(table[i,])

  
 # repeat the cells of the multiple rows down
  for(j in 1:length(repetition)){
    span <- repetition[j]
    if(sum(colselect)>1){
    table[(i):(i+span-1),colselect][,j] <- rep(linecontent[j],span)  
    }else{table[(i):(i+span-1),colselect] <- rep(linecontent[j],span)  }
    
  }
}
table %>%
  flextable() %>% theme_box()

Division

Team

MLB Affiliation

City

Stadium

Capacity

East

Arizona League Angels

Los Angeles Angels

Tempe

Tempe Diablo Stadium

9,785

East

Arizona League Athletics Gold

Oakland Athletics

Mesa

Fitch Park

10,000

East

Arizona League Athletics Green

Oakland Athletics

Mesa

Fitch Park

10,000

East

Arizona League Cubs 1

Chicago Cubs

Mesa

Sloan Park

15,000

East

Arizona League Cubs 2

Chicago Cubs

Mesa

Sloan Park

15,000

East

Arizona League Diamondbacks

Arizona Diamondbacks

Scottsdale

Salt River Fields at Talking Stick

11,000

East

Arizona League Giants Black

San Francisco Giants

Scottsdale

Scottsdale Stadium

12,000

East

Arizona League Giants Orange

San Francisco Giants

Scottsdale

Scottsdale Stadium

12,000

Central

Arizona League Brewers Gold

Milwaukee Brewers

Phoenix

American Family Fields of Phoenix

8,000

Central

Arizona League Dodgers Lasorda

Los Angeles Dodgers

Phoenix

Camelback Ranch

12,000

Central

Arizona League Indians Blue

Cleveland Indians

Goodyear

Goodyear Ballpark

10,000

Central

Arizona League Padres 2

San Diego Padres

Peoria

Peoria Sports Complex

12,882

Central

Arizona League Reds

Cincinnati Reds

Goodyear

Goodyear Ballpark

10,000

Central

Arizona League White Sox

Chicago White Sox

Phoenix

Camelback Ranch

12,000

West

Arizona League Brewers Blue

Milwaukee Brewers

Phoenix

American Family Fields of Phoenix

8,000

West

Arizona League Dodgers Mota

Los Angeles Dodgers

Phoenix

Camelback Ranch

12,000

West

Arizona League Indians Red

Cleveland Indians

Goodyear

Goodyear Ballpark

10,000

West

Arizona League Mariners

Seattle Mariners

Peoria

Peoria Sports Complex

12,882

West

Arizona League Padres 1

San Diego Padres

Peoria

Peoria Sports Complex

12,882

West

Arizona League Rangers

Texas Rangers

Surprise

Surprise Stadium

10,500

West

Arizona League Royals

Kansas City Royals

Surprise

Surprise Stadium

10,500

Et voilà !

Avatar
Denis Mongin
Physicist, Data scientist

Related