Read html tables with cells spanning on multiple rows
A post to detail a function I created to answer a question on stackoverflow. My proposition overcomes a limitation of the read_html
function of rvest
package. The idea behind is a bit tricky, and the solution uses rvest
itself. I detail it here, because it is a good example of rvest
use.
The problem
rvest
is an incredible package, but the read_html
function cannot read properly tables with cells spanning on multiple rows.
There are plenty in wikipedia, and the example was this one:
Here is what the read_html
function gets:
library(rvest)
## Le chargement a nécessité le package : xml2
library(flextable)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
parks <- url %>%
read_html() %>%
html_nodes(xpath="//table[starts-with(@class, 'wikitable')]") %>%
html_table(fill=TRUE) %>% # fill=FALSE yields the same results
.[[1]]
flextable(parks) %>% theme_box()
Division | Team | MLB Affiliation | City | Stadium | Capacity |
East | Arizona League Angels | Los Angeles Angels | Tempe | Tempe Diablo Stadium | 9,785 |
East | Arizona League Athletics Gold | Oakland Athletics | Mesa | Fitch Park | 10,000 |
East | Arizona League Athletics Gold | Oakland Athletics | Mesa | Fitch Park | Arizona League Athletics Green |
East | Arizona League Cubs 1 | Oakland Athletics | Chicago Cubs | Sloan Park | 15,000 |
East | Arizona League Cubs 1 | Oakland Athletics | Chicago Cubs | Oakland Athletics | Arizona League Cubs 2 |
East | Arizona League Diamondbacks | Arizona Diamondbacks | Scottsdale | Salt River Fields at Talking Stick | 11,000 |
East | Arizona League Giants Black | Arizona Diamondbacks | San Francisco Giants | Scottsdale Stadium | 12,000 |
East | Arizona League Giants Black | Arizona Diamondbacks | San Francisco Giants | Arizona Diamondbacks | Arizona League Giants Orange |
Central | Arizona League Brewers Gold | Milwaukee Brewers | Phoenix | American Family Fields of Phoenix | 8,000 |
Central | Arizona League Dodgers Lasorda | Los Angeles Dodgers | Phoenix | Camelback Ranch | 12,000 |
Central | Arizona League Indians Blue | Cleveland Indians | Goodyear | Goodyear Ballpark | 10,000 |
Central | Arizona League Padres 2 | San Diego Padres | Peoria | Peoria Sports Complex | 12,882 |
Central | Arizona League Reds | Cincinnati Reds | Goodyear | Goodyear Ballpark | 10,000 |
Central | Arizona League White Sox | Chicago White Sox | Phoenix | Camelback Ranch | 12,000 |
West | Arizona League Brewers Blue | Milwaukee Brewers | Phoenix | American Family Fields of Phoenix | 8,000 |
West | Arizona League Dodgers Mota | Los Angeles Dodgers | Phoenix | Camelback Ranch | 12,000 |
West | Arizona League Indians Red | Cleveland Indians | Goodyear | Goodyear Ballpark | 10,000 |
West | Arizona League Mariners | Seattle Mariners | Peoria | Peoria Sports Complex | 12,882 |
West | Arizona League Padres 1 | Seattle Mariners | Peoria | Peoria Sports Complex | San Diego Padres |
West | Arizona League Rangers | Texas Rangers | Surprise | Surprise Stadium | 10,500 |
West | Arizona League Royals | Texas Rangers | Surprise | Surprise Stadium | Kansas City Royals |
You can see several errors: line 4 and 5, the coty should be Mesa, line 5, the stadium should be Sloan Park, last column as info of other columns and so on.
The solution I propose uses the package rvest
itself.
Let’s inspect the table by extracting the lines attributes of the table:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
# get the lines of the table
lines <- url %>%
read_html() %>%
html_nodes(xpath="//table[starts-with(@class, 'wikitable')]") %>%
html_nodes(xpath = 'tbody/tr')
lines
## {xml_nodeset (22)}
## [1] <tr>\n<th>Division\n</th>\n<th>Team\n</th>\n<th>MLB Affiliation\n</th>\n ...
## [2] <tr>\n<th rowspan="8">East\n</th>\n<td>\n<b><a href="/wiki/Arizona_Leagu ...
## [3] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Athletics" title="Arizona L ...
## [4] <tr><td>\n<b><a href="/wiki/Arizona_League_Athletics" title="Arizona Lea ...
## [5] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Cubs" title="Arizona League ...
## [6] <tr><td>\n<b><a href="/wiki/Arizona_League_Cubs" title="Arizona League C ...
## [7] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Diamondbacks" title="Arizon ...
## [8] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Giants" title="Arizona Leag ...
## [9] <tr><td>\n<b><a href="/wiki/Arizona_League_Giants" title="Arizona League ...
## [10] <tr>\n<th rowspan="6">Central\n</th>\n<td>\n<b><a href="/wiki/Arizona_Le ...
## [11] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Dodgers" title="Arizona Lea ...
## [12] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Indians" title="Arizona Lea ...
## [13] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Padres" title="Arizona Leag ...
## [14] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Reds" title="Arizona League ...
## [15] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_White_Sox" title="Arizona L ...
## [16] <tr>\n<th rowspan="7">West\n</th>\n<td>\n<b><a href="/wiki/Arizona_Leagu ...
## [17] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Dodgers" title="Arizona Lea ...
## [18] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Indians" title="Arizona Lea ...
## [19] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Mariners" title="Arizona Le ...
## [20] <tr>\n<td>\n<b><a href="/wiki/Arizona_League_Padres" title="Arizona Leag ...
## ...
The table has 20 lines, and the first line contains the column names. Its length gives the html table’s number of columns
#define the empty table
ncol <- lines %>%
.[[1]] %>%
html_children()%>%
length()
The number of rows is given by the length of the tr
xml nodes minus 1, because the first line are the titles
nrow <- length(lines)-1
Let’s create our output data.frame
which will contain our scrapped table:
table <- as.data.frame(matrix(nrow = nrow,ncol = ncol))
To get the content of a line, we get each line children’s text and remove the \n
line returns. Here for the column names:
lines[[1]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
## [1] "Division" "Team" "MLB Affiliation" "City"
## [5] "Stadium" "Capacity"
I can thus set the names of my scrapped table:
names(table) <- lines[[1]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
Let us have a look at the first and second line of the html table to understand how the cells spanning multiple lines are coded
lines[[2]]%>%
html_children()
## {xml_nodeset (6)}
## [1] <th rowspan="8">East\n</th>
## [2] <td>\n<b><a href="/wiki/Arizona_League_Angels" title="Arizona League Ange ...
## [3] <td>\n<a href="/wiki/Los_Angeles_Angels" title="Los Angeles Angels">Los A ...
## [4] <td>\n<a href="/wiki/Tempe,_Arizona" title="Tempe, Arizona">Tempe</a>\n</td>
## [5] <td>\n<a href="/wiki/Tempe_Diablo_Stadium" title="Tempe Diablo Stadium">T ...
## [6] <td align="right">9,785\n</td>
Here we have an attribute rowspan
telling us that the first column (Division) spans on height rows. Let us have a look at the second line:
lines[[3]]%>%
html_children()
## {xml_nodeset (5)}
## [1] <td>\n<b><a href="/wiki/Arizona_League_Athletics" title="Arizona League A ...
## [2] <td rowspan="2">\n<a href="/wiki/Oakland_Athletics" title="Oakland Athlet ...
## [3] <td rowspan="4">\n<a href="/wiki/Mesa,_Arizona" title="Mesa, Arizona">Mes ...
## [4] <td rowspan="2">\n<a href="/wiki/Fitch_Park" title="Fitch Park">Fitch Par ...
## [5] <td rowspan="2" align="right">10,000\n</td>
The second line of the table only have 5 columns, because the first columns is defined on the previous line and spanning on multiple rows. So the second line of the first table must be filled when reading the first line. The idea is thus to fill all the rows a column spanns on when we have the info.
Lets have an example for the first two lines:
The first line
The content of the first line is:
linecontent <- lines[[2]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
linecontent
## [1] "East" "Arizona League Angels" "Los Angeles Angels"
## [4] "Tempe" "Tempe Diablo Stadium" "9,785"
and the number of line each column spans on is:
repetition <- lines[[2]]%>%
html_children()%>%
html_attr("rowspan")%>%
ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
as.numeric
repetition
## [1] 8 1 1 1 1 1
We see that the first column spans on 8 rows. All the idea is to fill already all the lines this column will spann on in our output table:
for(j in 1:length(repetition)){
span <- repetition[j]
table[(1):(span),j] <- rep(linecontent[j],span)
}
table %>% flextable()%>% theme_box()
Division | Team | MLB Affiliation | City | Stadium | Capacity |
East | Arizona League Angels | Los Angeles Angels | Tempe | Tempe Diablo Stadium | 9,785 |
East | |||||
East | |||||
East | |||||
East | |||||
East | |||||
East | |||||
East | |||||
The second line
Now let us have a look at the second line
linecontent <- lines[[3]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
linecontent
## [1] "Arizona League Athletics Gold" "Oakland Athletics"
## [3] "Mesa" "Fitch Park"
## [5] "10,000"
I have only 5 values, because the first column spans on several rows. I can proceed exactly as for the first row, but I must select the column with missing values in my output table, in order to have a vector of the same length as the info given by the html object:
colselect <- is.na(table[2,])
colselect
## Division Team MLB Affiliation City Stadium Capacity
## 2 FALSE TRUE TRUE TRUE TRUE TRUE
We get the number of line repetition for the concerned columnc
repetition <- lines[[3]]%>%
html_children()%>%
html_attr("rowspan")%>%
ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
as.numeric
repetition
## [1] 1 2 4 2 2
and fill the empty columns, without forgetting to repeat them on several line if needed:
for(j in 1:length(repetition)){
span <- repetition[j]
table[(2):(1+span),colselect][,j] <- rep(linecontent[j],span)
}
table %>% flextable() %>% theme_box()
Division | Team | MLB Affiliation | City | Stadium | Capacity |
East | Arizona League Angels | Los Angeles Angels | Tempe | Tempe Diablo Stadium | 9,785 |
East | Arizona League Athletics Gold | Oakland Athletics | Mesa | Fitch Park | 10,000 |
East | Oakland Athletics | Mesa | Fitch Park | 10,000 | |
East | Mesa | ||||
East | Mesa | ||||
East | |||||
East | |||||
East | |||||
Line three of the html table gives only one value
lines[[4]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
## [1] "Arizona League Athletics Green"
and it is perfect, because I only have one missing value in my output table.
The solution
We can repeat the process on each lines:
table <- as.data.frame(matrix(nrow = nrow,ncol = ncol))
names(table) <- lines[[1]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
# fill the table
for(i in 1:nrow){
# get content of the line
linecontent <- lines[[i+1]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
# get the line repetition of each columns
repetition <- lines[[i+1]]%>%
html_children()%>%
html_attr("rowspan")%>%
ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
as.numeric
#select only free columns
colselect <- is.na(table[i,])
# repeat the cells of the multiple rows down
for(j in 1:length(repetition)){
span <- repetition[j]
if(sum(colselect)>1){
table[(i):(i+span-1),colselect][,j] <- rep(linecontent[j],span)
}else{table[(i):(i+span-1),colselect] <- rep(linecontent[j],span) }
}
}
table %>%
flextable() %>% theme_box()
Division | Team | MLB Affiliation | City | Stadium | Capacity |
East | Arizona League Angels | Los Angeles Angels | Tempe | Tempe Diablo Stadium | 9,785 |
East | Arizona League Athletics Gold | Oakland Athletics | Mesa | Fitch Park | 10,000 |
East | Arizona League Athletics Green | Oakland Athletics | Mesa | Fitch Park | 10,000 |
East | Arizona League Cubs 1 | Chicago Cubs | Mesa | Sloan Park | 15,000 |
East | Arizona League Cubs 2 | Chicago Cubs | Mesa | Sloan Park | 15,000 |
East | Arizona League Diamondbacks | Arizona Diamondbacks | Scottsdale | Salt River Fields at Talking Stick | 11,000 |
East | Arizona League Giants Black | San Francisco Giants | Scottsdale | Scottsdale Stadium | 12,000 |
East | Arizona League Giants Orange | San Francisco Giants | Scottsdale | Scottsdale Stadium | 12,000 |
Central | Arizona League Brewers Gold | Milwaukee Brewers | Phoenix | American Family Fields of Phoenix | 8,000 |
Central | Arizona League Dodgers Lasorda | Los Angeles Dodgers | Phoenix | Camelback Ranch | 12,000 |
Central | Arizona League Indians Blue | Cleveland Indians | Goodyear | Goodyear Ballpark | 10,000 |
Central | Arizona League Padres 2 | San Diego Padres | Peoria | Peoria Sports Complex | 12,882 |
Central | Arizona League Reds | Cincinnati Reds | Goodyear | Goodyear Ballpark | 10,000 |
Central | Arizona League White Sox | Chicago White Sox | Phoenix | Camelback Ranch | 12,000 |
West | Arizona League Brewers Blue | Milwaukee Brewers | Phoenix | American Family Fields of Phoenix | 8,000 |
West | Arizona League Dodgers Mota | Los Angeles Dodgers | Phoenix | Camelback Ranch | 12,000 |
West | Arizona League Indians Red | Cleveland Indians | Goodyear | Goodyear Ballpark | 10,000 |
West | Arizona League Mariners | Seattle Mariners | Peoria | Peoria Sports Complex | 12,882 |
West | Arizona League Padres 1 | San Diego Padres | Peoria | Peoria Sports Complex | 12,882 |
West | Arizona League Rangers | Texas Rangers | Surprise | Surprise Stadium | 10,500 |
West | Arizona League Royals | Kansas City Royals | Surprise | Surprise Stadium | 10,500 |
Et voilà !