COVID19 data: pdf scrapping

getting data

Since the COVID19 epidemic, a lot of data is published by state agency as pdf reports. For someone interested in data analysis, it is important to be able to extract these data out of the report. This is an example with COVID data in Spain given by the ministerio de sanidad. This are data by region.

Find the data

The first thing is to find one pdf report containing a report. Here is one I found, after a bit of searching, from the ministerio de sanidad page.

I contains the table I want to extract:

From the html address of the first link, you can try to guess the html address of the others, to automatize the pdf download. Here, the address https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/documentos/Actualizacion_35_COVID-19.pdf" contains the report 35. You could guess all reports can be accessed just changing the number 35 in the address.

Download the data

I don’t know what is the maximum report number, and I am not sure all reports are located on the same address format. A good way to try is to use the function download.file in a loop, with a tryCatch to avoid interrupting on the first error. To download such data in R, you can do the following:

filelist <- list.files(pattern = "report[0-9]+\\.pdf")
for(i in 1:100){
  if( ! paste0("report",i,".pdf") %in% filelist){
      tryCatch(download.file(paste0("https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/documentos/Actualizacion_",i,"_COVID-19.pdf"),
                destfile = paste0("report",i,".pdf"),mode = "wb"),error = function(e) e)
  }
}

Notice you need to specify the mode = "wb" in the download.file function to have readable pdf documents.

By checking the output (not shown here) you will now which pdf you managed to download or not. Here I managed to download the reports from 31 to 75, with 44 and 45 missing. Checking the report 75, It correspond to the last one, so I know the upper limit (I could have checked there). Doing some Google research with the URL name but index under 31, you will find that the first reports where under a slightly different html (and country specific) address: https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/documentos/Actualizacion_“,i,”_COVID-19_China.pdf

So knowing that, this should get the rest the first reports:

filelist <- list.files(pattern = "report[0-9]+\\.pdf")
for(i in 1:31){
  if( ! paste0("report",i,".pdf") %in% filelist){
      invisible(tryCatch(
        download.file(paste0("https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/documentos/Actualizacion_",i,"_COVID-19_China.pdf"),
                destfile = paste0("report",i,".pdf"),mode = "wb"),error = function(e) e))
  }
}

The report 44 is here , and the 45 is here.

Extract the tables

library(tabulizer)
## Warning: le package 'tabulizer' a été compilé avec la version R 3.6.3
library(lubridate)
library(data.table)
library(dplyr)
library(magrittr)
library(stringr)

Let’s see how we can extract the tables from a pdf with tabulizer. I will try with the report 36:

out <- extract_tables(paste0("report",36,".pdf"),encoding = "UTF-8")
out
## [[1]]
##       [,1]                            [,2]     [,3]     [,4]     [,5]   
##  [1,] ""                              ""       ""       ""       ""     
##  [2,] ""                              "Global" "China"  "Italia" "Irán" 
##  [3,] ""                              ""       ""       ""       ""     
##  [4,] "no casos"                      "93.062" "80.422" "2.502"  "2.336"
##  [5,] "IA últimos 14 días (casos"     ""       ""       ""       ""     
##  [6,] ""                              ""       "0,43"   "4,13"   "2,88" 
##  [7,] "100.000 habitantes)"           ""       ""       ""       ""     
##  [8,] "no nuevos casos (desde última" ""       ""       ""       ""     
##  [9,] ""                              "2.150"  "119"    "466"    "0"    
## [10,] "actualización 4/3/20 18:00)"   ""       ""       ""       ""     
## [11,] ""                              "3.198"  "2.984"  "80"     "77"   
## [12,] "no fallecidos (L)"             ""       ""       ""       ""     
## [13,] ""                              "(3,4)"  "(3,7)"  "(3,2)"  "(3,3)"
##       [,6]        [,7]      [,8]       [,9]    
##  [1,] "Corea del" ""        ""         ""      
##  [2,] ""          "Francia" "Alemania" "España"
##  [3,] "Sur"       ""        ""         ""      
##  [4,] "5.325"     "212"     "196"      "198"   
##  [5,] ""          ""        ""         ""      
##  [6,] "10,29"     "0,30"    "0,22"     "0,38"  
##  [7,] ""          ""        ""         ""      
##  [8,] ""          ""        ""         ""      
##  [9,] "513"       "8"       "8"        "47"    
## [10,] ""          ""        ""         ""      
## [11,] "32"        "4"       ""         "1"     
## [12,] ""          ""        "0"        ""      
## [13,] "(0,6)"     "(1,9)"   ""         "(0,6)" 
## 
## [[2]]
##       [,1]                 [,2]          [,3]                           
##  [1,] "CCAA"               "Total casos" "IA (casos/100.000 habitantes)"
##  [2,] "Andalucía"          "13"          "0,15"                         
##  [3,] "Aragón"             "0"           "0,00"                         
##  [4,] "Asturias"           "2"           "0,20"                         
##  [5,] "Baleares"           "5"           "0,43"                         
##  [6,] "Canarias"           "7"           "0,33"                         
##  [7,] "Cantabria"          "10"          "1,72"                         
##  [8,] "Castilla La Mancha" "12"          "0,59"                         
##  [9,] "Castilla y León"    "11"          "0,46"                         
## [10,] "Cataluña"           "15"          "0,20"                         
## [11,] "Ceuta"              "0"           "0,00"                         
## [12,] "C. Valenciana"      "19"          "0,38"                         
## [13,] "Extremadura"        "6"           "0,56"                         
## [14,] "Galicia"            "1"           "0,04"                         
## [15,] "Madrid"             "70"          "1,05"                         
## [16,] "Melilla"            "0"           "0,00"                         
## [17,] "Murcia"             "0"           "0,00"                         
## [18,] "Navarra"            "3"           "0,46"                         
## [19,] "País Vasco"         "17"          "0,77"                         
## [20,] "La Rioja"           "7"           "2,21"                         
## [21,] "Total"              "198"         "0,42"                         
##       [,4]             [,5]        
##  [1,] "Ingreso en UCI" "Fallecidos"
##  [2,] "1"              ""          
##  [3,] ""               ""          
##  [4,] "1"              ""          
##  [5,] ""               ""          
##  [6,] ""               ""          
##  [7,] ""               ""          
##  [8,] "1"              ""          
##  [9,] "1"              ""          
## [10,] ""               ""          
## [11,] ""               ""          
## [12,] ""               "1"         
## [13,] ""               ""          
## [14,] ""               ""          
## [15,] "2"              ""          
## [16,] ""               ""          
## [17,] ""               ""          
## [18,] "1"              ""          
## [19,] ""               ""          
## [20,] ""               ""          
## [21,] "7"              "1"         
## 
## [[3]]
##       [,1]        [,2]    [,3]              [,4]    [,5]
##  [1,] "País"      "Casos" "País"            "Casos" ""  
##  [2,] "Alemania"  "196"   "Lituania"        ""      "1" 
##  [3,] "Austria"   "24"    "Luxemburgo"      ""      "1" 
##  [4,] "Bélgica"   "8"     "Países Bajos"    ""      "28"
##  [5,] "Croacia"   "9"     "Portugal"        ""      "2" 
##  [6,] "Dinamarca" "8"     "República Checa" ""      "5" 
##  [7,] "España"    "198"   "Rumanía"         ""      "4" 
##  [8,] "Estonia"   "2"     "Suecia"          ""      "24"
##  [9,] "Finlandia" "7"     "Islandia"        ""      "16"
## [10,] "Francia"   "212"   "Noruega"         ""      "32"
## [11,] "Grecia"    "7"     "Suiza"           ""      "37"
## [12,] "Irlanda"   "2"     "Reino Unido"     ""      "51"
## [13,] "Italia"    "2502"  "San Marino"      ""      "8" 
## [14,] "Letonia"   "1"     "Mónaco"          ""      "1" 
## [15,] "Total"     ""      "3377"            ""      ""  
## 
## [[4]]
##      [,1]                     [,2]                       
## [1,] "SECRETARIA GENERAL"     ""                         
## [2,] "DE SANIDAD"             "Centro de Coordinación de"
## [3,] "DIRECCIÓN GENERAL DE"   "Alertas y Emergencias"    
## [4,] "SALUD PÚBLICA, CALIDAD" "Sanitarias"               
## [5,] "E INNOVACIÓN"           ""

We have three different tables here. I want the one with CCAA data here (in this report the second one). As I don’t know if the order of these table is always the same, I try to find them automatically by looking for the word “CCAA” in the tables (Autonomous communities, see here)

whichtable <- sapply(out,function(dt){
  any(grepl("CCAA",dt))
})
whichtable
## [1] FALSE  TRUE FALSE FALSE

Here I see that the second one is the one I want. I select it, and transform it to a data.table (or a data frame, does not make much difference).

table <- out[[which(whichtable)[1]]]
df <- as.data.table(table[-1,])
setnames(df,table[1,])
head(df)
##         CCAA Total casos IA (casos/100.000 habitantes) Ingreso en UCI
## 1: Andalucía          13                          0,15              1
## 2:    Aragón           0                          0,00               
## 3:  Asturias           2                          0,20              1
## 4:  Baleares           5                          0,43               
## 5:  Canarias           7                          0,33               
## 6: Cantabria          10                          1,72               
##    Fallecidos
## 1:           
## 2:           
## 3:           
## 4:           
## 5:           
## 6:

The date of the result is in the beginning of the report. I can extract it using pdf_text, which extract all the text as a single string, which allows me to use the stingr functions:

library(pdftools)
## Warning: le package 'pdftools' a été compilé avec la version R 3.6.3
date <- pdf_text("report36.pdf") %>%
  str_extract("[0-9]{1,2}\\.[0-9]{1,2}\\.2020") %>% # the date is d.m.yyyy or d.mm.yyyy or dd.mm.yyyy
  .[1] %>% # if extract several dates, I keep the first one
  dmy() # transform it to Date with lubridate
date
## [1] "2020-03-04"

I can now do the same for all reports by looping on all pdfs (be patient, the extract_tables function takes some time):

filelist <- list.files(pattern = "report[0-9]+\\.pdf")
table_list <- lapply(filelist,function(reportfile){
  print(reportfile)
  out <- extract_tables(reportfile,encoding = "UTF-8")
  whichtable <- sapply(out,function(dt){
    any(grepl("CCAA",dt) & grepl("CCAA",dt))
  })
  if(any(whichtable)){ # if a table of interest
      table <- out[[which(whichtable)[1]]]
  df <- as.data.table(table[-1,])
  setnames(df,table[1,])
  }else(df = data.table())

  date <- pdf_text(reportfile) %>%
    str_extract("[0-9]{1,2}\\.[0-9]{1,2}\\.2020") %>%
    .[1] %>%
    dmy()
  list(df = df,date_report = date)
})

before merging, we want to have a look at the tables names, to inspect if they are the same:

lapply(table_list,function(x){names(x$df)}) %>% unique()
## [[1]]
## character(0)
## 
## [[2]]
## [1] "CCAA"                          "Total casos"                  
## [3] "IA (casos/100.000 habitantes)" "Ingreso en UCI"               
## [5] "Fallecidos"                   
## 
## [[3]]
## [1] "CCAA"           "Total casos"    "IA Total*"      "Ingreso en UCI"
## [5] "Fallecidos"    
## 
## [[4]]
## [1] "CCAA"               "Total casos"        "IA últimos 14 días"
## [4] "Ingreso en UCI"     "Fallecidos"        
## 
## [[5]]
## [1] ""                                "Casos IA últimos 14 días (casos"
## [3] "Ingresados en"                   ""                               
## [5] ""                               
## 
## [[6]]
## [1] ""       ""       ""       ""       ""       ""       "Nuevos"
## 
## [[7]]
## [1] "CCAA"           "TOTAL conf."    "IA (14 d.)"     "Hospitalizados"
## [5] "UCI"            "Fallecidos"     "Curados"        "Nuevos"        
## 
## [[8]]
## [1] ""              ""              ""              "Casos que han"
## [5] "Casos que han" ""              ""              ""

Some of the pdf table extraction did not catch the titles. I will drop the last column and set names

for(i in 34:54){
  names(table_list[[i]]$df) <- c("CCAA","TOTAL conf.","IA (14 d.)","Hospitalizados","UCI","Fallecidos","Curados","Nuevos")  
}
for(i in 30:31){
  names(table_list[[i]]$df) <- c("CCAA","TOTAL conf.","IA (14 d.)","Hospitalizados","UCI","Fallecidos","Nuevos")  
}
 names(table_list[[24]]$df) <- c("","CCAA","TOTAL conf.","Fallecidos","")  

The, as the column names differ between some tables, I will grep the column names of interest. The column I want are the region names CCAA, the total cases, the intensive care unit UCI, and the deaths “Fallecidos”. I will, for each table, get the lines after “CCAA” appears in the CCAA column, as it is were the real table starts; Example:

head(table_list[[35]]$df)
##         CCAA  TOTAL conf. IA (14 d.)   Hospitalizados            UCI Fallecidos
## 1:                  TOTAL                                                      
## 2:      CCAA              IA (14 d.)        precisado      precisado Fallecidos
## 3:           confirmados*                                                      
## 4:                                   hospitalización* ingreso en UCI           
## 5: Andalucía        3.406      38,96            1.626            134        134
## 6:    Aragón        1.116      79,43              562             93         48
##    Curados Nuevos
## 1:               
## 2: Curados Nuevos
## 3:               
## 4:               
## 5:      77    396
## 6:       4    209

Here the column name were not extracted properly, and the table actually starts at line 5, so just after “CCAA” appears in the CCAA column.

df_tot <- lapply(table_list,function(x){
  if(dim(x$df)[1]>1){ # if the table is empty, i.e. the report did not ave table
  plouf <- x$df
  if(any(grepl("CCAA",plouf$CCAA))){ # if there is "CCAA" character in the "CCAA" column
   plouf <- plouf [plouf[,.I[CCAA == "CCAA"]+1]:plouf[,.N]]
  }
  dfnames <- tolower(names(plouf)) # to grep on lower cases
  # find the column of interest
  if(any(grepl("uci",dfnames))){
      cols <- c("CCAA", 
            names(plouf)[grep("total casos|total conf",dfnames)],
            names(plouf)[grep("uci",dfnames)],
            names(plouf)[grep("fallecido",dfnames)]
            )
  #select the clumns of interest
  plouf <- plouf[,.SD,.SDcols = cols]
  names(plouf) <- c("CCAA","casos","UCI","fallecidos")
  }else{
          cols <- c("CCAA", 
            names(plouf)[grep("total casos|total conf",dfnames)],
            names(plouf)[grep("fallecido",dfnames)]
            )
  #select the clumns of interest
  plouf <- plouf[,.SD,.SDcols = cols]
  names(plouf) <- c("CCAA","casos","fallecidos")
  }

  # setnames(plouf,cols,c("CCAA","casos","UCI","fallecidos"))
  # dd the date
  plouf[,date := x$date_report]
  }
}) %>%
rbindlist(.,fill = T,use.names = T) # bind all together

I am using the rbindlist function from data.table, that I use a lot, but you can with data frame use do.call(rbind,list(...)). I obtain my finale table:

head(df_tot)
##         CCAA casos UCI fallecidos       date
## 1: Andalucía    13   1            2020-03-04
## 2:    Aragón     0                2020-03-04
## 3:  Asturias     2   1            2020-03-04
## 4:  Baleares     5                2020-03-04
## 5:  Canarias     7                2020-03-04
## 6: Cantabria    10                2020-03-04

data management

As always, there is data management to do before having clean data. I will show you some mandatory steps when dealing with this kind of data. First, let’s check the CCAA we have

df_tot$CCAA %>% unique()
##  [1] "Andalucía"          "Aragón"             "Asturias"          
##  [4] "Baleares"           "Canarias"           "Cantabria"         
##  [7] "Castilla La Mancha" "Castilla y León"    "Cataluña"          
## [10] "Ceuta"              "C. Valenciana"      "Extremadura"       
## [13] "Galicia"            "Madrid"             "Melilla"           
## [16] "Murcia"             "Navarra"            "País Vasco"        
## [19] "La Rioja"           "Total"              "Castilla-La Mancha"
## [22] ""                   "ESPAÑA"

Here I want to remove the lines with “ESPANA”, the one with empty spaces "" and the “Total” (I will calculate it myself), and correct for the two “Castilla La Mancha” and “Castilla-La Mancha”.

df_tot <- df_tot[!CCAA %in% c("","ESPAÑA","Total")]
df_tot[,CCAA := gsub("-"," ",CCAA)]
df_tot$CCAA %>% unique()
##  [1] "Andalucía"          "Aragón"             "Asturias"          
##  [4] "Baleares"           "Canarias"           "Cantabria"         
##  [7] "Castilla La Mancha" "Castilla y León"    "Cataluña"          
## [10] "Ceuta"              "C. Valenciana"      "Extremadura"       
## [13] "Galicia"            "Madrid"             "Melilla"           
## [16] "Murcia"             "Navarra"            "País Vasco"        
## [19] "La Rioja"

Then I have seen that they use dots to separate the thousand in the number, which could result in big mistake if I transform directly to numeric. I thus clean the numeric variables by removing dots, spaces, and then convert it to numeric:

df_tot[,c("casos","UCI","fallecidos") := lapply(.SD,function(x){
  x %>% 
    gsub("\\.","",.) %>%
    gsub(" ","",.) %>%
    as.numeric
}),.SDcols = c("casos","UCI","fallecidos")]
## Warning in function_list[[k]](value): NAs introduits lors de la conversion
## automatique

## Warning in function_list[[k]](value): NAs introduits lors de la conversion
## automatique

Some plots:

Now that I have the data, I can do some (quick) plots: Here the evolution of deaths per region in spain

library(ggplot2)
ggplot(df_tot,aes(date,fallecidos,color = CCAA))+
  geom_point()+
  geom_line()+
  theme_bw()+
  scale_y_log10()+
  guides(color = guide_legend(ncol = 2))+
  labs(title = "Deaths evolution in Spain")
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 19 rows containing missing values (geom_point).
## Warning: Removed 18 rows containing missing values (geom_path).

Here a plot comparing the deaths with the confirmed cases in each spanish region:

library(ggplot2)
ggplot(df_tot[date == max(date)])+
  geom_point(aes(casos,fallecidos,color = CCAA),size = 4)+
  geom_line(aes(casos,casos*0.1),linetype = "dashed")+
  geom_line(aes(casos,casos*0.05),linetype = "dashed")+
  theme_bw()+
  scale_y_log10()+
  scale_x_log10()+
  guides(color = guide_legend(ncol = 2))+
  labs(title = "Deaths versus cases in Spain")+
  annotate(geom = "curve",
            x = 1000, y = 1000,
            xend = 1000, yend = 100,
            curvature = .2,
            color="black",
            size = 0.5,
            arrow = arrow(length = unit(3, "mm"))) +
   annotate(geom = "text",
            x = 1000, y = 2000,
            label = "10% case fatality rate",
            hjust = "center",
            color="black")+
    annotate(geom = "curve",
            x = 1000, y = 5,
            xend = 1000, yend = 50,
            curvature = .2,
            color="black",
            size = 0.5,
            arrow = arrow(length = unit(3, "mm"))) +
   annotate(geom = "text",
            x = 1000, y = 4,
            label = "5% case fatality rate",
            hjust = "center",
            color="black")

Avatar
Denis Mongin
Physicist, Data scientist

Related