Forbes cohort

How to use the data scraped from Forbes to understand how much did the billionaires gain

Mar 28, 2021 R

In beginning of 2021, when trying to know how much did billionaires gain in 10 years, I have seen people comparing the total amount of money owned by all the billionaires in 2010 and 2020, which is wrong (there are more in 2020 than in 2010, so the increase does not represent the average personal gain of the billionaires), and others comparing the mean of the billionaires’ fortune in 2020 and 2010, which is also wrong (because there are a lot of newcomers in the 2020 list, lowering substancially the mean of their fortune).

The proper way to answer the question how much did billionaires gain in 10 years is to construct a cohort, that is find billionaires from 2010 in 2020, calculate their individual fortune change and make some descriptive statistic with this info.

From my previous post, we did get from Forbes the list fo billionaires in 2020 and 2010. I will present here some data management to get a cohort from these data and get some (correct) idea about how much did the billionaires from 2010 gain.

The libraries I will use:

library(stringdist) # to recognize names
library(stringr) # to do some regex stuff
library(ggplot2) # to plot
library(data.table) # to do data handling. If you prefer dplyr, it is ok too
rm(list = ls())

I use `data.table, because I think this is the best R package to handle data in general. I like dplyr too, but data.table is better, faster, and has a shorter and more consistent syntax, which I like. See this post for some equivalence between the two, and this famous question for the comparison between dplyr and data.table ;).

I use str_extract() from {stringr} to extract numerical values

forbes2020[,money := as.numeric(str_extract(money,"[0-9.]+"))]
forbes2020[,age := as.numeric(str_extract(age,"[0-9.]+"))]
forbes2020[,rank := as.numeric(rank)]

and also rename the variable from 2010:

setnames(forbes2010,
         c("Rank","Name","Citizenship","Age", "Net Worth ($bil)"),
         c("rank","name","country","age","money"))
forbes2010[,money := as.numeric(str_extract(money,"[0-9.]+"))]
forbes2010[,age := as.numeric(str_extract(age,"[0-9.]+"))]
forbes2010[,rank := as.numeric(rank)]

To recognize names in the two data base, I cannot rely on simply comparing the strings: I will for sure miss some, because of upper case letters, spaces, etc. Therefore, I create a function to normalize the names, in order to compare them:

normalize_name = function(x){
  tolower(x) %>%
    stringi::stri_trans_general(.,"Latin-ASCII") %>% # remove accents
    gsub("&|and|family","",.)%>% 
    gsub("-"," ",.) %>% # replace - by space
    gsub("[ ]{1,}"," ",.) %>% # only one space
  gsub("^[ ]+|[ ]+$","",.)  # remove leading and trailing spaces
}

Here, the function removes all accents, remove some words (&,and,family), and remove additional spaces. I use gsub because of a persistent habit, but str_replace works the same. An example:

normalize_name(c("Martin Machin-truc des écoles ","J Smith  Junior and Family"))

## [1] "martin machin truc des ecoles" "j smith junior"

Now I apply it on my data:

forbes2010[,name := normalize_name(name)]
forbes2020[,name := normalize_name(name)]

To compare the names, I use the library stringdist, to compute the string distance matrix:

dist_mat <- stringdistmatrix(forbes2010$name,forbes2020$name,method = "jw")
dim(dist_mat)

## [1]  950 2325

I have a matric with one line per names in the Forbes 2010 list, and one row per names in the 2020 list. The jw method gives a score between 0 and 1, 0 being identical strings and 1 being different ones. I construct the correspondence table by taking the row and column values of below a small threshold:

corr <- as.data.table(which(dist_mat <= 0.05,arr.ind = TRUE))
head(corr)

##    row col
## 1: 932   3
## 2: 206   5
## 3: 928   8
## 4:   2   9
## 5:   8  10
## 6: 949  11

I obtain thus here the correspondence between the lines from the 2010 list (the rows) and the 2020 list (the col). Let us check: I sample 10 lines of my correspondence table, an bind the name of the two list to compare them:

set.seed(184)
test <- sample(1:dim(corr)[1],10)
cbind(forbes2010[corr[,row],name][test],forbes2020[corr[,col],name][test])

##       [,1]               [,2]              
##  [1,] "john menard jr"   "john menard jr"  
##  [2,] "charles bronfman" "charles bronfman"
##  [3,] "alec gores"       "alec gores"      
##  [4,] "michael herz"     "michael herz"    
##  [5,] "nassef sawiris"   "nassef sawiris"  
##  [6,] "glenn dubin"      "glenn dubin"     
##  [7,] "patrick lee"      "patrick lee"     
##  [8,] "august von finck" "august von finck"
##  [9,] "juan abello"      "juan abello"     
## [10,] "james dinan"      "james dinan"

I can sometime have two correspondence: in this case, I take the smallest score of the distance. I thus calculate the number of correspondence for each line (name in 2010 Forbes list)

corr$score <- mapply(function(x,y) dist_mat[x,y],corr[,row],corr[,col])
corr[,Ncoresp := .N,by = row]

And I take the line with the smallest score

cbind(forbes2010[corr[Ncoresp>1,row],name],forbes2020[corr[Ncoresp>1,col],name])

##       [,1]             [,2]            
##  [1,] "leonard lauder" "leonard lauder"
##  [2,] "ronald lauder"  "leonard lauder"
##  [3,] "jim davis"      "jim davis"     
##  [4,] "michael herz"   "michael herz"  
##  [5,] "leonard lauder" "ronald lauder" 
##  [6,] "ronald lauder"  "ronald lauder" 
##  [7,] "jim davis"      "jim davis"     
##  [8,] "robert miller"  "robert miller" 
##  [9,] "robert miller"  "robert miller" 
## [10,] "michael herz"   "michaela herz"

corr <- corr[,.SD[score == min(corr)],by = row]

I then create a unique identifier for these perfect matches:

corr[,id := paste0("perfect",1:.N)]
forbes2010[corr$row,id := corr$id]
forbes2020[corr$col,id := corr$id]
head(corr)

##    row col score Ncoresp       id
## 1: 932   3     0       1 perfect1
## 2: 206   5     0       1 perfect2
## 3: 928   8     0       1 perfect3
## 4:   2   9     0       1 perfect4
## 5: 949  11     0       1 perfect5
## 6: 929  12     0       1 perfect6

From the 950 initial billionaires, I found

forbes2010[!is.na(id),.N]

## [1] 586

of them in the 2020 list. I am still missing some. I will now try to identify the family names for those without correspondence: I want to catch the misspelled surnames, and the family heritages. The family name is the last word in the name, once removed the Jr, sr, ii and so on:

forbes2010[,lastname := str_extract(gsub(" jr$| sr$| [i]{1,}$","",name),"\\w+$")]
forbes2020[,lastname := str_extract(gsub(" jr$| sr$| [i]{1,}$","",name),"\\w+$")]

forbes2020[,Nlastname := nchar(lastname),by = lastname]
forbes2010[,Nlastname := nchar(lastname),by = lastname]

I re-do a string distance matrix, only for family names having more than 4 letters (for the smaller one, there is too much risk of making a mistake for one letter difference)

familiy_name_lim <- 4
plouf2 <- stringdistmatrix(forbes2010[is.na(id) & Nlastname > familiy_name_lim,lastname],
                           forbes2020[is.na(id)& Nlastname > familiy_name_lim,lastname],method = "jw")
corr2 <- as.data.table(which(plouf2 < 0.02,arr.ind = TRUE))

Let us see what we got:

# verifs
cbind(forbes2010[is.na(id) & Nlastname > familiy_name_lim][corr2[,row],name[1:10]],
      forbes2020[is.na(id) & Nlastname > familiy_name_lim][corr2[,col],name[1:10]])

##       [,1]                      [,2]               
##  [1,] "jeffrey bezos"           "jeff bezos"       
##  [2,] "william gates iii"       "bill gates"       
##  [3,] "lawrence ellison"        "larry ellison"    
##  [4,] "steven ballmer"          "steve ballmer"    
##  [5,] "s. robson walton"        "rob walton"       
##  [6,] "philip knight"           "phil knight"      
##  [7,] "michele ferrero"         "giovanni ferrero" 
##  [8,] "alain gerard wertheimer" "alain wertheimer" 
##  [9,] "alain gerard wertheimer" "gerard wertheimer"
## [10,] "james simons"            "jim simons"

In 2010, Jeff Bezos was called Jeffrey, Bill Gates was called William gates 3 (which is actually his name, see https://en.wikipedia.org/wiki/Bill_Gates ), and Alina and Gerard Wertheimer were consider the same person whereas they are two different entries in 2020, etc. As I want to compare their fortune, I will identif the entry alain gerard wertheimer as being the equivalent of both alain wertheimer + gerard wertheimer. I assign a family identifier in both tables:

corr2[,id := paste0("family",.GRP),by  = row]
forbes2010[forbes2010[is.na(id) & Nlastname > familiy_name_lim][corr2[,row]],
           id := corr2$id,
           on = "name"]
forbes2020[forbes2020[is.na(id) & Nlastname > familiy_name_lim][corr2[,col]],
           id := corr2$id,
           on = "name"]

I found 108 supplementary persons:

forbes2010$id %>% str_extract("[a-z]+") %>% table()

## .
##  family perfect 
##     106     586

I check now the missing “big” billionaires:

forbes2010[is.na(id) & money > 10,name]

## [1] "david koch"                       "kwok"                            
## [3] "anil ambani"                      "gerald cavendish grosvenor"      
## [5] "forrest mars jr"                  "liliane bettencourt"             
## [7] "prince alwaleed bin talal alsaud"

I can correct some by hand; Bettencourt changed for Bettencourt Meyers, some Kuehne transformed into Kuhne etc

forbes2020[grepl("wuerth",name),id := "wuerth"] 
forbes2010[grepl("david koch",name),id := "koch"]
forbes2020[grepl("julia koch",name),id := "koch"]
forbes2010[name == "forrest mars jr",id := "mars"]
forbes2020[name == "victoria mars" | name =="valerie mars" | name =="marijke mars",id := "mars"]
forbes2010[grepl("wurth",name),id := "wuerth"] 
forbes2020[grepl("kuehne",name),id := "kuhne"] 
forbes2010[grepl("kuhne",name),id := "kuhne"] 
forbes2020[grepl("bettencourt",name),id := "bettencourt"] # changement de nom
forbes2010[grepl("bettencourt",name),id := "bettencourt"] # changement de nom

I want to see the fortune change, so I remove those who died and did not give their money their children (who became billionaires). I consider that billionaires of more than 85 in 2010 that do not appear in 2020 are dead:

forbes2010[(is.na(id) & age > 85),.N]

## [1] 30

They are 30. I remove these dead persons:

forbes2010 <- forbes2010[!grepl("grosvenor",name)] # dead
forbes2010 <- forbes2010[!grepl("dorothea steinbruch",name)] # dead
forbes2010 <- forbes2010[!grepl("john kluge",name)] # dead
forbes2010 <- forbes2010[!(is.na(id) & age > 85)]

I consider the others as missing in the 2020 list and suppose they lost all their money (that is why they are not in the list)

forbes2010[is.na(id),id := paste0("missing",1:.N)]
forbes2010$id %>% str_extract("[a-z]+") %>% table()

## .
## bettencourt      family        koch       kuhne        mars     missing 
##           1         106           1           1           1         210 
##     perfect      wuerth 
##         586           1

I then calculate for each list the fortune per id (i.e. including family if I did not find them)

forbes2010_red <- forbes2010[,.(money = sum(money),name = name[1]),by = id]
forbes2010_red[,year := 2010]
forbes2020_red <- forbes2020[!is.na(id),.(money = sum(money)),by = id]
forbes2020_red[,year := 2020]

I then create the cohort:

tot <- merge(forbes2010_red[,.(id,name,money_2010 = money)],
             rbind(forbes2020_red[!is.na(id),.(id,money_2020 = money)],
                   forbes2010_red[grepl("missing",id),.(money_2020 = 0,id)])
             ) 
head(tot)

##             id                name money_2010 money_2020
## 1: bettencourt liliane bettencourt       20.0       71.4
## 2:     family1       jeffrey bezos       12.3      190.2
## 3:    family10    alexei mordashov        9.9       22.8
## 4:   family101         fahd hariri        1.4        2.4
## 5:   family102          joan tisch        2.7        3.5
## 6:   family103        hubert burda        2.5        1.3

I save the cohort. You can find the data here (if you are only interested by the data visualization).

fwrite(tot,file = "forbes_cohort.csv")

The cohort are all billionaires from 2010, with their fortune in 2020. Those I did not found in 2020 have 0 money. We can then calculate their fortune change: I calculate the difference in billion $, the difference considering 20% inflation in 10 years, and the same but in percentage.

diffdf <- tot[,.(diff = money_2020 - money_2010,
              diff_infl = money_2020 - money_2010*1.2,
              diff_percent = round((money_2020 - money_2010)/ money_2010*100,1),
              diff_percent_infl = round((money_2020 - money_2010*1.2)/ (money_2010*1.2)*100,1)),
              by = id]
diffdf %>% summary()

##       id                 diff           diff_infl        diff_percent    
##  Length:894         Min.   :-21.600   Min.   :-27.000   Min.   :-100.00  
##  Class :character   1st Qu.: -1.200   1st Qu.: -1.560   1st Qu.: -57.02  
##  Mode  :character   Median :  0.700   Median :  0.260   Median :  31.95  
##                     Mean   :  3.208   Mean   :  2.470   Mean   :  87.22  
##                     3rd Qu.:  3.100   3rd Qu.:  2.515   3rd Qu.: 136.82  
##                     Max.   :177.900   Max.   :175.440   Max.   :5041.70  
##  diff_percent_infl
##  Min.   :-100.00  
##  1st Qu.: -64.17  
##  Median :   9.95  
##  Mean   :  56.01  
##  3rd Qu.:  97.38  
##  Max.   :4184.70

We see here that the median increase of the billionaires’ fortune is 700 million $, the mean around 3.2 billion. In percent, the median is 87% increase, and around 55% when corrected from inflation. The biggest increases are:

bigest_gain <- diffdf[order(-diff)][1:10,id]
tot[id %in% bigest_gain]

##            id              name money_2010 money_2020
##  1:   family1     jeffrey bezos       12.3      190.2
##  2:   family2 william gates iii       53.0      120.5
##  3:   family3  lawrence ellison       28.0       88.9
##  4:   family4    steven ballmer       14.5       78.9
##  5:   family5  s. robson walton       19.8       84.4
##  6:  perfect1   bernard arnault       27.5      148.8
##  7: perfect11           jack ma        1.2       61.7
##  8:  perfect2   mark zuckerberg        4.0       99.6
##  9:  perfect4        larry page       17.5       80.2
## 10:  perfect8        ma huateng        3.6       68.0

It is weird not to see Elon Musk in a top 10 about fortune, but he actually was not a billionaire in 2010… The biggest losses are:

bigest_loss <- diffdf[order(diff)][1:10,id]
tot[id %in% bigest_loss]

##             id                             name money_2010 money_2020
##  1:  family105                   ingvar kamprad       23.0        3.3
##  2:   family71                     eike batista       27.0        5.4
##  3:   family95                       paul allen       13.5        1.5
##  4:   missing1                      anil ambani       13.7        0.0
##  5:   missing2               mohammed al amoudi       10.0        0.0
##  6: missing210 prince alwaleed bin talal alsaud       19.4        0.0
##  7: perfect146                   christy walton       22.5        7.8
##  8: perfect307                     michael otto       18.7        3.7
##  9: perfect458                 shashi ravi ruia       13.0        2.2
## 10:  perfect84                   lakshmi mittal       28.7       13.8

It seems right: Anil Ambani bankrupted, Mohamed al Amoudi does not seem to be in Forbes 2020 list (?), prince Alwaleed Bin Talal Alsaud has been removed too, etc.

Here is a simple data viz of the fortune increase in percentof all the billionaires of 2010 from our (not perfect) cohort:

diff_plot <- rbind(diffdf[,.(year = 2020,id,diff_percent,diff)],
                   data.table(diff_percent = 0,diff = 0,year = 2010,id = diffdf$id)) 
ymin = -100
ymax = 600
p <- diff_plot %>%
  ggplot(aes(as.factor(year),diff_percent,group = id))+
  geom_hline(yintercept = 0,linetype = "dotted",size = .2)+
  geom_line(size = .4,alpha = 0.05)+
  geom_vline(xintercept = 2,size = .3)+
  coord_cartesian(ylim = c(ymin,ymax),xlim = c(1,2))+
  theme_light()+
  scale_y_continuous(breaks = seq(-ymin,ymax,100))+
  theme(panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.title.y = element_text(margin = margin(r = 3,unit = "mm")),
        plot.title = element_text(hjust = 0.5),plot.subtitle = element_text(hjust = 0.5))+
  labs(x = "",y = "% change of fortune",
       title = "How much did the 907 billionaires\n of 2010 gain in 10 years",
       subtitle = "Median increase of 30%, mean increase of 86%",caption = "@denis_mongin\ndata: Forbes")

arrowdf <- diffdf[order(-diff_percent)][1:6][,.(xend = 1+600/diff_percent,
                                        xarrow = 0.7,id,yend = 600,
                                        yarrow = 600 - (.I-1)*100)]
arrowdf[forbes2010[,.(name,id)],name := i.name,on = "id"]

 p + 
  geom_curve(data = arrowdf,
             aes(x = xarrow,xend = xend,y = yarrow,yend = yend),
             curvature = -.1,arrow =arrow(length = unit(2, "mm"),type = "closed"),
             color = "grey60" ,size = .2,alpha = .6)+
  geom_text(data = arrowdf,aes(xarrow-0.1,yarrow - 20,label = name),size = 3,alpha = 0.5)+
  annotate(geom = "curve", 
           x = 2.2, y = 100, 
           xend = 2, yend = -100, 
           curvature = -.2, 
           color="grey50",
           size = .2,alpha = .6,
           arrow = arrow(length = unit(2, "mm"),type = "closed"))+
  annotate(geom = "text",
           x = 2.25, y = 220,
           label = "
           We suppose that
           all billionaires from 2010
           not in the 2020 list
           have lost
           all their fortune
           ",
           hjust = "center",
           color="grey50",alpha = .6,size = 3)

Et voilà. This a rather conservative approach, in the sense that I surely missed some of the billionaires, and considered that they lost all their fortune. So if there is a bias, I underestimate the billionaire’ gain.

R data.table epidemio

Forbes cohort

Denis Mongin

Physicist, Data scientist

Related