How to find someone in a data frame

An important technical problem when updating personal data

Context

During the COVID19 outbreak in Geneva, I worked at état de Genève as a data scientist. We set up the follow-up system of COVID19 patient in Geneva, based on redcap to allow to centrally store the data stemming from the different labs, from the hospitals, from the state. We used for all the data management and data handling, because I master it properly, because there are package allowing to interact directly with redcap via its API. But with the outbreak going on, we rapidly faced a problem: how to properly identify a person when you have different source of data? We needed to/

  • detect potential duplicated person in our database
  • update data that were already stored in redcap with new results, but coming from different sources, where the name and surname were sometime not exctly the same.

And actually both point are the same issue: how to identify a person.

I will detail here the different step of the solution I implemented, beginning with the second point: how to detect duplicated person in a data frame. It involves a bit of regex, some fuzzy matching concepts, some clustering analysis, a bit of graph analysis, and some nice logical tricks.

The libraries:

library(data.table)
library(dplyr)
library(igraph)
library(stringdist)
library(qualV)

# for this page:
library(DT)
library(viridis)

I use data.table, but little is required. I strongly encourage using this package when dealing with data (see here for a nice discussion about it). You will find nice explanation and examples in the vignettes of this package, but here are the basics of what I will use:

dt <- data.table(a = sample(1:10,10,replace = T),ID = sample(1:3,10,replace = T))
dt
##      a ID
##  1:  9  1
##  2:  6  1
##  3: 10  1
##  4:  1  2
##  5:  6  3
##  6:  9  2
##  7:  7  2
##  8:  8  1
##  9:  7  2
## 10:  4  2

To assigne a new column:

dt[,b := a + 5]

To subset:

dt[a == 6]
##    a ID  b
## 1: 6  1 11
## 2: 6  3 11

To do grouping operations:

dt[,mean(a),by  = ID]
##    ID   V1
## 1:  1 8.25
## 2:  2 5.60
## 3:  3 6.00

I use .GRP which is a help returning the grouping index:

dt[,.GRP,by = ID]
##    ID GRP
## 1:  1   1
## 2:  2   2
## 3:  3   3
dt[,.GRP,by = .(ID,a)]
##    ID  a GRP
## 1:  1  9   1
## 2:  1  6   2
## 3:  1 10   3
## 4:  2  1   4
## 5:  3  6   5
## 6:  2  9   6
## 7:  2  7   7
## 8:  1  8   8
## 9:  2  4   9

That’s it.

Let’s find them

I will detail the steps on a simple example:

df <- data.table(surname = c("Mongin","mongin","smith","De la cruz","de-la-cruz","smith","mongin","SMITH","smith","Smith","Cruz"),
           name = c("denis","denis diego","paul","sofia","sophia, Elèna","Paul","dénis","Paul ","Etienne","Paule","sofia"),
           birthdate = c("29-08-1986","29-08-1986","02-06-1945","02-03-2000","02-03-2000","20-06-1945","29-07-1986","02-06-1945","02-06-1945","02-06-1945","02-03-2000"))


Here we have 4 people, and we want to identify them properly. The name sometime changes, because some data store all the names when other just the first, they are some typos on the birthdates, etc. Real life data.

Normalize the names and surnames

The first thing to do is to try to get the name and surname on the same format: removing white spaces, punctuation marks, accents, and putting everything lower case:

clean_name <- function(name) {
  name_clean <- name %>%
    as.character() %>%
    tolower(.) %>%
    chartr("âäëéèêóôöüûçîï", "aaeeeeooouucii", .) %>%
    gsub("\\bla\\b|\\bde\\b|\\bda\\b|\\bdu\\b|\\bl'\\b", "", .) %>%
    gsub("[ ]+|[-]|[,]|[\\.]", "", .)
  return(name_clean)
}
clean_name(df$name)
##  [1] "denis"       "denisdiego"  "paul"        "sofia"       "sophiaelena"
##  [6] "paul"        "denis"       "paul"        "etienne"     "paule"      
## [11] "sofia"
cat("\n\n")
clean_name("testéDetru  -bizarùe.")
## [1] "testedetrubizarùe"

Here:

  • chartr change one character for another. I use it to remove the accents.
  • In regex: -\\b indicates a word boudary: I remove all de, du and other small word of composed surnames.
    • | is for or

It is acually a good start, but it is not enough: If I clean names and surnames in the data:

df[,name_clean := clean_name(name)]
df[,surname_clean := clean_name(surname)]

and associate a person index for all person with same cleaned name, surname and birthdate:

df[,person := .GRP,by = .(name_clean,surname_clean,birthdate)]

I recognize some of these persons, but I miss most of them:

fuzzy matching

The comparison function

So, I need to let a bit of variation in my character recognition. There are quite a lot of possibility with the library stringdist, but most of them did not solve the problem I had with the name, which is the following: denis and denisdiego should be the same, but denis and yanis should not. If I try

stringdist("denis","denisdiego")
## [1] 5
stringdist("denis","yanis")
## [1] 2

The distance is greater for the first case, because I need to add 5 characters to go from denis to denisdiego, while I only need to change two characters to switch from denis to yanis. My solution uses the qualV package. Here is the first function I did:

library(qualV)
comp_names <- function(a, b) {
  a <- clean_name(a) # cleaning"
  b <- clean_name(b)
  av <- strsplit(a, '')[[1]] # splitting all characters
  bv <- strsplit(b, '')[[1]] 
  match_str <- paste(qualV::LCS(av, bv)$LCS, collapse='')
  pc <- round(nchar(match_str)/(min(nchar(a), nchar(b)))*100)
  return(pc)
}

I clean the two names a and b, cut it in sequence of character, fin the longest common subsequence, and express it as a percentage of the smallest name. It outputs a similarity on a 100 scale:

comp_names("denis","denisdiego")
## [1] 100

denis and denisdiego are totally similar, while

comp_names("denis","yanis")
## [1] 60

denis and yanis are not so much the same.

Matrix of distances and clustering

I need to calculate the distance between all the entries of a vector, i.e. I need to compute a distance matrix. The distance between two names Tis just 100 - similarity I calculate before. Here is the function to generate the distance matrix:

comp_names_matrix <- function(vec) {
  mat <- sapply(vec, function(x) {
    100 - as.numeric(Vectorize(comp_names)(x, vec))
  })
  rownames(mat) <- vec
  return(mat)
}
comp_names_matrix(df$name)
##               denis denis diego paul sofia sophia, Elèna Paul dénis Paul 
## denis             0           0  100    80            60  100     0   100
## denis diego       0           0  100    60            70  100     0   100
## paul            100         100    0    75            25    0   100     0
## sofia            80          60   75     0            20   75    80    75
## sophia, Elèna    60          70   25    20             0   25    60    25
## Paul            100         100    0    75            25    0   100     0
## dénis             0           0  100    80            60  100     0   100
## Paul            100         100    0    75            25    0   100     0
## Etienne          60          57  100    80            57  100    60   100
## Paule            80          80    0    80            20    0    80     0
## sofia            80          60   75     0            20   75    80    75
##               Etienne Paule sofia
## denis              60    80    80
## denis diego        57    80    60
## paul              100     0    75
## sofia              80    80     0
## sophia, Elèna      57    20    20
## Paul              100     0    75
## dénis              60    80    80
## Paul              100     0    75
## Etienne             0    80    80
## Paule              80     0    80
## sofia              80    80     0

Now I can use the basic tools of clustering to see if I can identify the similar names:

df$name %>%
      comp_names_matrix(.) %>%
      stats::as.dist(.) %>%
      stats::hclust(.) %>% 
  plot 

Here it seems that my distance allow me to detect the similar person in a proper way. I need to find a threshold to produce an index indicating the names that are the same. I see in the dendrogram above that it lies between 20 and 40. The function to use is stats::cutree(h = thres). I write the following, that use the name comparison function described above and detect clusters of names:

compare_tree <- function(vec, thres = 30) {
  if(length(vec)>1){
  output <- vec %>%
    comp_names_matrix(.) %>%
    stats::as.dist(.) %>%
    stats::hclust(.) %>%  # cluster
    stats::cutree(h = thres)}else{output <- as.integer(1)}
  return(output)
}
compare_tree(df$name,30)
##         denis   denis diego          paul         sofia sophia, Elèna 
##             1             1             2             3             3 
##          Paul         dénis         Paul        Etienne         Paule 
##             2             1             2             4             2 
##         sofia 
##             3

It recognize the 4 names.

I can now apply this function on the names, surnames and birthdates too.

df[,name_ind := compare_tree(name,30)]
df[,surname_ind := compare_tree(surname,30)]
df[,birthdate_ind := compare_tree(birthdate,20)]

I define a person index for the common detected name, surname and birthdate:

df[,person := .GRP,by = .(birthdate_ind,surname_ind,name_ind)]

It worked. It is flexible enough to allow small mistakes in birthdate and can handle the fact that sometime there is just one name, sometime there are the 3 names of that same person.

If I do that on my entire COVID19 patient table, my computer turn into that:

I have around 70000 patients, which lead to 4.910^{9} distances to calculate. My solution is nice but cannot be used in a real life situation. I therefore need to constrain the degrees of freedom.

Less degrees of freedom

To avoid calculation of immense distance matrix, I can do grouping operation. What I came up with is to calculate the similarity of the names for person having identical cleaned surname and birthdate:

df[,idx1 := paste0("1_",.GRP,"_",compare_tree(name,30)),
   by = .(birthdate,surname_clean)]

The index is constructed as follow: it starts with 1, then it has the grouping index produced by .GRP, and then the index indicating similar names. This gives me a first index: person with same birthdate and surname, and similar names. I do the same for names and birthdate:

df[,idx2 := paste0("2_",.GRP,"_",compare_tree(surname,30)),
   by = .(name_clean,birthdate)]

and the last one:

df[,idx3 := paste0("3_",.GRP,"_",compare_tree(birthdate,20)),
   by = .(name_clean,surname_clean)]

You see that line 1 and 2 have the same idx1 index: same surname, same birthdate, and similar name detected by my function. You can see also that line 1 and 7 have the same idx3 index: same name, surname, and similar birthdates. I need to somehow reconnect these indices to merge them into one, indicating that the lines 1, 2 and 7 are the same person. The answer came from stack: we need to do graph analysis and identify clusters. We can define the lines as nodes, connected by the 3 indices. Nodes connected are the same person. To do that, I use the igraph library:

df[, rn := .I]
DT <- rbindlist(list(
  df[, .(s=idx1, e=idx2, rn)],
  df[, .(s=idx1, e=idx3, rn)],
  df[, .(s=idx2, e=idx3, rn)]))

df[, rn := .I] just create the row number rn (.I is a shortcut for the row number in data.table, really useful in some complex grouping operation). DT is the list of edges and vertices. Here is the graph:

g <- graph_from_data_frame(DT, directed=FALSE)
print(g)
## IGRAPH ea9b148 UN-- 22 33 -- 
## + attr: name (v/c), rn (e/n)
## + edges from ea9b148 (vertex names):
##  [1] 1_1_1--2_1_1 1_1_1--2_2_1 1_2_1--2_3_1 1_3_1--2_4_1 1_3_1--2_5_1
##  [6] 1_4_1--2_6_1 1_5_1--2_7_1 1_2_1--2_3_1 1_2_2--2_8_1 1_2_1--2_9_1
## [11] 1_3_1--2_4_1 1_1_1--3_1_1 1_1_1--3_2_1 1_2_1--3_3_1 1_3_1--3_4_1
## [16] 1_3_1--3_5_1 1_4_1--3_3_1 1_5_1--3_1_1 1_2_1--3_3_1 1_2_2--3_6_1
## [21] 1_2_1--3_7_1 1_3_1--3_4_1 2_1_1--3_1_1 2_2_1--3_2_1 2_3_1--3_3_1
## [26] 2_4_1--3_4_1 2_5_1--3_5_1 2_6_1--3_3_1 2_7_1--3_1_1 2_3_1--3_3_1
## [31] 2_8_1--3_6_1 2_9_1--3_7_1 2_4_1--3_4_1

I then detect clusters:

cl <- clusters(g)$membership
cl
## 1_1_1 1_2_1 1_3_1 1_4_1 1_5_1 1_2_2 2_1_1 2_2_1 2_3_1 2_4_1 2_5_1 2_6_1 2_7_1 
##     1     2     3     2     1     4     1     1     2     3     3     2     1 
## 2_8_1 2_9_1 3_1_1 3_2_1 3_3_1 3_4_1 3_5_1 3_6_1 3_7_1 
##     4     2     1     1     2     3     3     4     2
DT[, g := cl[s]]

And I merge back the found index in my initial data table, merging on the row number:

df[unique(DT, by="rn"), on=.(rn), person := g]

That’s it !

I put everything in a function:

Finding someone in a dataframe

Now, how can I find someone in my example data.frame ? Let us suppose I want to find these two persons in my example data frame:

tofind <- data.table(surname = c("mongin","Cruz, de la"),name = c("denis","Sophia"),birthdate = c("29-09-1986","02-03-2000"))
tofind
##        surname   name  birthdate
## 1:      mongin  denis 29-09-1986
## 2: Cruz, de la Sophia 02-03-2000

Well, the nice thing is that I can use the exact same function to do so. I first create a variable to indicate the lines and name of each data frame:

  tofind[,rn_df := .I]
  df[,rn_df := .I]
  
  tofind[,dfframe := "tosearch"]
  df[,dfframe := "database"]

and I bind the two together

dftot <- rbind(tofind,df[,.(name,surname,birthdate,rn_df,dfframe)])

Looking for someone is the same as detecting the duplicates in the binded data frame ! If I use my function to detect the person on dftot:

res <- detect_duplicate(dftot)

I can easily see which line comes from the data frame with people I wanted to find:

I can now easily extract the correspondence between the lines of each data frame:

dftot[,expand.grid(rn_tofind = rn_df[dfframe == "tosearch"],
                       rn_database = rn_df[dfframe == "database"]),
                     by = person]
##    person rn_tofind rn_database
## 1:      1         1           1
## 2:      1         1           2
## 3:      1         1           7
## 4:      2         2           4
## 5:      2         2           5
## 6:      2         2          11

The line 1 of my data frame with the people I was searching mongin, denis, 29-09-1986, 1, tosearch is found on line 1, 2 and 7 of my example data base, while the second person Cruz, de la, Sophia, 02-03-2000, 2, tosearch is on line 4, 5 and 11 of my data base.

Conclusion

The function here is a simplified version of the one we use, but it contains all the core function that makes it efficiently works. It became a central part of the COVID data handling in Geneva, as we use it to update all the data of our data base directly from R: test results, deaths, hospitalizations, contacts etc.

Avatar
Denis Mongin
Physicist, Data scientist

Related