Blogs

The Record Linkage Problem

Kenny Darrell

April 10, 2016

The record linkage problem is a nasty beast that pops its head up when you have disperate data sources. If you have ever encountered it yourself then you have shared in my frustrations. It can be a real pain. In all analytics work you get side tracked with cleansing issues, but this one is a real pain becuase it often requires its own modeling process. This can then carry some uncertaininty forward into all of the downstream analysis. As I work with it here and there I want to persist some thoughts, definitions and a few code snippets in this post.

The problem goes by many different names and depending on the level of rigor can have many phases.

AKA

Some ground can be made by defining some of the terms in this area.

Entity - A real world object. This can be a person, place or anything else. It does not need to be a physichal thing.

Attributes - An entity has various types of data that can be attached to it. A person has a first name a last and a birth date.

Reference - Data contains observations that are filled with attributes that refer to real world entities.

When we create a database we spend some time ensuring that we know what entities exist and we clearly define them and have the ability to add a uniqness constraint. Then we create a key for each entity and then istead of ever using the enitity we use the key so we are using a reference to the entity. When data comes from disperate sources we no longer have a key that clearly and uniquely shows the reference entity. There is more than just one challange here.

Deduplication - Remove redundent references to the same entity in one data source. This normalization will mean that we have a unique set of entities. This was mentioned previosly as the part of creating a key in a database.

Canonicalization - This creates the most complete record. This is similar to deduplication except that the duplicate versions may have more or less information than others, and the goal is to create the most full form of the entity.

Record Linkage - When we have two different sources of of deduplicated and canonicalized data sets and we need determine which observations in one set should be linked to that in another.

Disambiguation or Referenceing - To match a noisy set to a deduped set. This can also add extra information to contribute to a more complete reference entity.

This source has more information all on all of these but the highlight is the that it provides this great image.

In data science the problem is more often called entity resolution. This term can mean many things. This source lays out this larger overall process. I have often thought that all of this was more Master Data Management.

  1. Entity Reference Extraction
  2. Entity Reference Preperation
  3. Entity Reference Resolution
  4. Entity Identity Management
  5. Entity Relationship Analysis

The first one is often related to text mining, where you have to extrat a reference to an entity from some form of free text. If you work with disperate structured data sets this one may be less common. The next is the data preperation phase. This will be tasks similar to the above deduplication and canonicalization. The entity reference phase is the part that is commonly called record linkage, where we match references that share similar attributes. What we hope to do is find to references to the same entity and link them. The next phase is the identity management, this is where we resolve cand mantian the real world entites as encoded data.

One item to note is identity resolution is different than entity resolution.

We can determine if two sets of fingerprints are for the same or different suspect without ever knowing the identity. Thus we can say that two crimes scenes are the same criminal or not but have no idea who the criminal really is. This is entity resolution.

If we have a fingerprint from a crime scene and get a hit in a database of previosly incarcerated entities then we know the identity. This is identity resolution.

How we resolve the entites is based on Linking which can be done in a few different ways:

Direct matching means to compare attributes and can be done via

Transitive Linking is similar to the mathematical notion, if A links to B and B links to C, then transitive linking implies that A also links to C.

Association is when we have multiple types of entities, say people and houses. If we know only one person lives at a house we can say that two sources with people and there homes, that if the home is resolved and linked then the people attached to those houses are linked as well.

Assertion is done by a person and can be also be called knowldege based.

Some other terms you will here.

No matter which problem we are solving we have a few basic parameters to keep in mind.

R - The number of records M - The set of matches N - The set of non-matches E - The set of entities L - The set of links

We also need to recognize what problem we are solving. If we are doing record linkage it really means we have two deduplicated and canonical sources so we could at most have on link per entity, but we may have many matches. In other problems such as deduplication we may have many mancy valid matches. It is common though that we have few real links, this is because for any set of data we have some number of observations in one set, call it A and a similar count in the second, call it B. We have A * B possible matches, but most should nto be matched. We can take the true matches as true positives and similar for each of the other supervised learnign outcomes, false postive, false negative and true negatives and create all the same performance measures.

Example

options(stringsAsFactors = FALSE)

# Load libraries
library(httr)
library(XML)
library(dplyr)
library(rvest)
library(lubridate)
library(purrr)
library(fuzzyjoin)

This gets a full set of all NBA players. We can consider this to be a full set of identities. It has already been deduplicated, but it won’t be fully canonicalized.

'http://www.basketball-reference.com/players/' %>%
  paste0(letters[-24]) %>%
  map_df(~readHTMLTable(.)[[1]]) -> players
head(players)
## Source: local data frame [6 x 8]
## 
##                 Player  From    To   Pos    Ht    Wt        Birth Date
##                  (chr) (chr) (chr) (chr) (chr) (chr)             (chr)
## 1       Alaa Abdelnaby  1991  1995   F-C  6-10   240     June 24, 1968
## 2      Zaid Abdul-Aziz  1969  1978   C-F   6-9   235     April 7, 1946
## 3 Kareem Abdul-Jabbar*  1970  1989     C   7-2   225    April 16, 1947
## 4   Mahmoud Abdul-Rauf  1991  2001     G   6-1   162     March 9, 1969
## 5    Tariq Abdul-Wahad  1998  2003     F   6-6   223  November 3, 1974
## 6  Shareef Abdur-Rahim  1997  2008     F   6-9   225 December 11, 1976
## Variables not shown: College (chr)

Now we need to do some data preperation on the table to make it more usable. I am also going to remove a dew fields that we are not going to use, but in reality you would keep these for complete canonicalization.

players %>% 
  mutate(From = as.numeric(From),
         To = as.numeric(To)) %>%
  rename(name = Player) %>%
  select(-`Birth Date`, -College, -Ht, -Wt) -> players

# This denotes players that are in the Hall of Fame.
players$name <- gsub('*', '', players$name, fixed = T)
head(players)
## Source: local data frame [6 x 4]
## 
##                  name  From    To   Pos
##                 (chr) (dbl) (dbl) (chr)
## 1      Alaa Abdelnaby  1991  1995   F-C
## 2     Zaid Abdul-Aziz  1969  1978   C-F
## 3 Kareem Abdul-Jabbar  1970  1989     C
## 4  Mahmoud Abdul-Rauf  1991  2001     G
## 5   Tariq Abdul-Wahad  1998  2003     F
## 6 Shareef Abdur-Rahim  1997  2008     F

Now we need to aqcuire a dataset to match against.

'http://espn.go.com/nba/boxscore?gameId=290324004' %>%
  read_html %>%
  html_nodes('table') %>% 
  html_nodes('.name') %>% 
  html_text -> game
head(game)
## [1] "starters"     "A. McDyessPF" "T. PrinceSF"  "K. BrownC"   
## [5] "R. StuckeyPG" "A. AfflaloSG"

We obviously have to do some data preperation on this as well! This is really turning the list into a data.frame and removing things that are not names and rows that are really just spaces.

game %>% 
  data.frame(game = .) %>% 
  filter(!game %in% c('starters', 'bench', 'TEAM')) %>%
  filter(nchar(game) > 1) -> game
head(game)
##           game
## 1 A. McDyessPF
## 2  T. PrinceSF
## 3    K. BrownC
## 4 R. StuckeyPG
## 5 A. AfflaloSG
## 6  W. SharpePF

We need to seperate all of the text into its own fields. We must also create fields that exist in the set we are trying to match against.

game$f_init <- sapply(strsplit(game$game, '. ', fixed = T), `[[`, 1)
game$name <- sapply(strsplit(game$game, '. ', fixed = T), `[[`, 2)

game$pos <- NA

for (i in c('PG$', 'SG$', 'G$', 'PF$', 'SF$', 'C$')) {
  game$pos <- ifelse(grepl(i, game$name), i, game$pos)
  game$name <- gsub(i, '', game$name)
}

game$pos <- gsub('$', '', game$pos, fixed = T)

game %>% mutate(year = 2009) %>% 
  select(-game) %>% 
  rename(last = name) -> game
head(game)
##   f_init    last pos year
## 1      A McDyess  PF 2009
## 2      T  Prince  SF 2009
## 3      K   Brown   C 2009
## 4      R Stuckey  PG 2009
## 5      A Afflalo  SG 2009
## 6      W  Sharpe  PF 2009

Now we need to add to the original data the fields which are needed to match against.

# 'last, first' -> 'last', 'first'
players$first <- sapply(strsplit(players$name, ' ', fixed = T), `[[`, 1)
players$last <- sapply(strsplit(players$name, ' ', fixed = T), `[[`, 2)
players$f_init <- substr(players$first, 1, 1)
head(players)
## Source: local data frame [6 x 7]
## 
##                  name  From    To   Pos   first         last f_init
##                 (chr) (dbl) (dbl) (chr)   (chr)        (chr)  (chr)
## 1      Alaa Abdelnaby  1991  1995   F-C    Alaa    Abdelnaby      A
## 2     Zaid Abdul-Aziz  1969  1978   C-F    Zaid   Abdul-Aziz      Z
## 3 Kareem Abdul-Jabbar  1970  1989     C  Kareem Abdul-Jabbar      K
## 4  Mahmoud Abdul-Rauf  1991  2001     G Mahmoud   Abdul-Rauf      M
## 5   Tariq Abdul-Wahad  1998  2003     F   Tariq  Abdul-Wahad      T
## 6 Shareef Abdur-Rahim  1997  2008     F Shareef  Abdur-Rahim      S

Now we can try to resolve who played in this game against all NBA players. Is this entity resolution or identity resolution? Since we considered the set of all NBA players as the reference list this is more of an identity resolution problem, we need to match each observation to the real reference entity. This also adds a constraint that we cannot create multiple links for any refrence entity.

m <- players %>% select(name, Pos, last, first, f_init, From, To)

game %>% left_join(m) %>% as.data.frame()
## Joining by: c("f_init", "last")
##    f_init     last pos year              name  Pos     first From   To
## 1       A  McDyess  PF 2009   Antonio McDyess  F-C   Antonio 1996 2011
## 2       T   Prince  SF 2009   Tayshaun Prince    F  Tayshaun 2003 2016
## 3       K    Brown   C 2009     Kedrick Brown    G   Kedrick 2002 2005
## 4       K    Brown   C 2009       Kwame Brown    F     Kwame 2002 2013
## 5       R  Stuckey  PG 2009    Rodney Stuckey    G    Rodney 2008 2016
## 6       A  Afflalo  SG 2009     Arron Afflalo    G     Arron 2008 2016
## 7       W   Sharpe  PF 2009     Walter Sharpe    F    Walter 2009 2009
## 8       A  Johnson  PF 2009 Alexander Johnson    F Alexander 2007 2008
## 9       A  Johnson  PF 2009      Amir Johnson    F      Amir 2006 2016
## 10      A  Johnson  PF 2009      Andy Johnson  F-G      Andy 1959 1962
## 11      A  Johnson  PF 2009   Anthony Johnson    G   Anthony 1998 2010
## 12      A  Johnson  PF 2009     Armon Johnson    G     Armon 2011 2012
## 13      A  Johnson  PF 2009     Arnie Johnson  F-C     Arnie 1949 1953
## 14      A  Johnson  PF 2009     Avery Johnson    G     Avery 1989 2004
## 15      J  Maxiell  PF 2009     Jason Maxiell    F     Jason 2006 2015
## 16      W Heinrich  PF 2009              <NA> <NA>      <NA>   NA   NA
## 17      R  Wallace  PF 2009   Rasheed Wallace  F-C   Rasheed 1996 2013
## 18      R  Wallace  PF 2009       Red Wallace    G       Red 1947 1947
## 19      W    Bynum  PG 2009        Will Bynum    G      Will 2006 2015
## 20      R Hamilton  SG 2009    Ralph Hamilton  G-F     Ralph 1949 1949
## 21      R Hamilton  SG 2009  Richard Hamilton  G-F   Richard 2000 2013
## 22      R Hamilton  SG 2009      Roy Hamilton    G       Roy 1980 1981
## 23      A  Iverson  SG 2009     Allen Iverson    G     Allen 1997 2010
## 24      T   Thomas  PF 2009      Terry Thomas    F     Terry 1976 1976
## 25      T   Thomas  PF 2009        Tim Thomas    F       Tim 1998 2010
## 26      T   Thomas  PF 2009      Tyrus Thomas    F     Tyrus 2007 2015
## 27      J  Salmons  SF 2009      John Salmons    G      John 2003 2015
## 28      J     Noah   C 2009       Joakim Noah    C    Joakim 2008 2016
## 29      K  Hinrich  SG 2009      Kirk Hinrich    G      Kirk 2004 2016
## 30      B   Gordon  SG 2009        Ben Gordon    G       Ben 2005 2015
## 31      L  Johnson  PF 2009     Larry Johnson    G     Larry 1978 1978
## 32      L  Johnson  PF 2009     Larry Johnson    F     Larry 1992 2001
## 33      L  Johnson  PF 2009       Lee Johnson    F       Lee 1981 1981
## 34      L  Johnson  PF 2009    Linton Johnson    F    Linton 2004 2009
## 35      T   Thomas  PF 2009      Terry Thomas    F     Terry 1976 1976
## 36      T   Thomas  PF 2009        Tim Thomas    F       Tim 1998 2010
## 37      T   Thomas  PF 2009      Tyrus Thomas    F     Tyrus 2007 2015
## 38      L     Deng  SF 2009         Luol Deng    F      Luol 2005 2016
## 39      A     Gray   C 2009        Aaron Gray    C     Aaron 2008 2014
## 40      J    James   C 2009      Jerome James    C    Jerome 1999 2009
## 41      B   Miller   C 2009       Bill Miller    F      Bill 1949 1949
## 42      B   Miller   C 2009        Bob Miller    F       Bob 1984 1984
## 43      B   Miller   C 2009       Brad Miller    C      Brad 1999 2012
## 44      D     Rose  PG 2009      Derrick Rose    G   Derrick 2009 2016
## 45      A Roberson  SG 2009    Andre Roberson  G-F     Andre 2014 2016
## 46      A Roberson  SG 2009  Anthony Roberson    G   Anthony 2006 2009
## 47      L   Hunter  SG 2009        Les Hunter  F-C       Les 1965 1973
## 48      L   Hunter  SG 2009    Lindsey Hunter    G   Lindsey 1994 2010

We are getting back more players than we would expect. This means we have more matches than links that we need to create. Or in data science speek we have some things in M, the set of matches that should be in N, the non-matches, so a few false positves. We can take additional steps to remove these flase positves, but this can also contribute to more false negatives. We can see that we are not fully utilizing all of the information. Arnie Johnson played from 1949 to 1953, so it is not a valid match for this game. We can utilize the time of the game alongside the career window of a player to help reduce the false positives. Since we considered this to be Identity resoloution becuase the the first list is the authoriative list of NBA players. We also know that we can only have one player from the game that should be matched to this list. This is not a canonicalization problem where we may have the same entity many times with different information that we want to pull together. We also know that we need a link for each.

m <- players %>% filter(From <= 2009, To >= 2009) %>%
  select(name, Pos, last, first, f_init)

game %>% left_join(m) %>% as.data.frame()
## Joining by: c("f_init", "last")
##    f_init     last pos year             name  Pos    first
## 1       A  McDyess  PF 2009  Antonio McDyess  F-C  Antonio
## 2       T   Prince  SF 2009  Tayshaun Prince    F Tayshaun
## 3       K    Brown   C 2009      Kwame Brown    F    Kwame
## 4       R  Stuckey  PG 2009   Rodney Stuckey    G   Rodney
## 5       A  Afflalo  SG 2009    Arron Afflalo    G    Arron
## 6       W   Sharpe  PF 2009    Walter Sharpe    F   Walter
## 7       A  Johnson  PF 2009     Amir Johnson    F     Amir
## 8       A  Johnson  PF 2009  Anthony Johnson    G  Anthony
## 9       J  Maxiell  PF 2009    Jason Maxiell    F    Jason
## 10      W Heinrich  PF 2009             <NA> <NA>     <NA>
## 11      R  Wallace  PF 2009  Rasheed Wallace  F-C  Rasheed
## 12      W    Bynum  PG 2009       Will Bynum    G     Will
## 13      R Hamilton  SG 2009 Richard Hamilton  G-F  Richard
## 14      A  Iverson  SG 2009    Allen Iverson    G    Allen
## 15      T   Thomas  PF 2009       Tim Thomas    F      Tim
## 16      T   Thomas  PF 2009     Tyrus Thomas    F    Tyrus
## 17      J  Salmons  SF 2009     John Salmons    G     John
## 18      J     Noah   C 2009      Joakim Noah    C   Joakim
## 19      K  Hinrich  SG 2009     Kirk Hinrich    G     Kirk
## 20      B   Gordon  SG 2009       Ben Gordon    G      Ben
## 21      L  Johnson  PF 2009   Linton Johnson    F   Linton
## 22      T   Thomas  PF 2009       Tim Thomas    F      Tim
## 23      T   Thomas  PF 2009     Tyrus Thomas    F    Tyrus
## 24      L     Deng  SF 2009        Luol Deng    F     Luol
## 25      A     Gray   C 2009       Aaron Gray    C    Aaron
## 26      J    James   C 2009     Jerome James    C   Jerome
## 27      B   Miller   C 2009      Brad Miller    C     Brad
## 28      D     Rose  PG 2009     Derrick Rose    G  Derrick
## 29      A Roberson  SG 2009 Anthony Roberson    G  Anthony
## 30      L   Hunter  SG 2009   Lindsey Hunter    G  Lindsey

This helped clear away a lot of the false matches. Since this was done in a deterministic way there is no way to take the most likely, high proabability match. It was rule driven so we needed to refine the rules.

One thing we see here is that we have a few matches that occur many times. This is becuase our input was not unique.

game[duplicated(game), ]
##    f_init   last pos year
## 20      T Thomas  PF 2009

So this guy appears twice. Really one is named Tim and the other Tyrus. There is no way to resolve this given this data frame. We could add further info if we had further sets of data that denote which team each person played for or even the height and weight. We have position here, which in other cases could work but these players are both point gaurds.

game %>% left_join(m) %>% as.data.frame() %>% distinct
## Joining by: c("f_init", "last")
##    f_init     last pos year             name  Pos    first
## 1       A  McDyess  PF 2009  Antonio McDyess  F-C  Antonio
## 2       T   Prince  SF 2009  Tayshaun Prince    F Tayshaun
## 3       K    Brown   C 2009      Kwame Brown    F    Kwame
## 4       R  Stuckey  PG 2009   Rodney Stuckey    G   Rodney
## 5       A  Afflalo  SG 2009    Arron Afflalo    G    Arron
## 6       W   Sharpe  PF 2009    Walter Sharpe    F   Walter
## 7       A  Johnson  PF 2009     Amir Johnson    F     Amir
## 8       A  Johnson  PF 2009  Anthony Johnson    G  Anthony
## 9       J  Maxiell  PF 2009    Jason Maxiell    F    Jason
## 10      W Heinrich  PF 2009             <NA> <NA>     <NA>
## 11      R  Wallace  PF 2009  Rasheed Wallace  F-C  Rasheed
## 12      W    Bynum  PG 2009       Will Bynum    G     Will
## 13      R Hamilton  SG 2009 Richard Hamilton  G-F  Richard
## 14      A  Iverson  SG 2009    Allen Iverson    G    Allen
## 15      T   Thomas  PF 2009       Tim Thomas    F      Tim
## 16      T   Thomas  PF 2009     Tyrus Thomas    F    Tyrus
## 17      J  Salmons  SF 2009     John Salmons    G     John
## 18      J     Noah   C 2009      Joakim Noah    C   Joakim
## 19      K  Hinrich  SG 2009     Kirk Hinrich    G     Kirk
## 20      B   Gordon  SG 2009       Ben Gordon    G      Ben
## 21      L  Johnson  PF 2009   Linton Johnson    F   Linton
## 22      L     Deng  SF 2009        Luol Deng    F     Luol
## 23      A     Gray   C 2009       Aaron Gray    C    Aaron
## 24      J    James   C 2009     Jerome James    C   Jerome
## 25      B   Miller   C 2009      Brad Miller    C     Brad
## 26      D     Rose  PG 2009     Derrick Rose    G  Derrick
## 27      A Roberson  SG 2009 Anthony Roberson    G  Anthony
## 28      L   Hunter  SG 2009   Lindsey Hunter    G  Lindsey

There is another similar collision with A Johnson, but this one we could use the position to resolve. We also have another that gets no match. This is because a differnt name is used. We can see all sorts of problems here. We can obviously do some things to make this work in this case but in the reality of the problem we would see new sets of data coming in that we may have similar issues, but nobody to use Wikipedia or ESPN to figure out what to do. This is the point that we think about probabilistic links. We need to qualify the strength of a match then link the highest. This will have errors, but all data science has errors. All models are wrong, but some are useful!

get_pl <- function(game, year) {
  game %>%
    paste0('http://espn.go.com/nba/boxscore?gameId=', .) %>%
    read_html %>%
    html_nodes('table') %>% 
    html_nodes('.name') %>% 
    html_text %>% 
    data.frame(game = .) %>% 
    filter(!game %in% c('starters', 'bench', 'TEAM')) %>%
    filter(nchar(game) > 1) -> game
  
  
  game$f_init <- sapply(strsplit(game$game, '. ', fixed = T), `[[`, 1)
  game$name <- sapply(strsplit(game$game, '. ', fixed = T), `[[`, 2)
  
  game$pos <- NA
  
  for (i in c('PG$', 'SG$', 'G$', 'PF$', 'SF$', 'C$')) {
    game$pos <- ifelse(grepl(i, game$name), i, game$pos)
    game$name <- gsub(i, '', game$name)
  }
  
  game$pos <- gsub('$', '', game$pos, fixed = T)
  
  game %>% mutate(year = year) %>% select(-game) %>% rename(last = name)
}
x1 <- get_pl('400829015', 2016)

m <- players %>% filter(From <= 2016, To >= 2016) %>%
  select(name, last, first, f_init)

x1 %>% left_join(m, by = c('last', 'f_init')) %>% as.data.frame()
##    f_init         last pos year             name     first
## 1       C       Landry  PF 2016      Carl Landry      Carl
## 2       J        Grant  SF 2016     Jerami Grant    Jerami
## 3       J        Grant  SF 2016     Jerian Grant    Jerian
## 4       I       Canaan  PG 2016    Isaiah Canaan    Isaiah
## 5       I        Smith  PG 2016        Ish Smith       Ish
## 6       H     Thompson  SG 2016  Hollis Thompson    Hollis
## 7       E        Brand  PF 2016      Elton Brand     Elton
## 8       R    Covington  SF 2016 Robert Covington    Robert
## 9       K     Marshall  PG 2016 Kendall Marshall   Kendall
## 10    T.J    McConnell  PG 2016             <NA>      <NA>
## 11      N     Stauskas  SG 2016     Nik Stauskas       Nik
## 12      N         Noel  PF 2016     Nerlens Noel   Nerlens
## 13      R       Holmes  PF 2016   Richaun Holmes   Richaun
## 14      C         Wood  PF 2016   Christian Wood Christian
## 15      M     Williams  PF 2016  Marvin Williams    Marvin
## 16      M     Williams  PF 2016      Mo Williams        Mo
## 17      C       Zeller   C 2016      Cody Zeller      Cody
## 18      K       Walker  PG 2016     Kemba Walker     Kemba
## 19      N        Batum  SG 2016    Nicolas Batum   Nicolas
## 20      C          Lee  SG 2016     Courtney Lee  Courtney
## 21      F Kaminsky III   C 2016             <NA>      <NA>
## 22      A    Jefferson   C 2016     Al Jefferson        Al
## 23      J          Lin  PG 2016       Jeremy Lin    Jeremy
## 24      J         Lamb  SG 2016      Jeremy Lamb    Jeremy
## 25      S        Hawes  PF 2016    Spencer Hawes   Spencer
## 26      T   Hansbrough  PF 2016 Tyler Hansbrough     Tyler
## 27      J    Gutierrez  PG 2016  Jorge Gutierrez     Jorge
## 28      T      Daniels  SG 2016     Troy Daniels      Troy

The fuzzyjoin package can help some here but still needs a few features to get us all the way to where we need to be.

x1 %>%
  stringdist_left_join(m, max_dist = 1, by = c('last', 'f_init')) %>% 
  select(-year, -name) %>%
  as.data.frame()
##    f_init.x       last.x pos     last.y     first f_init.y
## 1         C       Landry  PF     Landry      Carl        C
## 2         J        Grant  SF      Grant    Jerami        J
## 3         J        Grant  SF      Grant    Jerian        J
## 4         I       Canaan  PG     Canaan    Isaiah        I
## 5         I        Smith  PG      Smith      Greg        G
## 6         I        Smith  PG      Smith       Ish        I
## 7         I        Smith  PG      Smith      J.R.        J
## 8         I        Smith  PG      Smith     Jason        J
## 9         I        Smith  PG      Smith      Josh        J
## 10        I        Smith  PG      Smith      Russ        R
## 11        H     Thompson  SG   Thompson    Hollis        H
## 12        H     Thompson  SG   Thompson     Jason        J
## 13        H     Thompson  SG   Thompson      Klay        K
## 14        H     Thompson  SG   Thompson   Tristan        T
## 15        E        Brand  PF      Brand     Elton        E
## 16        R    Covington  SF  Covington    Robert        R
## 17        K     Marshall  PG   Marshall   Kendall        K
## 18      T.J    McConnell  PG       <NA>      <NA>     <NA>
## 19        N     Stauskas  SG   Stauskas       Nik        N
## 20        N         Noel  PF       Noel   Nerlens        N
## 21        R       Holmes  PF     Holmes   Richaun        R
## 22        C         Wood  PF       Hood    Rodney        R
## 23        C         Wood  PF       Wood Christian        C
## 24        M     Williams  PF   Williams      Alan        A
## 25        M     Williams  PF   Williams     Deron        D
## 26        M     Williams  PF   Williams   Derrick        D
## 27        M     Williams  PF   Williams    Elliot        E
## 28        M     Williams  PF   Williams       Lou        L
## 29        M     Williams  PF   Williams    Marvin        M
## 30        M     Williams  PF   Williams        Mo        M
## 31        C       Zeller   C     Zeller      Cody        C
## 32        C       Zeller   C     Zeller     Tyler        T
## 33        K       Walker  PG     Walker     Kemba        K
## 34        N        Batum  SG      Batum   Nicolas        N
## 35        C          Lee  SG        Gee    Alonzo        A
## 36        C          Lee  SG        Lee  Courtney        C
## 37        C          Lee  SG        Lee     David        D
## 38        C          Lee  SG        Len      Alex        A
## 39        F Kaminsky III   C       <NA>      <NA>     <NA>
## 40        A    Jefferson   C  Jefferson        Al        A
## 41        A    Jefferson   C  Jefferson      Cory        C
## 42        A    Jefferson   C  Jefferson   Richard        R
## 43        J          Lin  PG        Len      Alex        A
## 44        J          Lin  PG        Lin    Jeremy        J
## 45        J         Lamb  SG       Lamb    Jeremy        J
## 46        S        Hawes  PF      Hawes   Spencer        S
## 47        S        Hawes  PF      Hayes     Chuck        C
## 48        T   Hansbrough  PF Hansbrough     Tyler        T
## 49        J    Gutierrez  PG  Gutierrez     Jorge        J
## 50        T      Daniels  SG    Daniels      Troy        T

We get a ton of hits but this is do to the distance allowed being one. If we set it to less than one it turns into an exact match. In this case what we need is a way to have a different distance on each field to join on, or even a hiearchy of how we match on differnt fields. Then we need another method that actually uses the distance between two that we can use to create a link from all of the matches.

So in no way did I entirely solve this issue, but that was not really my goal. The goal was more to highlight some of the challanges and possible steps to solve them as well as getting aquanted with the intricacies of the problem.