Using RecordLinkage to match unequal name columns in two data frames before joining with dplyr

Asked: 2016-06-13 02:51:04

Tags: r string dplyr

Below are two data frames that I want to join on first and last name using dplyr.

full_join(Names1, Names2, by = c("FirstNames", "LastNames"))

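However, the spelling and formatting of some names differ between the two. I would like to use RecordLinkage (or adist) to match similar names and then overwrite them, so that the names agree in both data frames before the join. (I also realize that the "FirstNames" and "LastNames" columns have different headers in the two data frames, so that will have to change as well.)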

FirstNames<-c("Chris","Shintaro","Doug","Elsa","Bubbles","Kelly","Christine")
LastNames<-c("MacDougall","Yamazaki","Shapiro","Elizabeth Ray","Murphy","Anderson","Yamaguchi")
Pets<-c("Cat","Dog","Cat","Dog","Cat","Snake","Eagle")
Names1<-data.frame(FirstNames,LastNames,Pets)

FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
Dwelling<-c("House","House","Apartment","Condo","House")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling)
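On the differing column headers mentioned above: dplyr joins accept a named character vector for by, which pairs a column in the first data frame with a differently named column in the second, so renaming is not strictly required. The following is a minimal sketch on a trimmed-down copy of the data (the two-row data frames are an assumption made for brevity); it handles only the header mismatch, not the spelling differences.

```r
library(dplyr)

# Trimmed-down toy versions of Names1 and Names2 (assumed here for brevity)
Names1 <- data.frame(
  FirstNames = c("Doug", "Kelly"),
  LastNames  = c("Shapiro", "Anderson"),
  Pets       = c("Cat", "Snake"),
  stringsAsFactors = FALSE
)
Names2 <- data.frame(
  FirstNames2 = c("Doug", "Bubbles"),
  LastNames2  = c("Shapiro", "Murphy"),
  Dwelling    = c("House", "Condo"),
  stringsAsFactors = FALSE
)

# Left-hand names come from Names1, right-hand names from Names2;
# the result keeps the Names1 column names for the key columns.
joined <- full_join(
  Names1, Names2,
  by = c("FirstNames" = "FirstNames2", "LastNames" = "LastNames2")
)
joined
```

Only exactly equal names line up this way, so "MacDougall" and "MacDougal" would still end up on separate rows.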

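Below are some steps I tried with the RecordLinkage package, but I ran into the error "Error: All select() inputs must resolve to integer column positions." Is this because the data frames have different numbers of rows?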

Results <- compare.linkage(Names1, Names2, blockfld = 1, strcmp = T, exclude = 3)
PairsSelect <- 
Results$pairs %>% 
select(firstNameSim = FirstNames, lastNameSim = LastNames)

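I would like to continue from the PairsSelect data above with code that links the name columns, so that I can find matches in the first-name and last-name columns and then overwrite the names, giving both data frames the same spelling and formatting before the join. I am not sure how best to handle the fact that one data frame (Names1) has more rows than the other (Names2).

Any guidance on how to move forward with this would be greatly appreciated!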

1 answer:

Answer 0 (score: 0)

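I keep the keys separate from the data that goes into the 'compare.linkage' function. That is a bit dodgy from a database point of view, but as long as you do not reorder the rows of the data frames between calling compare.linkage() and joining its output back to them, everything stays correctly aligned.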

library(dplyr)
library(RecordLinkage)

FirstNames<-c("Chris","Shintaro","Doug","Elsa","Bubbles","Kelly","Christine","Thomas","George")
LastNames<-c("MacDougall","Yamazaki","Shapiro","Elizabeth Ray","Murphy","Anderson","Yamaguchi","DelTorre","Fibonacci")
Pets<-c("Cat","Dog","Cat","Dog","Cat","Snake","Eagle","Shark","Parrot")
Names1<-data.frame(FirstNames,LastNames,Pets,FullNames = paste0(FirstNames, " ", LastNames))

FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa","George")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth","Fibos")
Dwelling<-c("House","House","Apartment","Condo","House","Camper")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling,FullNames2 = paste0(FirstNames2, " ", LastNames2))

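# Add a row id to each data frame; compare.linkage() reports pairs by row number (id1, id2)
df1 <- mutate(Names1, id1 = row_number())
df2 <- mutate(Names2, id2 = row_number())

# Compare only the full-name column (columns 1 to 3 are excluded) using string similarity
Results <- compare.linkage(Names1, Names2, strcmp = TRUE, exclude = 1:3)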

BestResults <- 
    Results$pairs %>% 
    arrange(desc(FullNames))

BestResultsJoined<- 
    full_join(BestResults,df1,by="id1")%>% 
    full_join(df2,by="id2")%>% 
    select(FullNamesSim=FullNames.x,FullNames1=FullNames.y,FullNames2,id1,id2)
BestResultsJoined

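This gives you a list of candidate pairs sorted from the closest match down. It looks like you would not want to consider anything with a similarity below 0.95 (by Jaro-Winkler distance).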
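To get from a sorted list of pairs to the overwrite-then-join step the question asks about, something along these lines might work. It is only a sketch under stated assumptions: it swaps in base R's adist() (Levenshtein distance, which the question mentions) for Jaro-Winkler so that it runs without RecordLinkage, it uses the question's original data, and the helper names (full1, full2, sim, best) and the 0.7 cut-off are hypothetical. The cut-off is on a different scale from the 0.95 Jaro-Winkler figure above and would need tuning on real data.

```r
# Data from the question
Names1 <- data.frame(
  FirstNames = c("Chris", "Shintaro", "Doug", "Elsa", "Bubbles", "Kelly", "Christine"),
  LastNames  = c("MacDougall", "Yamazaki", "Shapiro", "Elizabeth Ray", "Murphy",
                 "Anderson", "Yamaguchi"),
  Pets       = c("Cat", "Dog", "Cat", "Dog", "Cat", "Snake", "Eagle"),
  stringsAsFactors = FALSE
)
Names2 <- data.frame(
  FirstNames2 = c("Chris", "Doug", "Shintaro", "Bubbles", "Elsa"),
  LastNames2  = c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth"),
  Dwelling    = c("House", "House", "Apartment", "Condo", "House"),
  stringsAsFactors = FALSE
)

full1 <- paste(Names1$FirstNames, Names1$LastNames)
full2 <- paste(Names2$FirstNames2, Names2$LastNames2)

# Levenshtein distances: one row per Names2 name, one column per Names1 name
d <- adist(full2, full1)

# Rescale to a similarity in [0, 1] by dividing by the longer string's length
sim <- 1 - d / outer(nchar(full2), nchar(full1), pmax)

# For each Names2 row, the index and score of its most similar Names1 row
best     <- apply(sim, 1, which.max)
best_sim <- sim[cbind(seq_along(best), best)]

threshold <- 0.7                 # hypothetical cut-off, tune for real data
accept    <- best_sim >= threshold

# Overwrite the Names2 spelling with the Names1 spelling where a match is accepted
Names2$FirstNames <- ifelse(accept, Names1$FirstNames[best], Names2$FirstNames2)
Names2$LastNames  <- ifelse(accept, Names1$LastNames[best],  Names2$LastNames2)

# Full outer join on the harmonised names (dplyr::full_join would work as well)
joined <- merge(
  Names1,
  Names2[, c("FirstNames", "LastNames", "Dwelling")],
  by = c("FirstNames", "LastNames"),
  all = TRUE
)
joined
```

Because the loop runs over the rows of the smaller data frame and looks up its best partner in the larger one, the unequal row counts are not a problem: Names1 rows with no partner simply come through the join with NA in Dwelling. Note that two Names2 rows could pick the same Names1 row; this sketch does not guard against that.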