Below are two data frames that I would like to join by first and last name using dplyr.
full_join(Names1, Names2, by = c("FirstNames", "LastNames"))
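However, the spelling and formatting of some names differ between the two. I would like to use RecordLinkage (or adist) to match similar names and then overwrite them, so that the names agree in both data frames before the join. (I also realize that the "FirstNames" and "LastNames" columns have different headers in the two data frames, so those will have to be changed.)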
FirstNames<-c("Chris","Shintaro","Doug","Elsa","Bubbles","Kelly","Christine")
LastNames<-c("MacDougall","Yamazaki","Shapiro","Elizabeth Ray","Murphy","Anderson","Yamaguchi")
Pets<-c("Cat","Dog","Cat","Dog","Cat","Snake","Eagle")
Names1<-data.frame(FirstNames,LastNames,Pets)
FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
Dwelling<-c("House","House","Apartment","Condo","House")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling)
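Below are some steps I tried with the RecordLinkage package, but I ran into the error "Error: All select() inputs must resolve to integer column positions." Is this because the data frames have different numbers of rows?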
Results <- compare.linkage(Names1, Names2, blockfld = 1, strcmp = T, exclude = 3)
PairsSelect <- Results$pairs %>% select(firstNameSim = FirstNames, lastNameSim = LastNames)
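I would like to continue from the PairsSelect data above with code that links the name columns, so that I can find matches in the first-name and last-name columns and then overwrite both data frames, giving each the same name spelling and formatting before the join. I am not sure how best to handle the fact that one data frame (Names1) contains more rows than the other (Names2).
Any guidance on how to move forward with this would be greatly appreciated!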
Answer 0 (score: 0)
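I kept the keys separate from the data that goes into the compare.linkage() function. That is a bit dodgy from a database point of view, but as long as you do not reorder the rows of the data frames between calling compare.linkage() and joining, everything that comes back out will still line up correctly.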
library(dplyr)
library(RecordLinkage)
FirstNames<-c("Chris","Shintaro","Doug","Elsa","Bubbles","Kelly","Christine","Thomas","George")
LastNames<-c("MacDougall","Yamazaki","Shapiro","Elizabeth Ray","Murphy","Anderson","Yamaguchi","DelTorre","Fibonacci")
Pets<-c("Cat","Dog","Cat","Dog","Cat","Snake","Eagle","Shark","Parrot")
Names1<-data.frame(FirstNames,LastNames,Pets,FullNames = paste0(FirstNames, " ", LastNames))
FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa","George")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth","Fibos")
Dwelling<-c("House","House","Apartment","Condo","House","Camper")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling,FullNames2 = paste0(FirstNames2, " ", LastNames2))
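# Add row-number keys; the pairs returned by compare.linkage() refer to rows via id1 and id2
df1 <- mutate(Names1, id1 = row_number())
df2 <- mutate(Names2, id2 = row_number())
# Compare only the full-name column (columns 1 to 3 are excluded)
Results <- compare.linkage(Names1, Names2, strcmp = TRUE, exclude = 1:3)
# Sort the candidate pairs by similarity, best first
BestResults <-
  Results$pairs %>%
  arrange(desc(FullNames))
# Attach the original names to each pair via the row keys
BestResultsJoined <-
  full_join(BestResults, df1, by = "id1") %>%
  full_join(df2, by = "id2") %>%
  select(FullNamesSim = FullNames.x, FullNames1 = FullNames.y, FullNames2, id1, id2)
BestResultsJoined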
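This gives you a list of candidate pairs sorted with the closest matches first. It looks like you would not want to consider anything with a similarity below 0.95 (as measured by Jaro-Winkler similarity).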
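To go from a sorted list to the overwrite-then-join step the question asks for, something along these lines might work. This is only a minimal sketch in base R: it uses adist() (mentioned in the question) with a length-normalised edit distance instead of the Jaro-Winkler score from RecordLinkage, so the 0.95 figure above does not carry over. The toy data frames names1 and names2 and the 0.85 cutoff are assumptions for illustration and would need tuning on real data.

```r
# Toy data mirroring the full-name columns used above (hypothetical subset).
names1 <- data.frame(
  FullNames = c("Chris MacDougall", "Shintaro Yamazaki", "Doug Shapiro",
                "Elsa Elizabeth Ray", "George Fibonacci"),
  Pets = c("Cat", "Dog", "Cat", "Dog", "Parrot"),
  stringsAsFactors = FALSE
)
names2 <- data.frame(
  FullNames2 = c("Chris MacDougal", "Doug Shapiro", "Shintaro Yamazaku",
                 "Elsa Elizabeth", "George Fibos"),
  Dwelling = c("House", "House", "Apartment", "House", "Camper"),
  stringsAsFactors = FALSE
)

# Edit distance between every name in names2 (rows) and names1 (columns).
d <- adist(names2$FullNames2, names1$FullNames)

# Normalise to a 0-1 similarity: 1 - distance / length of the longer string.
max_len <- outer(nchar(names2$FullNames2), nchar(names1$FullNames), pmax)
sim <- 1 - d / max_len

# Best candidate in names1 for each row of names2, and its similarity.
best <- apply(sim, 1, which.max)
best_sim <- sim[cbind(seq_len(nrow(sim)), best)]

# Assumed cutoff for this normalised edit-distance score; not the 0.95
# Jaro-Winkler figure mentioned above.
threshold <- 0.85

# Overwrite the spelling in names2 only where the match is strong enough.
names2$FullNames <- ifelse(best_sim >= threshold,
                           names1$FullNames[best],
                           names2$FullNames2)

# Full join on the harmonised key (base-R counterpart of dplyr::full_join).
joined <- merge(names1, names2, by = "FullNames", all = TRUE)
joined
```

Because each row of names2 is matched against all rows of names1 independently, the differing row counts do not matter; names that fall below the cutoff simply stay unmatched and appear as their own rows in the full join.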