我想知道在dplyr中进行连接之前是否可以轻松比较列。以下是两个简单的数据帧。我想根据名字和姓氏进行完全加入,但是有一些拼写错误或不同的格式,例如" Elizabeth Ray" vs"伊丽莎白"。
我想在加入之前比较这些列。我希望有一种方法可以产生所有与索引差异的列表或向量,这样我就可以在加入之前纠正它们。
如果有更简单的方法,我也会对此持开放态度,但我希望能采用最简单的方法。我想要一个基于dplyr,tidyr和stringr的解决方案。
FirstNames<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames<-c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
Pets<-c("Cat","Dog","Cat","Dog","Cat")
Names1<-data.frame(FirstNames,LastNames,Pets)
FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
Dwelling<-c("House","House","Apartment","Condo","House")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling)
答案 0 :(得分:0)
我正在回答,因为我无法访问评论
df = Names1[!(Names1$LastNames %in% Names2$LastNames2), ]
尝试一下代码。
答案 1 :(得分:0)
为了比较您的记录之间的相似性,我认为您可能正在寻找一种方法来将一定量的模糊逻辑匹配应用于您的名称比较任务。 AKA:应用String Distance Function 执行Record Linkage任务时。 (如果你已经知道这一切,请原谅我 - 但这些关键词在开始时对我有很大的帮助。)
有一个名为stringdist
的优秀软件包适用于这些应用程序,但recordlinkage可能会帮助您最快地完成对齐数据框架的工作。
如果您希望查看姓氏和姓氏最相似的值,请使用以下代码:
library(RecordLinkage)
library(dplyr)
id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
pet <- c("Cat","Dog","Cat","Dog","Cat")
Names1 <- data.frame(id, firstName, lastName, pet)
id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
dwelling <- c("House","House","Apartment","Condo","House")
Names2 <- data.frame(id, firstName2, lastName2, dwelling)
# RecordLinkage function that calculates string distance b/w records in two data frames
Results <- compare.linkage(Names1, Names2, blockfld = 1, strcmp = T, exclude = 4)
Results
# $data1
# firstName lastName pet
# 1 Chris MacDougall Cat
# 2 Doug Shapiro Dog
# 3 Shintaro Yamazaki Cat
# 4 Bubbles Murphy Dog
# 5 Elsa Elizabeth Ray Cat
# $data2
# firstName2 lastName2 dwelling
# 1 Chris MacDougal House
# 2 Doug Shapiro House
# 3 Shintaro Yamazaku Apartment
# 4 Bubbles Murphy Condo
# 5 Elsa Elizabeth House
# $pairs
# id1 id2 id firstName lastName is_match
# 1 1 1 1 1 0.9800000 NA
# 2 2 2 1 1 1.0000000 NA
# 3 3 3 1 1 0.9500000 NA
# 4 4 4 1 1 1.0000000 NA
# 5 5 5 1 1 0.9384615 NA
# $frequencies
# id firstName lastName
# 0.200 0.200 0.125
# $type
# [1] "linkage"
# attr(,"class")
# [1] "RecLinkData"
# Trim $pairs dataframe (seen above) to contain just id's & similarity measures
PairsSelect <-
Results$pairs %>%
select(id = id1, firstNameSim = firstName, lastNameSim = lastName)
# Join original data & string comparison results together
# reorganize data to facilitate review
JoinedResults <-
left_join(Names1, Names2) %>%
left_join(PairsSelect) %>%
select(id, firstNameSim, firstName, firstName2, lastNameSim, lastName, lastName2) %>%
arrange(desc(lastNameSim), desc(firstNameSim), id)
JoinedResults
# id firstNameSim firstName firstName2 lastNameSim lastName lastName2
# 1 2 1 Doug Doug 1.0000000 Shapiro Shapiro
# 2 4 1 Bubbles Bubbles 1.0000000 Murphy Murphy
# 3 1 1 Chris Chris 0.9800000 MacDougall MacDougal
# 4 3 1 Shintaro Shintaro 0.9500000 Yamazaki Yamazaku
# 5 5 1 Elsa Elsa 0.9384615 Elizabeth Ray Elizabeth
# If you want to collect just the perfect matches
PerfectMatches <-
JoinedResults %>%
filter(firstNameSim == 1 & lastNameSim == 1) %>%
select(id, firstName, lastName)
PerfectMatches
# id firstName lastName
# 1 2 Doug Shapiro
# 2 4 Bubbles Murphy
# To collect the matches that are going to need alignment
ImperfectMatches <-
JoinedResults %>%
filter(firstNameSim < 1 | lastNameSim < 1) %>%
mutate(flgFrstNm = 0, flgLstNm = 0)
ImperfectMatches
# id firstNameSim firstName firstName2 lastNameSim lastName lastName2 flgFrstNm flgLstNm
# 1 1 1 Chris Chris 0.9800000 MacDougall MacDougal 0 0
# 2 3 1 Shintaro Shintaro 0.9500000 Yamazaki Yamazaku 0 0
# 3 5 1 Elsa Elsa 0.9384615 Elizabeth Ray Elizabeth 0 0
#
# If you want to enter your column preference in a flag column to facilitate faster rectification...
write.csv(ImperfectMatches, "ImperfectMatches.csv", na = "", row.names = F)
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
# Flag data externally - save file to new name with '_reviewed' appended to filename
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
#reload results
FlaggedMatches <- read.csv("ImperfectMatches_reviewed.csv", stringsAsFactors = F)
FlaggedMatches
## Where a 1 is the 1st data set preferred and 0 (or 2 if that is easier for the 'data processor') means the 2nd data set is preferred.
# id firstNameSim firstName firstName2 lastNameSim lastName lastName2 flgFrstNm flgLstNm
# 1 1 1 Chris Chris 0.9800000 MacDougall MacDougal 1 0
# 2 3 1 Shintaro Shintaro 0.9500000 Yamazaki Yamazaku 1 1
# 3 5 1 Elsa Elsa 0.9384615 Elizabeth Ray Elizabeth 1 0
## Executing Assembly of preferred/rectified firstName and lastName columns
ResolvedMatches <-
FlaggedMatches %>%
mutate(rectifiedFirstName = ifelse(flgFrstNm == 1,firstName, firstName2),
rectifiedLastName = ifelse(flgLstNm == 1, lastName, lastName2)) %>%
select(id, starts_with("rectified"))
ResolvedMatches
# id rectifiedFirstName rectifiedLastName
# 1 1 Chris MacDougal
# 2 3 Shintaro Yamazaki
# 3 5 Elsa Elizabeth
dplyr非常直观,但compare.linkage()
函数可以使用一些解释。
前两个参数很明显:您要比较的两个数据帧(dataframe1和dataframe2)。 [如果您只想将onedataframe内的记录与自身进行比较(以重复记录记录集),那么您可以改为使用compare.dedup()
,并仅引用一个数据帧。
在这种情况下,将blockfld
设置为1或2将分别指定First Name或Last Name上的匹配项必须为100%。相反,您可能希望在数据集中包含主/外键并在blckfld参数中引用该列。或者,如果您的记录实际上并非如此等效构建,则可以完全保留此参数(默认为FALSE
),然后将比较所有可能的组合[数据帧的叉积]。
strcmp
到TRUE
会为您提供应用于您要比较的数据列的字符串距离函数;如果你把它保留为假,那么它只是测试精确的1:1字符串对应。
exclude
也是避免构建中间数据框并仅选择您希望彼此比较的列的好方法:排除3
只是允许我们删除宠物和住宅的比较结果。
上面代码中的4列,键控数据帧(不是原始问题3列数据帧)产生的结果如下:
# $data1
# firstName lastName pet
# 1 Chris MacDougall Cat
# 2 Doug Shapiro Dog
# 3 Shintaro Yamazaki Cat
# 4 Bubbles Murphy Dog
# 5 Elsa Elizabeth Ray Cat
# $data2
# firstName2 lastName2 dwelling
# 1 Chris MacDougal House
# 2 Doug Shapiro House
# 3 Shintaro Yamazaku Apartment
# 4 Bubbles Murphy Condo
# 5 Elsa Elizabeth House
# $pairs
# id1 id2 id firstName lastName is_match
# 1 1 1 1 1 0.9800000 NA
# 2 2 2 1 1 1.0000000 NA
# 3 3 3 1 1 0.9500000 NA
# 4 4 4 1 1 1.0000000 NA
# 5 5 5 1 1 0.9384615 NA
# $frequencies
# id firstName lastName
# 0.200 0.200 0.125
# $type
# [1] "linkage"
# attr(,"class")
# [1] "RecLinkData"
上面的每个部分(例如$ pair)都是它自己的数据框。 添加一个键,您可以将它们全部连接在一起然后引用并使用成对的df作为切换级别门,然后甚至将data1值复制到data2帧中,例如当您有一个&gt;配对评级中的0.95值。 (注意:is_match看起来很重要,但它适用于训练匹配工具,与我们的任务无关。)
无论如何,我希望你发现这些图书馆能够突然增加你的工作量,就像我第一次遇到它时一样令人头疼。
BTW:我还发现这个Comparison of String Distance Algorithms是对目前可用的字符串距离指标的一个很好的调查。
答案 2 :(得分:0)
使用标准adist()
函数作为@alistaire建议提供了一种非常有效的方法(并且可能更有可能是教师寻求使用的。)adist
的字符串度量受到限制广义的Levenshtein(编辑)距离,但这看起来正是你正在寻找的。 p>
代码如下: (因为这看起来像是专门用于数据处理的R编码类的介绍,我在一些最佳实践中添加了抛光到可重现/提出的问题。)
library(dplyr)
id <- c(1:5)
firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
pet <- c("Cat","Dog","Cat","Dog","Cat")
Names1 <- data.frame(id, firstName, lastName, pet)
id <- c(1:5)
firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
dwelling <- c("House","House","Apartment","Condo","House")
Names2 <- data.frame(id, firstName2, lastName2, dwelling)
# NB: technically you could merge these data frames later with `bind_cols()` but best
# datahandling practices dictate joining/comparing data based on keys (instead of
# binding columns together based upon the order in which tables are initially arranged.)
#[also preference is for column headers to be singular and lower case, and tables/dataframes to be uppercase and plural - from (or extension from principles in): https://google.github.io/styleguide/Rguide.xml]
## adist() calculates string distance b/w records in two data frames
# Matrix between all columns is great way to ascertain similarity of data
# on overall column to column basis.
# 0 is closest resemblance, higher numbers are lowest resemblance
ResultsInterColumnComparison <-
adist(Names1, Names2, partial = T)
ResultsInterColumnComparison
# firstName to firstName2 & lastName to LastName2 are similar columns.
# id firstName2 lastName2 dwelling
# id 0 2 2 2
# firstName 15 0 3 4
# lastName 15 3 0 5
# pet 15 5 4 3
# adist column to column DifferenceCount (using dplyr)
dltFrstN <- diag(adist(Names1$firstName, Names2$firstName2, partial = T))
dltLstN <- diag(adist(Names1$lastName, Names2$lastName2, partial = T))
# Join all info together
DFcompilation <-
data.frame(id, dltFrstN, firstName, firstName2, dltLstN, lastName, lastName2) %>%
arrange(desc(dltLstN), desc(dltLstN))
DFcompilation
# id dltFrstN firstName firstName2 dltLstN lastName lastName2
# 1 5 0 Elsa Elsa 4 Elizabeth Ray Elizabeth
# 2 1 0 Chris Chris 1 MacDougall MacDougal
# 3 3 0 Shintaro Shintaro 1 Yamazaki Yamazaku
# 4 2 0 Doug Doug 0 Shapiro Shapiro
# 5 4 0 Bubbles Bubbles 0 Murphy Murphy
这种方法更简单,所需的编码更简洁。我希望这对你的目的更有帮助。