In this toy example (the real data is large), I have a data frame of person records, and I need to find their matches in two databases.
df <- structure(list(FORENAME = c("GUY", "JULIANA", "BENN"), SURNAME = c("WEEKS",
"BAIN", "SARAH"), DOB = c("07/01/1985", "20/06/1967", "04/09/1985"
)), .Names = c("FORENAME", "SURNAME", "DOB"), row.names = c(NA,
-3L), class = "data.frame")
> df
FORENAME SURNAME DOB
1 GUY WEEKS 07/01/1985
2 JULIANA BAIN 20/06/1967
3 BENN SARAH 04/09/1985
Database 1
db1 <- structure(list(FORENAME = c("GUY", "SARAH", "REBECCA"), SURNAME = c("WEEKS",
"BENN", "SYMES"), DOB = c("07/01/1985", "04/09/1985", "10/07/1990"
)), row.names = c(NA, -3L), class = "data.frame", .Names = c("FORENAME",
"SURNAME", "DOB"))
> db1
FORENAME SURNAME DOB
1 GUY WEEKS 07/01/1985
2 SARAH BENN 04/09/1985
3 REBECCA SYMES 10/07/1990
Database 2
db2 <- structure(list(FORENAME = c("NAILA", "JOANNE", "JULIANA"), SURNAME = c("KHAN",
"WHITEHEAD", "BAIN"), DOB = c("06/01/1957", "24/08/1970", "20/06/1967"
)), row.names = c(NA, -3L), class = "data.frame", .Names = c("FORENAME",
"SURNAME", "DOB"))
> db2
FORENAME SURNAME DOB
1 NAILA KHAN 06/01/1957
2 JOANNE WHITEHEAD 24/08/1970
3 JULIANA BAIN 20/06/1967
For the sake of the example, I want to apply two matching criteria:
...and look for matches in either database, e.g.:
((df$FORENAME == db1$FORENAME | db2$FORENAME) &
(df$SURNAME == db1$SURNAME | db2$SURNAME) &
(df$DOB == db1$DOB | db2$DOB))
|
((df$FORENAME == db1$SURNAME | db2$SURNAME) &
(df$SURNAME == db1$FORENAME | db2$FORENAME) &
(df$DOB == db1$DOB | db2$DOB))
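Read literally, the expression above would compare rows position by position, which is not the intent. A minimal sketch of what is meant, assuming exact string equality and treating both databases as one pool of records; `make_key` is a hypothetical helper written for this illustration, not a function from any package:

```r
# Toy data from the question, rebuilt with data.frame() for brevity.
df  <- data.frame(FORENAME = c("GUY", "JULIANA", "BENN"),
                  SURNAME  = c("WEEKS", "BAIN", "SARAH"),
                  DOB      = c("07/01/1985", "20/06/1967", "04/09/1985"),
                  stringsAsFactors = FALSE)
db1 <- data.frame(FORENAME = c("GUY", "SARAH", "REBECCA"),
                  SURNAME  = c("WEEKS", "BENN", "SYMES"),
                  DOB      = c("07/01/1985", "04/09/1985", "10/07/1990"),
                  stringsAsFactors = FALSE)
db2 <- data.frame(FORENAME = c("NAILA", "JOANNE", "JULIANA"),
                  SURNAME  = c("KHAN", "WHITEHEAD", "BAIN"),
                  DOB      = c("06/01/1957", "24/08/1970", "20/06/1967"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: one string key per row, built from the given columns
# in the given order. Assumes "|" never occurs inside a field.
make_key <- function(d, cols) do.call(paste, c(as.list(d[cols]), sep = "|"))

# Pool both databases, since a match in either one counts.
db      <- rbind(db1, db2)
db_keys <- make_key(db, c("FORENAME", "SURNAME", "DOB"))

# Criterion 1: forename, surname and DOB all agree.
direct  <- make_key(df, c("FORENAME", "SURNAME", "DOB")) %in% db_keys

# Criterion 2: forename and surname are swapped, DOB agrees.
swapped <- make_key(df, c("SURNAME", "FORENAME", "DOB")) %in% db_keys

matched <- direct | swapped
```

Each criterion becomes one set-membership test over whole columns, so further criteria would each add one more `make_key(...) %in% db_keys` line rather than a nested condition.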
I would like to keep the original df intact and create a column for the result, like this:
FORENAME SURNAME DOB RESULT
1 GUY WEEKS 07/01/1985 MATCH
2 JULIANA BAIN 20/06/1967 MATCH
3 BENN SARAH 04/09/1985 MATCH
What would be the way to do this if the real data is hundreds of thousands of rows long and will involve matching on c. 8 columns and c. 15 criteria?
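Since this does not require any fuzzy matching, one inelegant solution I can think of is to do a series of dplyr inner_joins followed by rbind to put everything into one big table, then unique(df_matched), and finally left_join(df, df_matched, by=c(...list of columns)) to create the RESULT column in the original df.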
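For concreteness, the join-based idea described above might look roughly like the following on the toy data. This is only a sketch under stated assumptions: the two databases are stacked into one table first, the intermediate names `m1` and `m2` are made up for illustration, and the swapped-name criterion is expressed through a named `by` vector.

```r
library(dplyr)

# Toy data from the question, rebuilt with data.frame() for brevity.
df  <- data.frame(FORENAME = c("GUY", "JULIANA", "BENN"),
                  SURNAME  = c("WEEKS", "BAIN", "SARAH"),
                  DOB      = c("07/01/1985", "20/06/1967", "04/09/1985"),
                  stringsAsFactors = FALSE)
db1 <- data.frame(FORENAME = c("GUY", "SARAH", "REBECCA"),
                  SURNAME  = c("WEEKS", "BENN", "SYMES"),
                  DOB      = c("07/01/1985", "04/09/1985", "10/07/1990"),
                  stringsAsFactors = FALSE)
db2 <- data.frame(FORENAME = c("NAILA", "JOANNE", "JULIANA"),
                  SURNAME  = c("KHAN", "WHITEHEAD", "BAIN"),
                  DOB      = c("06/01/1957", "24/08/1970", "20/06/1967"),
                  stringsAsFactors = FALSE)

# A match in either database counts, so stack them once up front.
db <- rbind(db1, db2)

# Criterion 1: all three columns agree.
m1 <- inner_join(df, db, by = c("FORENAME", "SURNAME", "DOB"))

# Criterion 2: names swapped, DOB agrees. The named vector maps
# columns of df (left-hand names) to columns of db (right-hand values).
m2 <- inner_join(df, db, by = c("FORENAME" = "SURNAME",
                                "SURNAME"  = "FORENAME",
                                "DOB"      = "DOB"))

# Stack the per-criterion matches and drop duplicates.
df_matched <- unique(rbind(m1, m2))
df_matched$RESULT <- "MATCH"

# Attach the flag to a copy; df itself is left untouched.
# Rows without any match would get NA in RESULT.
result <- left_join(df, df_matched, by = c("FORENAME", "SURNAME", "DOB"))
```

With c. 15 criteria this becomes 15 `inner_join` calls, one per `by` specification, which is the inelegance referred to above.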