Match multiple columns in one data frame against either of two data frames under multiple conditions

Asked: 2017-04-21 15:11:48

Tags: r conditional match

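In this toy example (the real data is large), I have a data frame of individuals' records, and I need to find their matches in either of two databases.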

df <- structure(list(FORENAME = c("GUY", "JULIANA", "BENN"), SURNAME = c("WEEKS", 
"BAIN", "SARAH"), DOB = c("07/01/1985", "20/06/1967", "04/09/1985"
)), .Names = c("FORENAME", "SURNAME", "DOB"), row.names = c(NA, 
-3L), class = "data.frame")

> df
  FORENAME SURNAME        DOB
1      GUY   WEEKS 07/01/1985
2  JULIANA    BAIN 20/06/1967
3     BENN   SARAH 04/09/1985

Database 1

db1 <- structure(list(FORENAME = c("GUY", "SARAH", "REBECCA"), SURNAME = c("WEEKS", 
"BENN", "SYMES"), DOB = c("07/01/1985", "04/09/1985", "10/07/1990"
)), row.names = c(NA, -3L), class = "data.frame", .Names = c("FORENAME", 
"SURNAME", "DOB"))

> db1
  FORENAME SURNAME        DOB
1      GUY   WEEKS 07/01/1985
2    SARAH    BENN 04/09/1985
3  REBECCA   SYMES 10/07/1990

Database 2

db2 <- structure(list(FORENAME = c("NAILA", "JOANNE", "JULIANA"), SURNAME = c("KHAN", 
"WHITEHEAD", "BAIN"), DOB = c("06/01/1957", "24/08/1970", "20/06/1967"
)), row.names = c(NA, -3L), class = "data.frame", .Names = c("FORENAME", 
"SURNAME", "DOB"))

> db2
  FORENAME   SURNAME        DOB
1    NAILA      KHAN 06/01/1957
2   JOANNE WHITEHEAD 24/08/1970
3  JULIANA      BAIN 20/06/1967

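For the sake of example, I wish to apply two matching criteria:

  1. All three fields are equal
  2. FORENAME = SURNAME, SURNAME = FORENAME, DOB = DOB (to catch cases where the name order has been switched)
  3. ...and to look for a match in either database, something like: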

    ((df$FORENAME == db1$FORENAME | db2$FORENAME) & 
    (df$SURNAME == db1$SURNAME | db2$SURNAME) &
    (df$DOB == db1$DOB | db2$DOB))
    |
    ((df$FORENAME == db1$SURNAME | db2$SURNAME) &
    (df$SURNAME == db1$FORENAME | db2$FORENAME) &
    (df$DOB == db1$DOB | db2$DOB))
    

I want to keep the original df intact and create a column for the result, like this:

      FORENAME SURNAME        DOB RESULT
    1      GUY   WEEKS 07/01/1985  MATCH
    2  JULIANA    BAIN 20/06/1967  MATCH
    3     BENN   SARAH 04/09/1985  MATCH
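
For the toy data, the two criteria can be sketched in base R by collapsing each row into a single key and testing membership with `%in%`. This is only an illustration under assumptions that are not part of the question: `make_key` is a hypothetical helper, `"|"` is assumed never to occur in the data, and `"NO MATCH"` is an assumed label for rows without a match.

```r
# Toy data from the question (all columns kept as character strings).
df <- data.frame(FORENAME = c("GUY", "JULIANA", "BENN"),
                 SURNAME  = c("WEEKS", "BAIN", "SARAH"),
                 DOB      = c("07/01/1985", "20/06/1967", "04/09/1985"),
                 stringsAsFactors = FALSE)
db1 <- data.frame(FORENAME = c("GUY", "SARAH", "REBECCA"),
                  SURNAME  = c("WEEKS", "BENN", "SYMES"),
                  DOB      = c("07/01/1985", "04/09/1985", "10/07/1990"),
                  stringsAsFactors = FALSE)
db2 <- data.frame(FORENAME = c("NAILA", "JOANNE", "JULIANA"),
                  SURNAME  = c("KHAN", "WHITEHEAD", "BAIN"),
                  DOB      = c("06/01/1957", "24/08/1970", "20/06/1967"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: one key per row, built from the listed columns in order.
# Assumes "|" never appears inside any field.
make_key <- function(d, cols) do.call(paste, c(unname(as.list(d[cols])), sep = "|"))

# "Either database" is handled by stacking both databases into one table.
db <- rbind(db1, db2)

df_key  <- make_key(df, c("FORENAME", "SURNAME", "DOB"))
exact   <- df_key %in% make_key(db, c("FORENAME", "SURNAME", "DOB"))  # criterion 1
swapped <- df_key %in% make_key(db, c("SURNAME", "FORENAME", "DOB"))  # criterion 2

# Copy df so the original stays intact; "NO MATCH" is an assumed label.
result <- df
result$RESULT <- ifelse(exact | swapped, "MATCH", "NO MATCH")
```

Each additional criterion would add one more `%in%` test against a differently ordered key; whether this is fast enough on hundreds of thousands of rows has not been measured here.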
    

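What would be the way to do this, given that the actual data is hundreds of thousands of rows long and the matching will involve c. 8 columns and c. 15 criteria?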

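Since this doesn't require any fuzzy matching, the one inelegant solution I can think of is to do a series of dplyr inner_joins followed by rbind to put everything into one big table, then unique(df_matched), and finally left_join(df, df_matched, by=c(...list of columns)) to create the RESULT column in the original df.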
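
The join-based idea described above might look roughly like the sketch below on the toy data. It assumes the dplyr package is installed; the `criteria` list and the names `matches` and `result` are hypothetical and only illustrate how each criterion could be expressed as a `by` mapping from df columns to database columns.

```r
library(dplyr)

# Toy data from the question (all columns kept as character strings).
df <- data.frame(FORENAME = c("GUY", "JULIANA", "BENN"),
                 SURNAME  = c("WEEKS", "BAIN", "SARAH"),
                 DOB      = c("07/01/1985", "20/06/1967", "04/09/1985"),
                 stringsAsFactors = FALSE)
db1 <- data.frame(FORENAME = c("GUY", "SARAH", "REBECCA"),
                  SURNAME  = c("WEEKS", "BENN", "SYMES"),
                  DOB      = c("07/01/1985", "04/09/1985", "10/07/1990"),
                  stringsAsFactors = FALSE)
db2 <- data.frame(FORENAME = c("NAILA", "JOANNE", "JULIANA"),
                  SURNAME  = c("KHAN", "WHITEHEAD", "BAIN"),
                  DOB      = c("06/01/1957", "24/08/1970", "20/06/1967"),
                  stringsAsFactors = FALSE)

# Stack both databases so a hit in either one counts.
db <- rbind(db1, db2)

# One entry per criterion: names are df columns, values are db columns.
criteria <- list(
  exact   = c(FORENAME = "FORENAME", SURNAME = "SURNAME",  DOB = "DOB"),
  swapped = c(FORENAME = "SURNAME",  SURNAME = "FORENAME", DOB = "DOB")
)

# One inner_join per criterion; only df's key columns are kept in each result.
matches <- lapply(criteria, function(by) inner_join(df, db, by = by))

# rbind the per-criterion matches and drop duplicate rows.
df_matched <- unique(do.call(rbind, matches))
df_matched$RESULT <- "MATCH"

# Attach RESULT to a copy of df; rows without any match would get NA.
result <- left_join(df, df_matched, by = c("FORENAME", "SURNAME", "DOB"))
```

With around 15 criteria, only the `criteria` list would grow. The intermediate tables can become large, so memory use on the real data would need checking.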

0 answers:

No answers