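Finding matches across multiple tables: using data.table

Asked: 2017-04-27 14:23:24

Tags: r join data.table left-join match

There is probably a simple solution to this, but I can't seem to work it out.

For example, suppose I have a table listing purchases and customer details: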

library(data.table)
purchase <- setDT(structure(list(Name = c("John", "John", "Mary"), Surname = c("Smith", 
"Smith", "Jane"), PurchaseDate = c("2017-01-01", "2015-01-01", 
"2017-01-02")), .Names = c("Name", "Surname", "PurchaseDate"), row.names = c(NA, 
-3L), class = c("data.table", "data.frame")))

> purchase
   Name Surname PurchaseDate
1: John   Smith   2017-01-01
2: John   Smith   2015-01-01
3: Mary    Jane   2017-01-02

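I want to know whether these customers held a valid discount card at the time of purchase. The card data has to be matched against two databases: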

df1 <- setDT(structure(list(Name = "John", Surname = "Smith", ValidFrom = "2016-12-31", 
    ValidTo = "2017-01-02"), .Names = c("Name", "Surname", "ValidFrom", 
"ValidTo"), row.names = c(NA, -1L), class = c("data.table", "data.frame")))

df2 <- setDT(structure(list(Name = "Mary", Surname = "Jane", ValidFrom = "2017-01-01", 
    ValidTo = "2017-01-03"), .Names = c("Name", "Surname", "ValidFrom", 
"ValidTo"), row.names = c(NA, -1L), class = c("data.table", "data.frame")))

> df1
   Name Surname  ValidFrom    ValidTo
1: John   Smith 2016-12-31 2017-01-02
> df2
   Name Surname  ValidFrom    ValidTo
1: Mary    Jane 2017-01-01 2017-01-03

I am adapting this solution, which uses data.table:
library(data.table)
purchase[df1, on=c(Name='Name', Surname='Surname'), Match := 'Yes']
purchase[df2, on=c(Name='Name', Surname='Surname'), Match := 'Yes']

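The result (based on a left join) is saved to the Match variable in the original purchase table. (Importantly, this does not require creating a new object; the result is saved to the original object, which keeps things from getting messy.)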

> purchase
   Name Surname PurchaseDate Match
1: John   Smith   2017-01-01   Yes
2: John   Smith   2015-01-01   Yes
3: Mary    Jane   2017-01-02   Yes

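However, I also need to check whether PurchaseDate falls between the ValidFrom and ValidTo dates, and I don't quite understand how to do that.

To do this, I can bring the ValidFrom and ValidTo dates into the join, and then use ifelse to determine whether the purchase falls between those dates.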

purchase[df1, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]
purchase[df2, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]

Great! This brings in the dates:

   Name Surname PurchaseDate Match      VFrom        VTo
1: John   Smith   2017-01-01   Yes 2016-12-31 2017-01-02
2: John   Smith   2015-01-01   Yes 2016-12-31 2017-01-02
3: Mary    Jane   2017-01-02   Yes 2017-01-01 2017-01-03
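The ifelse check mentioned above could look roughly like the sketch below. It rebuilds the joined table by hand so that it runs on its own; the name `joined` is a placeholder, and converting the dates with as.IDate is an assumption made here so that the comparison works on real dates rather than on strings.

```r
library(data.table)

# Hand-built copy of the joined table shown above (hypothetical name: joined)
joined <- data.table(
  Name         = c("John", "John", "Mary"),
  Surname      = c("Smith", "Smith", "Jane"),
  PurchaseDate = as.IDate(c("2017-01-01", "2015-01-01", "2017-01-02")),
  VFrom        = as.IDate(c("2016-12-31", "2016-12-31", "2017-01-01")),
  VTo          = as.IDate(c("2017-01-02", "2017-01-02", "2017-01-03"))
)

# Flag each purchase by whether it falls inside the card's validity window
joined[, Match := ifelse(PurchaseDate >= VFrom & PurchaseDate <= VTo, "Yes", "No")]
print(joined)
```

This only works when each customer has a single card, which is the limitation the question goes on to describe.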

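However, a problem arises when a customer has two discount cards and the purchase falls within the validity period of only one of them. Suppose Mary has two cards: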

df2 <- setDT(structure(list(Name = structure(c(1L, 1L), .Label = "Mary", class = "factor"), 
    Surname = structure(c(1L, 1L), .Label = "Jane", class = "factor"), 
    ValidFrom = structure(1:2, .Label = c("2017-01-01", "1945-01-01"
    ), class = "factor"), ValidTo = structure(1:2, .Label = c("2017-01-03", 
    "1946-01-01"), class = "factor")), .Names = c("Name", "Surname", 
"ValidFrom", "ValidTo"), row.names = c(NA, -2L), class = c("data.table", "data.frame")))

> df2
   Name Surname  ValidFrom    ValidTo
1: Mary    Jane 2017-01-01 2017-01-03
2: Mary    Jane 1945-01-01 1946-01-01

Running this

purchase[df2, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]

gives only one of the two pairs of dates (here the 1945 to 1946 pair, apparently regardless of row order):

   Name Surname PurchaseDate Match      VFrom        VTo
1: John   Smith   2017-01-01   Yes 2016-12-31 2017-01-02
2: John   Smith   2015-01-01   Yes 2016-12-31 2017-01-02
3: Mary    Jane   2017-01-02   Yes 1945-01-01 1946-01-01

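How do I bring in all the matching rows?

From what I have learned, the X[Y] syntax supports updating the original object (which I need) and the := function (which I also need), but it does not support full joins. The alternative, merge, supports full joins but requires creating a new object at each join step (which would get very messy) and does not support :=. Any ideas? Is there some way to use foverlaps?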
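For illustration, a minimal sketch of the merge behaviour described above, using hand-built copies of purchase and the two-card df2. The result name `all_cards` is a placeholder. It shows that merge does return every matching card row, but as a new object rather than as an in-place update.

```r
library(data.table)

purchase <- data.table(
  Name         = c("John", "John", "Mary"),
  Surname      = c("Smith", "Smith", "Jane"),
  PurchaseDate = c("2017-01-01", "2015-01-01", "2017-01-02")
)
df2 <- data.table(
  Name      = c("Mary", "Mary"),
  Surname   = c("Jane", "Jane"),
  ValidFrom = c("2017-01-01", "1945-01-01"),
  ValidTo   = c("2017-01-03", "1946-01-01")
)

# Left join via merge: Mary's purchase is repeated once per card,
# and the result is a separate table (hypothetical name: all_cards)
all_cards <- merge(purchase, df2, by = c("Name", "Surname"), all.x = TRUE)
print(all_cards)
```

The purchase table itself is unchanged after this call, which is the drawback the question points out.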

1 Answer:

Answer 0: (score: 2)

Here is one way to approach it:

# clean data
purchase[, PurchaseDate := as.IDate(PurchaseDate)]
df1[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]
df2[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]

# initialize
purchase[, matched := FALSE ]

# update joins
purchase[!(matched), matched := 
  df1[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate), 
    .N, by=.EACHI ]$N > 0L
]
purchase[!(matched), matched := 
  df2[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate), 
    .N, by=.EACHI ]$N > 0L
]

I keep df1 and df2 separate because the OP mentioned that their join rules differ in the real use case.

How it works

The overall structure is...

DT[, matched := FALSE ]
DT[!(matched), matched := expr1 ]
DT[!(matched), matched := expr2 ]

So we initialize matched to FALSE and, at each subsequent step, update only the rows that are still unmatched, !(matched).

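The expressions start with DT2[.SD, ...], which is simply a join on .SD, the subset of the data that remains after filtering with !(matched). Such a join looks up each row of .SD in DT2 according to the on= conditions. In this case, the on= conditions define a non-equi join.***

When we use by=.EACHI, we group by each row of .SD. With .N, by=.EACHI, we get the number of rows of DT2 matched by each row of .SD.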

Once we have the number of matching rows, we can test N > 0L and use the result to update matched.
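To make the counting step concrete, here is a small sketch that runs the non-equi join directly against the whole purchase table rather than .SD. The tables are rebuilt by hand with IDate columns, and the name `counts` is a placeholder.

```r
library(data.table)

purchase <- data.table(
  Name         = c("John", "John", "Mary"),
  Surname      = c("Smith", "Smith", "Jane"),
  PurchaseDate = as.IDate(c("2017-01-01", "2015-01-01", "2017-01-02"))
)
df2 <- data.table(
  Name      = c("Mary", "Mary"),
  Surname   = c("Jane", "Jane"),
  ValidFrom = as.IDate(c("2017-01-01", "1945-01-01")),
  ValidTo   = as.IDate(c("2017-01-03", "1946-01-01"))
)

# One output row per purchase row; N is the number of df2 cards whose
# validity window contains that purchase date
counts <- df2[purchase,
  on = .(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate),
  .N, by = .EACHI]
print(counts)

# A purchase is covered when at least one card matched
print(counts$N > 0L)
```

Because the result has exactly one row per purchase row, the logical vector lines up with the rows being updated, even when a customer has several cards.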

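*** Unfortunately, as of April 2017 there's an open bug that sometimes makes this usage pattern fail with an error involving .SD. The workaround is to replace .SD with copy(.SD).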
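With the workaround applied, the update joins might be written as in the sketch below. It uses hand-built tables, including the two-card version of df2 from the question, so that it runs on its own. Whether copy(.SD) is still needed depends on the data.table version in use.

```r
library(data.table)

purchase <- data.table(
  Name         = c("John", "John", "Mary"),
  Surname      = c("Smith", "Smith", "Jane"),
  PurchaseDate = as.IDate(c("2017-01-01", "2015-01-01", "2017-01-02"))
)
df1 <- data.table(
  Name = "John", Surname = "Smith",
  ValidFrom = as.IDate("2016-12-31"), ValidTo = as.IDate("2017-01-02")
)
df2 <- data.table(
  Name      = c("Mary", "Mary"),
  Surname   = c("Jane", "Jane"),
  ValidFrom = as.IDate(c("2017-01-01", "1945-01-01")),
  ValidTo   = as.IDate(c("2017-01-03", "1946-01-01"))
)

purchase[, matched := FALSE]

# Same update joins as in the answer, with copy(.SD) in place of .SD
purchase[!(matched), matched :=
  df1[copy(.SD), on = .(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate),
    .N, by = .EACHI]$N > 0L
]
purchase[!(matched), matched :=
  df2[copy(.SD), on = .(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate),
    .N, by = .EACHI]$N > 0L
]
print(purchase)
```

The copy is made only of the rows that are still unmatched, so the extra memory cost shrinks with each step.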