R data.table: comparing dates between two tables and counting records

Date: 2014-05-23 08:51:17

Tags: r data.table

I have two data tables, a and b:

a = structure(list(id = c(86246, 86252, 12262064), brand = c(3718L, 
13474L, 17286L), offerdate = structure(c(15454, 15791, 15883), class = "Date")), .Names = c("id", 
"brand", "offerdate"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

b = structure(list(id = c(86246, 86246, 86246), brand = c(3718, 3718, 
875), date = structure(c(15408, 15430, 15434), class = "Date")), .Names = c("id", 
"brand", "date"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

> a
         id brand  offerdate
1:    86246  3718 2012-04-24
2:    86252 13474 2013-03-27
3: 12262064 17286 2013-06-27
> b
      id brand       date
1: 86246  3718 2012-03-09
2: 86246  3718 2012-03-31
3: 86246   875 2012-04-04

Now, for each id in a, I would like to count the rows in b that have the same id and brand and whose date is less than 30 days before a.offerdate.

The result I would like is an updated a:

> a
         id brand  offerdate  nbTrans_last_30_days
1:    86246  3718 2012-04-24                     1
2:    86252 13474 2013-03-27                     0
3: 12262064 17286 2013-06-27                     0

I can get the job done with subset, but I am looking for a fast solution. The subset version would be (for each row of a):

subset(b, (id == 86246) & (brand == 3718) & (date > as.Date("2012-03-24")) )

where the date depends on a.offerdate.
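For reference, the row-by-row subset approach described above can be sketched as follows. This is only a slow baseline under stated assumptions: the sample tables are rebuilt with data.table() rather than from the dput output, and count_recent is a hypothetical helper name, not something from the original post. The cutoff is taken as offerdate minus 30 days, with no upper bound on date, as in the subset call above.

```r
library(data.table)

# Sample data from the question, rebuilt without the dput pointer attribute.
a <- data.table(
  id        = c(86246, 86252, 12262064),
  brand     = c(3718L, 13474L, 17286L),
  offerdate = as.Date(c("2012-04-24", "2013-03-27", "2013-06-27"))
)
b <- data.table(
  id    = c(86246, 86246, 86246),
  brand = c(3718, 3718, 875),
  date  = as.Date(c("2012-03-09", "2012-03-31", "2012-04-04"))
)

# Hypothetical helper: one subset() call per row of a.
count_recent <- function(i) {
  cur_id    <- a$id[i]
  cur_brand <- a$brand[i]
  cutoff    <- a$offerdate[i] - 30   # 30 days before the offer date
  nrow(subset(b, id == cur_id & brand == cur_brand & date >= cutoff))
}

# One count per row of a, in the row order of a.
nbTrans_last_30_days <- vapply(seq_len(nrow(a)), count_recent, integer(1))
```

This calls subset once per row of a, which is why it scales poorly on large tables.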

I managed to count the total number of rows in b:

> setkey(a,id, brand)
> setkey(b,id, brand)
> a = a[b[a, .N]]
> setnames(a, "N", "nbTrans")
> a
         id brand  offerdate nbTrans
1:    86246  3718 2012-04-24       2
2:    86252 13474 2013-03-27       0
3: 12262064 17286 2013-06-27       0

But I don't know how to handle the date comparison between the two tables.
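A side note on the total count above: the b[a, .N] idiom relies on the implicit per-row grouping of data.table versions from that time. On more recent releases the grouping has to be requested with by = .EACHI. The sketch below assumes such a release and rebuilds the sample tables with data.table(); the name counts is illustrative only.

```r
library(data.table)

# Sample data from the question, rebuilt without the dput pointer attribute.
a <- data.table(
  id        = c(86246, 86252, 12262064),
  brand     = c(3718L, 13474L, 17286L),
  offerdate = as.Date(c("2012-04-24", "2013-03-27", "2013-06-27"))
)
b <- data.table(
  id    = c(86246, 86246, 86246),
  brand = c(3718, 3718, 875),
  date  = as.Date(c("2012-03-09", "2012-03-31", "2012-04-04"))
)

# For each row of a, count the rows of b with the same id and brand.
# by = .EACHI evaluates .N once per row of a; on = names the join columns,
# so no setkey() call is needed here.
counts <- b[a, on = c("id", "brand"), .N, by = .EACHI]

# counts has one row per row of a, in the same order, so N can be attached directly.
a[, nbTrans := counts$N]
```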


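The answer below works on the original small dataset, but for some reason it does not work on my real data. I tried to reproduce the problem with two new variables, a2 and b2: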

a2=structure(list(id = c(86246, 86252, 12262064), brand = structure(c(3L, 
9L, 12L), .Label = c("875", "1322", "3718", "4294", "5072", "6732", 
"6926", "7668", "13474", "13791", "15889", "17286", "17311", 
"26189", "26456", "28840", "64486", "93904", "102504"), class = "factor"), 
    offerdate = structure(c(15819, 15791, 15883), class = "Date")), .Names = c("id", 
"brand", "offerdate"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

b2=structure(list(id = c(86246, 86246, 86246, 86246, 86246, 86246, 
86246, 86246), brand = c(3718L, 3718L, 3718L, 3718L, 3718L, 3718L, 
3718L, 3718L), date = structure(c(15423, 15724, 15752, 15767, 
15782, 15786, 15788, 15811), class = "Date")), .Names = c("id", 
"brand", "date"), sorted = c("id", "brand"), class = c("data.table", 
"data.frame"))

> setkey(a2,id,brand)
> setkey(b2,id,brand)
> merge(a2, b2, all.x = TRUE, allow.cartesian = TRUE)
         id brand  offerdate date
1:    86246  3718 2013-04-24 <NA>
2:    86252 13474 2013-03-27 <NA>
3: 12262064 17286 2013-06-27 <NA>

The problem is that the merge does not keep the b2.date information.

1 answer:

Answer 0 (score: 2)

The trick is to use the allow.cartesian argument of merge:

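setkey(a, id, brand)
setkey(b, id, brand)

# Left join on the keys: every row of a is kept, paired with each matching row of b.
# (Named merged rather than c to avoid masking base::c.)
merged <- merge(a, b, all.x = TRUE, allow.cartesian = TRUE)

# TRUE when the transaction is at most 30 days before the offer date.
# Note: rows dated after offerdate also pass this test; add date <= offerdate if that matters.
merged[, Trans := (offerdate - date) <= 30]

# Rows of a without a match have an NA date, which na.rm = TRUE counts as 0.
merged[, list(nbTrans_last_30_days = sum(Trans, na.rm = TRUE)),
  keyby = list(id, brand, offerdate)]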
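
On the a2/b2 update in the question: the dput output shows a2$brand stored as a factor while b2$brand is an integer, which may be why the join finds no matching rows. The sketch below assumes that type mismatch is the cause and converts the factor to integer before merging. The tables are rebuilt with data.table() and the factor carries only the three levels in use (the question's factor has 19); the names m and res are illustrative.

```r
library(data.table)

# a2 and b2 rebuilt from the question's values; brand is a factor in a2 only.
a2 <- data.table(
  id        = c(86246, 86252, 12262064),
  brand     = factor(c("3718", "13474", "17286")),
  offerdate = as.Date(c(15819, 15791, 15883), origin = "1970-01-01")
)
b2 <- data.table(
  id    = rep(86246, 8),
  brand = rep(3718L, 8),
  date  = as.Date(c(15423, 15724, 15752, 15767, 15782, 15786, 15788, 15811),
                  origin = "1970-01-01")
)

# Go through character first: as.integer() on a factor returns the level codes,
# not the brand numbers.
a2[, brand := as.integer(as.character(brand))]

# Same steps as the answer, with the join columns named explicitly.
m <- merge(a2, b2, by = c("id", "brand"), all.x = TRUE, allow.cartesian = TRUE)
m[, Trans := as.integer(offerdate - date) <= 30]
res <- m[, list(nbTrans_last_30_days = sum(Trans, na.rm = TRUE)),
         keyby = list(id, brand, offerdate)]
```

With the column types aligned, the merged table keeps the dates from b2, so the 30-day test has something to compare against.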