Update by reference with rolling join

时间:2016-08-31 17:59:20

标签: r data.table

How do I update a data.table by reference using a rolling join?

Example

Here I build a table of commercials and a table of sales.

# commercials
commercials<-data.table(commercialID=c("C1","C2","C3","C4"), commercialDate=as.Date(c("2014-1-1","2014-4-1","2014-7-1","2014-9-15")))
commercials[, rollDate:=commercialDate] #Add a column, rollDate equal to commercialDate
setkey(commercials, "rollDate")

commercials
   commercialID commercialDate   rollDate
1:           C1     2014-01-01 2014-01-01
2:           C2     2014-04-01 2014-04-01
3:           C3     2014-07-01 2014-07-01
4:           C4     2014-09-15 2014-09-15


# sales (A sale before all commercials, a sale after commercial1 and a sale after commercial2)
sales <- data.table(saleID=c("S0", "S1", "S2"), saleDate=as.Date(c("2010-12-31", "2014-2-1", "2014-5-1")))
sales[, rollDate:=saleDate]
setkey(sales, "rollDate")

sales
   saleID   saleDate   rollDate
1:     S0 2010-12-31 2010-12-31
2:     S1 2014-02-01 2014-02-01
3:     S2 2014-05-01 2014-05-01

I'd like to associate each commercial with the sale immediately before it and the sale immediately after it (inclusive of a sale on the same date as the commercial).


Attemp1

commercials[sales, roll=-Inf, `:=`(saleIDBefore=saleID, saleDateBefore=saleDate)]
commercials[sales, roll=Inf, `:=`(saleIDAfter=saleID, saleDateAfter=saleDate)]

commercials
   commercialID commercialDate   rollDate saleIDBefore saleDateBefore saleIDAfter saleDateAfter
1:           C1     2014-01-01 2014-01-01           S0     2010-12-31          S1    2014-02-01
2:           C2     2014-04-01 2014-04-01           S1     2014-02-01          S2    2014-05-01
3:           C3     2014-07-01 2014-07-01           S2     2014-05-01          NA          <NA>
4:           C4     2014-09-15 2014-09-15           NA           <NA>          NA          <NA>

FAIL. Notice the NA under saleIDBefore.


Correct (but suboptimal) solution

commercials <- sales[commercials, roll=Inf]
setnames(commercials, c("saleID", "saleDate"), c("saleIDBefore", "saleDateBefore"))
commercials <- sales[commercials, roll=-Inf]
setnames(commercials, c("saleID", "saleDate"), c("saleIDAfter", "saleDateAfter"))

commercials
   saleIDAfter saleDateAfter   rollDate saleIDBefore saleDateBefore commercialID commercialDate
1:          S1    2014-02-01 2014-01-01           S0     2010-12-31           C1     2014-01-01
2:          S2    2014-05-01 2014-04-01           S1     2014-02-01           C2     2014-04-01
3:          NA          <NA> 2014-07-01           S2     2014-05-01           C3     2014-07-01
4:          NA          <NA> 2014-09-15           S2     2014-05-01           C4     2014-09-15

But I'd like to do this using update by reference := if it's possible.

1 个答案:

答案 0 :(得分:5)

Use non-equi joins:

commercials[sales, on = .(rollDate >= rollDate),
              `:=`(saleIDBefore=saleID, saleDateBefore=saleDate)][
            sales, on = .(rollDate <= rollDate), mult = 'last',
              `:=`(saleIDAfter=saleID, saleDateAfter=saleDate)][]
#   commercialID commercialDate   rollDate saleIDBefore saleDateBefore saleIDAfter saleDateAfter
#1:           C1     2014-01-01 2014-01-01           S0     2010-12-31          S1    2014-02-01
#2:           C2     2014-04-01 2014-04-01           S1     2014-02-01          S2    2014-05-01
#3:           C3     2014-07-01 2014-07-01           S2     2014-05-01          NA          <NA>
#4:           C4     2014-09-15 2014-09-15           S2     2014-05-01          NA          <NA>