How to merge two data.tables under two conditions

Time: 2019-03-24 15:44:09

Tags: r merge data.table

I want to merge two tables, dt_program and dt_sale, using the common keys CH and ITEM_ID to look up START and END, under the following conditions:

  1. ORDER_TIME must fall between START and END.
  2. ORDER_TIME may also occur after END (in that case, use the END closest to ORDER_TIME).

The data:

The program schedule table lists the programs shown on each channel:

library(data.table)

dt_program <- structure(list(CH = c("CH1", "CH1", "CH1", "CH1", "CH1", "CH2", 
        "CH2", "CH2", "CH3", "CH3", "CH3", "CH3"), ITEM_ID = c(110, 111, 
        110, 111, 110, 110, 111, 112, 114, 113, 110, 112), START = structure(c(1514791800, 
        1514799000, 1514806200, 1514813400, 1514820600, 1518602400, 1518609600, 
        1518616800.005, 1517560200, 1517565600, 1517570999.995, 1517576399.995
        ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), END = structure(c(1514795400, 
        1514802600, 1514809800.005, 1514817000.01, 1514824200.015, 1518604200, 
        1518611400, 1518618600, 1517563800, 1517569200, 1517574600, 1517580000
        ), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
        -12L), class = c("data.table", "data.frame"))

which prints as:

     CH ITEM_ID               START                 END
 1: CH1     110 2018-01-01 07:30:00 2018-01-01 08:30:00
 2: CH1     111 2018-01-01 09:30:00 2018-01-01 10:30:00
 3: CH1     110 2018-01-01 11:30:00 2018-01-01 12:30:00
 4: CH1     111 2018-01-01 13:30:00 2018-01-01 14:30:00
 5: CH1     110 2018-01-01 15:30:00 2018-01-01 16:30:00
 6: CH2     110 2018-02-14 10:00:00 2018-02-14 10:30:00
 7: CH2     111 2018-02-14 12:00:00 2018-02-14 12:30:00
 8: CH2     112 2018-02-14 14:00:00 2018-02-14 14:30:00
 9: CH3     114 2018-02-02 08:30:00 2018-02-02 09:30:00
10: CH3     113 2018-02-02 10:00:00 2018-02-02 11:00:00
11: CH3     110 2018-02-02 11:29:59 2018-02-02 12:30:00
12: CH3     112 2018-02-02 12:59:59 2018-02-02 14:00:00

In addition, I have a sales transaction table that records each purchase a customer makes:

dt_sale <- structure(list(CUST_ID = c("A001", "A001", "A001", "A002", "A002", 
"A003"), CH = c("CH1", "CH3", "CH2", "CH2", "CH3", "CH1"), ORDER_TIME = structure(c(1514793600, 
1514813400, 1518619200, 1514816100, 1517565600, 1514803200), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), ITEM_ID = c(110, 110, 112, 112, 114, 
111)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))

My expected output:

Could you please suggest an approach?

1 answer:

Answer 0 (score: 2)

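The output shown in the question does not match the description at the start of the question: rows 2 and 4 should not contain values for START and END.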
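One way to check this against the sample data is to compare each sale's ORDER_TIME with the earliest START scheduled for the same CH and ITEM_ID. The sketch below rebuilds only the columns it needs from the question's data; the names utc, prog, sale, first_start, chk, and HAS_EARLIER_PROGRAM are made up for illustration.

```r
library(data.table)

# Helper (illustrative): epoch seconds -> POSIXct in UTC
utc <- function(x) as.POSIXct(x, origin = "1970-01-01", tz = "UTC")

# CH, ITEM_ID and START copied from dt_program (END is not needed here)
prog <- data.table(
  CH      = c("CH1", "CH1", "CH1", "CH1", "CH1", "CH2",
              "CH2", "CH2", "CH3", "CH3", "CH3", "CH3"),
  ITEM_ID = c(110, 111, 110, 111, 110, 110, 111, 112, 114, 113, 110, 112),
  START   = utc(c(1514791800, 1514799000, 1514806200, 1514813400,
                  1514820600, 1518602400, 1518609600, 1518616800.005,
                  1517560200, 1517565600, 1517570999.995, 1517576399.995))
)

# Values copied from dt_sale
sale <- data.table(
  CUST_ID    = c("A001", "A001", "A001", "A002", "A002", "A003"),
  CH         = c("CH1", "CH3", "CH2", "CH2", "CH3", "CH1"),
  ORDER_TIME = utc(c(1514793600, 1514813400, 1518619200,
                     1514816100, 1517565600, 1514803200)),
  ITEM_ID    = c(110, 110, 112, 112, 114, 111)
)

# Earliest scheduled START per channel and item
first_start <- prog[, .(FIRST_START = min(START)), by = .(CH, ITEM_ID)]

# Attach it to every sale (row order follows sale)
chk <- first_start[sale, on = .(CH, ITEM_ID)]

# A sale can only satisfy either condition if some program started before it
chk[, HAS_EARLIER_PROGRAM := !is.na(FIRST_START) & ORDER_TIME >= FIRST_START]
chk[, .(CUST_ID, CH, ITEM_ID, ORDER_TIME, HAS_EARLIER_PROGRAM)]
```

On this data the flag should be FALSE only for sales rows 2 and 4: both orders are dated 2018-01-01, while the matching programs on CH3 and CH2 are scheduled in February 2018.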

A possible solution using a double join:

dt_sale[dt_program
        , on = .(CH, ITEM_ID, ORDER_TIME > START, ORDER_TIME < END)
        , `:=` (START = i.START, END = i.END)
        ][dt_program
          , on = .(CH, ITEM_ID, ORDER_TIME > END)
          , `:=` (START = i.START, END = i.END)][]

which gives:

> dt_sale
   CUST_ID  CH          ORDER_TIME ITEM_ID               START                 END
1:    A001 CH1 2018-01-01 08:00:00     110 2018-01-01 07:30:00 2018-01-01 08:30:00
2:    A001 CH3 2018-01-01 13:30:00     110                <NA>                <NA>
3:    A001 CH2 2018-02-14 14:40:00     112 2018-02-14 14:00:00 2018-02-14 14:30:00
4:    A002 CH2 2018-01-01 14:15:00     112                <NA>                <NA>
5:    A002 CH3 2018-02-02 10:00:00     114 2018-02-02 08:30:00 2018-02-02 09:30:00
6:    A003 CH1 2018-01-01 10:40:00     111 2018-01-01 09:30:00 2018-01-01 10:30:00
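A caveat on the double join, as far as I can tell: the second update also applies to a sale that already matched inside a window whenever an earlier program with the same CH and ITEM_ID has ended before it, so it can overwrite the first result. The sample data does not trigger this. An alternative sketch, which is not the answer's method, is a rolling join on START: it picks the latest program that started at or before ORDER_TIME. It assumes programs with the same CH and ITEM_ID never overlap, and it treats ORDER_TIME equal to START as a match (the answer uses strict inequalities). The names utc, prog, sale, res, and JOIN_TIME are illustrative.

```r
library(data.table)

# Helper (illustrative): epoch seconds -> POSIXct in UTC
utc <- function(x) as.POSIXct(x, origin = "1970-01-01", tz = "UTC")

# Values copied from dt_program
prog <- data.table(
  CH      = c("CH1", "CH1", "CH1", "CH1", "CH1", "CH2",
              "CH2", "CH2", "CH3", "CH3", "CH3", "CH3"),
  ITEM_ID = c(110, 111, 110, 111, 110, 110, 111, 112, 114, 113, 110, 112),
  START   = utc(c(1514791800, 1514799000, 1514806200, 1514813400,
                  1514820600, 1518602400, 1518609600, 1518616800.005,
                  1517560200, 1517565600, 1517570999.995, 1517576399.995)),
  END     = utc(c(1514795400, 1514802600, 1514809800.005, 1514817000.01,
                  1514824200.015, 1518604200, 1518611400, 1518618600,
                  1517563800, 1517569200, 1517574600, 1517580000))
)

# Values copied from dt_sale
sale <- data.table(
  CUST_ID    = c("A001", "A001", "A001", "A002", "A002", "A003"),
  CH         = c("CH1", "CH3", "CH2", "CH2", "CH3", "CH1"),
  ORDER_TIME = utc(c(1514793600, 1514813400, 1518619200,
                     1514816100, 1517565600, 1514803200)),
  ITEM_ID    = c(110, 110, 112, 112, 114, 111)
)

# Join on a copy of START: in the result the join column carries the
# sale's ORDER_TIME, so the program's own START column stays intact.
prog[, JOIN_TIME := START]

# roll = TRUE carries the last program with JOIN_TIME <= ORDER_TIME forward,
# within each CH / ITEM_ID group; sales with no earlier program get NA.
res <- prog[sale, on = .(CH, ITEM_ID, JOIN_TIME = ORDER_TIME), roll = TRUE]

# Restore the sale-oriented column name and order
setnames(res, "JOIN_TIME", "ORDER_TIME")
setcolorder(res, c("CUST_ID", "CH", "ORDER_TIME", "ITEM_ID", "START", "END"))
res
```

On the sample data this should give the same START and END values as the double join, including NA for rows 2 and 4, and it leaves the original sales table unmodified because the result is a new table.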