Question

Consider a data.table of the following form:

     seller    buyer      month  
1: 50536344 61961225 1993-01-01  
2: 50536344 61961225 1993-02-01 
3: 50536344 61961225 1993-04-01 
4: 50536344 61961225 1993-05-01 
5: 50536344 61961225 1993-06-01

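I have (buyer, seller) pairs observed over time, and I want to mark the start and end of each spell for each pair. For example, the pair above is present from January to February, absent in March, and present again from April to June. The expected output is therefore: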

     seller    buyer      month  start    end
1: 50536344 61961225 1993-01-01   True  False
2: 50536344 61961225 1993-02-01  False   True
3: 50536344 61961225 1993-04-01   True  False
4: 50536344 61961225 1993-05-01  False  False
5: 50536344 61961225 1993-06-01  False   True

Answer 1

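Assuming month is of class Date (or a similar class that has a diff method, such as POSIXt or data.table's IDate), you can do this with the diff function. For a Date column, diff returns differences in days, which is what the 31-day threshold below relies on.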
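If month was read in as character, it needs to be converted first. The following is a minimal sketch of that step, assuming ISO-formatted strings; the three-row table is made-up sample data based on the question, not the asker's real input.

```r
library(data.table)

# Hypothetical input where month was read in as character
dt <- data.table(
  seller = 50536344L,
  buyer  = 61961225L,
  month  = c("1993-01-01", "1993-02-01", "1993-04-01")
)

# Convert in place so that diff() returns differences in days
dt[, month := as.Date(month)]

class(dt$month)  # "Date"
diff(dt$month)   # Time differences in days: 31 59
```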

# sort data.table
setkeyv(dt, c("seller", "buyer", "month"))
# define start
dt[, start := c(TRUE, diff(month) > 31), by = list(seller, buyer)]
# define end
dt[, end := c(diff(month) > 31, TRUE), by = list(seller, buyer)]
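To check the two statements above against the question, here is a self-contained sketch that rebuilds the five sample rows (typed in by hand from the question) and runs the same code on them.

```r
library(data.table)

# Sample rows copied from the question
dt <- data.table(
  seller = 50536344L,
  buyer  = 61961225L,
  month  = as.Date(c("1993-01-01", "1993-02-01", "1993-04-01",
                     "1993-05-01", "1993-06-01"))
)

# Sort by pair and month, then flag spell boundaries within each pair
setkeyv(dt, c("seller", "buyer", "month"))
dt[, start := c(TRUE, diff(month) > 31), by = list(seller, buyer)]
dt[, end := c(diff(month) > 31, TRUE), by = list(seller, buyer)]

dt$start  # TRUE FALSE  TRUE FALSE FALSE
dt$end    # FALSE  TRUE FALSE FALSE  TRUE
```

The day differences here are 31, 59, 30 and 31, so only the February-to-April step exceeds the threshold, which reproduces the expected output.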

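Edit: Following @David Arenburg's suggestion, you can of course define start and end in a single call. This should be slightly faster, although I find it a bit harder to read.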

dt[, ":=" (start = c(TRUE, diff(month) > 31),
           end = c(diff(month) > 31, TRUE)), 
   by = list(seller, buyer)]

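EDIT2: Some more explanation of what is going on: the first observation for each seller-buyer pair is always the start of a business relationship, hence start = c(TRUE, ...). After that, a further observation is the start of a relationship if and only if the time difference to the previous observation is greater than one month (31 days), hence diff(month) > 31. Putting the two together gives c(TRUE, diff(month) > 31). The same logic applies to end, except that you compare with the next observation instead of the previous one, and the last observation of each pair is always an end, hence c(diff(month) > 31, TRUE).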
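The 31-day threshold works when every date falls on the first of the month, as in the question. If the day of month can vary, one possible variant (not part of the original answer) is to compare calendar-month indices instead of day counts. In the sketch below, month_idx is a hypothetical helper column and the dates are made up to show the difference.

```r
library(data.table)

# Made-up dates: January and February are consecutive months but 58 days apart
dt <- data.table(
  seller = 50536344L,
  buyer  = 61961225L,
  month  = as.Date(c("1993-01-01", "1993-02-28", "1993-04-10"))
)
setkeyv(dt, c("seller", "buyer", "month"))

# Hypothetical helper column: a running count of calendar months
dt[, month_idx := data.table::year(month) * 12L + data.table::month(month)]

# A gap is any jump of more than one calendar month
dt[, `:=`(start = c(TRUE, diff(month_idx) > 1L),
          end   = c(diff(month_idx) > 1L, TRUE)),
   by = list(seller, buyer)]

dt$start  # TRUE FALSE  TRUE
dt$end    # FALSE  TRUE  TRUE
```

With diff(month) > 31 the same rows would be flagged as three separate spells, because both day differences (58 and 41) exceed 31.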
