Question

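I have a data.table and I want to flag entries that fall within 90 days of a previous entry for a given group ID. The context is that these are buy signals for trades. I don't want repeats within the 90-day window, because I assume I hold the position for 90 days, so I have already bought it (and I don't want to restart the clock).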

So I have:

library(data.table)
> dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
> dt
   id       date
1:  A 2017-01-01
2:  A 2017-02-01
3:  A 2017-05-01
4:  B 2017-01-01
5:  B 2017-05-01
6:  B 2017-10-01
7:  C 2017-01-01
8:  C 2017-02-01
9:  C 2017-02-15

I feel I should be able to do this with .SD, but I can't figure it out. Thanks for your help!

Answer 1

You can use difftime:

# Data
library(data.table)
dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))

# Difference in days    
dt[, with.90d := as.numeric(difftime(date, shift(date), units = "days")) < 90, id]
dt[is.na(with.90d), with.90d := FALSE]

#    id       date with.90d
# 1:  A 2017-01-01    FALSE
# 2:  A 2017-02-01     TRUE
# 3:  A 2017-05-01     TRUE
# 4:  B 2017-01-01    FALSE
# 5:  B 2017-05-01    FALSE
# 6:  B 2017-10-01    FALSE
# 7:  C 2017-01-01    FALSE
# 8:  C 2017-02-01     TRUE
# 9:  C 2017-02-15     TRUE

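Explanation:

Calculate the time difference using difftime(). The difference is taken between date and the shifted date, computed by group (id).
Check whether the difference is less than 90 days.
For the first date in each group the difference is NA (detected with is.na()), so set it to FALSE.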
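
To see what the one-liner above actually compares, the intermediate values can be printed side by side. This is a minimal sketch; the column names prev_date and gap_days are illustrative and not part of the original answer.

```r
library(data.table)

dt <- data.table(
  id   = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01",
                   "2017-01-01", "2017-05-01", "2017-10-01",
                   "2017-01-01", "2017-02-01", "2017-02-15"))
)

# shift() lags the date by one row within each id;
# the first row of every group has no predecessor and gets NA
dt[, prev_date := shift(date), by = id]

# gap to the immediately preceding entry, in days (NA for the first row of a group)
dt[, gap_days := as.numeric(difftime(date, prev_date, units = "days"))]
dt
```

Row 3 of group A is 89 days after 2017-02-01, which is why this approach marks it TRUE even though it lies 120 days after 2017-01-01. The comparison is always against the immediately preceding entry, not against the start of a holding period.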

Answer 2

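It sounds like you want to compare against all previous trades, to make sure the current trade is not within 90 days of any of them. To do that, you could try: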

dt[order(id, date), with.90d := sapply(1:(.N), function(i) all(difftime(date[i], date[1:(i-1)], units = "days") < 90) & i != 1L), by = id]

dt
#   id       date with.90d
#1:  A 2017-01-01    FALSE
#2:  A 2017-02-01     TRUE
#3:  A 2017-05-01    FALSE
#4:  B 2017-01-01    FALSE
#5:  B 2017-05-01    FALSE
#6:  B 2017-10-01    FALSE
#7:  C 2017-01-01    FALSE
#8:  C 2017-02-01     TRUE
#9:  C 2017-02-15     TRUE

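This takes the difference between the current date and every previous date (within the group) and checks whether all of those differences are < 90 days. If any of them is >= 90, the row is flagged FALSE. The first row of each group has no previous date and is always FALSE (the i != 1L condition). Note that I used all() to return the logical, but you could instead compare against min() of the previous dates, which may be faster.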
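
When the dates are sorted within each group, all earlier differences are below 90 exactly when the difference to the group's first date is below 90. Under that assumption, the sapply() loop could be sketched as a single vectorised comparison. This is an illustrative alternative, not the answer's own code.

```r
library(data.table)

dt <- data.table(
  id   = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01",
                   "2017-01-01", "2017-05-01", "2017-10-01",
                   "2017-01-01", "2017-02-01", "2017-02-15"))
)

# sort so that date[1] is the earliest date within each id
setorder(dt, id, date)

# TRUE when the entry is not the first of its group and lies
# fewer than 90 days after the group's first date
dt[, with.90d := seq_len(.N) > 1L & as.numeric(date - date[1L]) < 90, by = id]
dt
```

Like the sapply() version, this anchors every comparison at the first date of the group, so the 90-day window never restarts after it has expired.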

Answer 3

You can also use base functions:

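# Assumes dt is sorted by id and date; cumsum(diff(x)) is the number of days since the first date in each id
transform(dt, with.90days = unlist(by(dt$date, dt$id, function(x) c(FALSE, cumsum(as.numeric(diff(x))) < 90))))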
   id       date with.90days
1:  A 2017-01-01       FALSE
2:  A 2017-02-01        TRUE
3:  A 2017-05-01       FALSE
4:  B 2017-01-01       FALSE
5:  B 2017-05-01       FALSE
6:  B 2017-10-01       FALSE
7:  C 2017-01-01       FALSE
8:  C 2017-02-01        TRUE
9:  C 2017-02-15        TRUE

Answer 4

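The OP has requested:

Starting from the first observation X in each group, I want to flag any other observation that is less than 90 days away from X. Then, for the next observation that is more than 90 days away from X, call it observation Y, I want to flag any observations within 90 days of Y. Repeat.

If I understand the expected result correctly, a value of FALSE in column with.90d marks the start of a 90-day period.

Unfortunately, the start of the next 90-day period depends on the date of the first observation after the previous 90-day period has expired. So we cannot use fixed 90-day intervals counted from the first date in each group.

I tried to find a solution using non-equi joins or rolling joins, but so far I have ended up with a recursive approach: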

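# dt3 is defined at the end of this answer
dt3[, with.90d := NA]
# repeat until every row has been classified as TRUE or FALSE
while (dt3[, any(is.na(with.90d))]) {
  dt3[is.na(with.90d), cd := date - min(date), by = id][
    is.na(with.90d) & cd == 0, with.90d := FALSE][
      is.na(with.90d) & cd <= 90, with.90d := TRUE]
}
dt3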

    id       date with.90d      cd
 1:  A 2017-01-01    FALSE  0 days
 2:  A 2017-02-01     TRUE 31 days
 3:  A 2017-05-01    FALSE  0 days
 4:  B 2017-01-01    FALSE  0 days
 5:  B 2017-05-01    FALSE  0 days
 6:  B 2017-10-01    FALSE  0 days
 7:  C 2017-01-01    FALSE  0 days
 8:  C 2017-02-01     TRUE 31 days
 9:  C 2017-02-15     TRUE 45 days
10:  D 2017-03-01    FALSE  0 days
11:  D 2017-04-01     TRUE 31 days
12:  D 2017-05-01     TRUE 61 days
13:  D 2017-06-01    FALSE  0 days
14:  D 2017-07-01     TRUE 30 days
15:  D 2017-08-01     TRUE 61 days
16:  E 2017-01-01    FALSE  0 days
17:  E 2017-02-01     TRUE 31 days
18:  E 2017-03-01     TRUE 59 days
19:  E 2017-04-01     TRUE 90 days
20:  E 2017-05-01    FALSE  0 days
21:  E 2017-06-01     TRUE 31 days
    id       date with.90d      cd

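Note that I have added two more groups, D and E, to the OP's sample dataset to verify the approach more thoroughly. Also note the different results for group D, which starts on 2017-03-01, and group E, which starts on 2017-01-01.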

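Explanation

As long as there are NA values in with.90d, the following sequence is repeated for the NA rows (rows that are already TRUE or FALSE are done):

Compute the difference from the first date in each group. Note that min(date) is used, which also works on unordered datasets. Alternatively, setorder(dt3, date) and first(date) (or date[1]) could be used.
Rows with a date difference of 0 mark the start of a new period and are flagged FALSE.
Rows with a date difference less than or equal to 90 days are flagged TRUE.
All other rows are left unchanged, i.e., they keep their NA value.

For illustration, I have kept the helper column cd. It can be removed with dt3[, cd := NULL].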
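
The same restart-the-clock logic can also be sketched as a single pass over each group's sorted dates. The helper flag_within_window() below is hypothetical (it is not part of the original answer) and uses <= 90 to match the cd <= 90 condition above. The sample data is a reduced version with groups A and E only.

```r
library(data.table)

# Hypothetical helper: flags dates that fall within `window` days
# of the start of the current holding period. Assumes ascending dates.
flag_within_window <- function(date, window = 90) {
  flagged <- logical(length(date))      # all FALSE to begin with
  start <- date[1L]                     # the first entry opens the first period
  for (i in seq_along(date)[-1L]) {
    if (as.numeric(date[i] - start) <= window) {
      flagged[i] <- TRUE                # still inside the current period
    } else {
      start <- date[i]                  # period expired: this entry opens a new one
    }
  }
  flagged
}

dt <- data.table(
  id   = c(rep("A", 3), rep("E", 6)),
  date = c(as.Date(c("2017-01-01", "2017-02-01", "2017-05-01")),
           seq(as.Date("2017-01-01"), length.out = 6, by = "1 month"))
)

setorder(dt, id, date)                  # the helper relies on sorted dates per id
dt[, with.90d := flag_within_window(date), by = id]
dt
```

This visits every row once per group instead of rescanning the table on each pass of the while loop, at the cost of an explicit R-level loop inside each group.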

Data

# OP's sample dataset
dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), 
                 date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
# append group D
dt2 <- dt[, .(id = c(id, rep("D", 6)), 
              date = c(date, seq(as.Date("2017-03-01"), length.out = 6, by = "1 month")))]
# append group E
dt3 <- dt2[, .(id = c(id, rep("E", 6)), 
               date = c(date, seq(as.Date("2017-01-01"), length.out = 6, by = "1 month")))]
