Question

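I have a set of observations, one recorded each time a user performs an action. I want to keep only those observations of a user that are at least six months apart.

So, if a user performed the action on "2018-01-01", "2018-03-01", and "2018-07-01", I want to keep only "2018-01-01" and "2018-07-01".

Likewise, if a user performed the action on "2018-01-01", "2018-03-01", "2018-07-01", and "2019-03-01", I want to keep only "2018-01-01", "2018-07-01", and "2019-03-01".

So far, I have only managed to write long and unworkable code.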

# What I want to achieve
library(data.table)

dataIhave <- data.table(id    = c(1, 1, 1, 1, 2, 2, 3, 4), 
                        dates = c("2018-01-01", 
                                  "2018-03-01",
                                  "2018-07-01",
                                  "2019-01-01",
                                  "2018-01-03", 
                                  "2018-07-02", 
                                  "2018-02-01",
                                  "2018-02-01"))

dataIwant <- data.table(id    = c(1, 1, 1, 2, 3, 4), 
                        dates = c("2018-01-01", 
                                  "2018-07-01",
                                  "2019-01-01",
                                  "2018-01-03", 
                                  "2018-02-01",
                                  "2018-02-01"))

Answer 1

Here is a rolling-join variant of @Uwe's answer:

library(lubridate)
dataIhave[, dates := as.IDate(dates)]

ids = unique(dataIhave$id)

dataIhave[, seq := NA_integer_]
s = 1L
w = dataIhave[.(ids), on=.(id), mult="first", which = TRUE]
dataIhave[w, seq := s]
while (TRUE){
  w = dataIhave[
    dataIhave[w, .(id, dates = dates %m+% months(6))], 
    on = .(id, dates), roll = -Inf, nomatch = 0, which = TRUE
  ]

  if (!length(w)) break
  s = s + 1L
  dataIhave[w, seq := s]
}

dataIhave[!is.na(seq)]

   id      dates seq
1:  1 2018-01-01   1
2:  1 2018-07-01   2
3:  1 2019-01-01   3
4:  2 2018-01-03   1
5:  3 2018-02-01   1
6:  4 2018-02-01   1

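The loop takes the rows indexed by w, shifts their dates forward by six months, and looks up the next row (if any) for the same id. The arguments to the join are:

the tables, in the join syntax x[i, ...]:
- x = dataIhave
- i = dataIhave[w, .(id, dates = dates %m+% months(6))]
on = .(id, dates): the columns to match on
roll = -Inf: when there is no exact match, roll to the next row on the last column in on=
nomatch = 0: skip rows of i that find no match
which = TRUE: return the matched row numbers of x

Additionally, in case there are duplicate dates (see the second example in @Uwe's post):

mult = "first": take only the first match for each row of i

When selecting the first row by id before the loop, I assume the data is sorted by dates within each id (so I don't use order as @Uwe's answer does).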
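To see what roll = -Inf does in isolation, here is a minimal sketch on a toy table. The tables x and i and their values are hypothetical and not part of the original data; the lookup date has no exact match, so the join falls through to the next later date for that id.

```r
library(data.table)

# toy table (hypothetical): one id with three dates
x <- data.table(id = 1L,
                dates = as.IDate(c("2018-01-01", "2018-03-01", "2018-07-15")))

# lookup date that does not occur in x
i <- data.table(id = 1L, dates = as.IDate("2018-07-01"))

# roll = -Inf carries the next observation backward, so the lookup
# date is matched to the first row of x dated on or after it
w <- x[i, on = .(id, dates), roll = -Inf, nomatch = 0, which = TRUE]
w
```

Here w points at the third row of x (2018-07-15). A lookup date later than every date in x would find no match, which is what ends the while loop in the answer.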

Answer 2

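If I understand correctly, the OP wants to remove dates that fall less than 6 months after the start of a period, and to start a new period at the first date that is 6 months or more after the start of the previous period (separately for each id).

I don't see how this could be achieved with a non-recursive rolling or non-equi join, because there is no fixed grid of dates. So I believe this requires some kind of recursive approach. Here is one possibility: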

library(data.table)
library(lubridate)
dataIhave[, dates := as.Date(dates)]
dataIhave[, keep := TRUE]
dataIhave[order(id, dates)
  , keep := {
    start <- dates[1L]
    for (i in tail(seq_along(dates), -1L)) {
      if (dates[i] < start %m+% months(6)) {
        keep[i] <- FALSE
      } else {
        start <- dates[i]
      }
    }
    keep
  }, by = id][]

   id      dates  keep
1:  1 2018-01-01  TRUE
2:  1 2018-03-01 FALSE
3:  1 2018-07-01  TRUE
4:  1 2019-01-01  TRUE
5:  2 2018-01-03  TRUE
6:  2 2018-07-02 FALSE
7:  3 2018-02-01  TRUE
8:  4 2018-02-01  TRUE

Finally,

dataIhave[(keep), -"keep"]

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01

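Second test case

The key point here is to detect the start of a new period (within each id).

As another test case, I have added two dates to id == 1: 2018-07-01 and 2018-07-02.
2018-07-01 is a duplicate. Both added dates should be removed because they fall within the second 6-month period, which starts on 2018-07-01.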

dataIhave <- fread("
 id      dates
  1 2018-01-01
  1 2018-03-01
  1 2018-07-01
  1 2018-07-01
  1 2018-07-02
  1 2019-01-01
  2 2018-01-03
  2 2018-07-02
  3 2018-02-01
  4 2018-02-01")

Indeed, the code above returns the same output as for the OP's original test case.
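The period-start logic can also be traced on a plain vector, outside data.table. The sketch below assumes a helper named starts_period (a hypothetical name, not from the answer) that repeats the same loop and applies it to the id == 1 dates of the second test case.

```r
library(lubridate)

# hypothetical helper: TRUE for each date that starts a new six-month period
starts_period <- function(dates) {
  keep <- rep(TRUE, length(dates))
  start <- dates[1L]
  for (i in seq_along(dates)[-1L]) {
    if (dates[i] < start %m+% months(6)) {
      keep[i] <- FALSE      # still inside the current period
    } else {
      start <- dates[i]     # this date opens a new period
    }
  }
  keep
}

# dates of id == 1 in the second test case (assumed already sorted)
d <- as.Date(c("2018-01-01", "2018-03-01", "2018-07-01",
               "2018-07-01", "2018-07-02", "2019-01-01"))
kept <- d[starts_period(d)]
kept
```

Only 2018-01-01, the first 2018-07-01, and 2019-01-01 remain: the duplicate and 2018-07-02 both fall before 2018-07-01 plus six months.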

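Removing rows only within the first six months for each id

If the question is instead interpreted as removing only the entries that fall within the first 6 months for each id, while keeping all dates after those 6 months, this can be achieved with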

dataIhave[!dataIhave[, .I[dates < dates[1L] %m+% months(6L)][-1L], by = id]$V1]

which returns

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2018-07-01
4:  1 2018-07-02
5:  1 2019-01-01
6:  2 2018-01-03
7:  3 2018-02-01
8:  4 2018-02-01

for the second test case. (Note that this is a simplified version of Jaap's answer.)
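It may help to look at the inner expression of that one-liner on its own. The sketch below uses a small hypothetical table dt (not the OP's data) and stores the row numbers in a variable named drop_rows, which is an illustrative name only.

```r
library(data.table)
library(lubridate)

# small hypothetical table: id 1 has three dates, id 2 has one
dt <- data.table(id    = c(1, 1, 1, 2),
                 dates = as.Date(c("2018-01-01", "2018-03-01",
                                   "2018-07-01", "2018-01-03")))

# per id: row numbers (.I) of dates inside the first six months,
# minus the first of them ([-1L]), so the first date is never dropped
drop_rows <- dt[, .I[dates < dates[1L] %m+% months(6L)][-1L], by = id]$V1
drop_rows

# negating the row numbers in i keeps every other row
res <- dt[!drop_rows]
res
```

In this toy case only row 2 (2018-03-01) is selected for removal; id 2 contributes nothing because its single date is also its first.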

Answer 3

Another variant:

library(lubridate)
library(data.table)

dataIhave[, dates := as.Date(dates)]

dataIhave[, keep := dates >= dates[1] %m+% months(6), by = id
          ][dataIhave[, .I[1], by = id][[2]], keep := TRUE
            ][!!keep, -"keep"]

which gives:

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01

Answer 4

Using a non-equi join and igraph to avoid implicit loops and recursion:

#data prep
dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
setorder(dataIhave[, rn:=rowid(id)], id, dates)
dataIhave[, end := as.IDate(sapply(dates, 
    function(d) seq(d, by="6 months", length.out=2L)[2L]))]

#non-equi self join to find first date that is after 6months
nonequi <- dataIhave[dataIhave, on=.(id, dates>=end), mult="first", by=.EACHI,
    .(i.id, i.rn, x.rn, i.dates, x.dates)]

library(igraph)
nonequi[, {
        #create graph from the previous join
        g <- graph_from_data_frame(.SD[, .(i.rn, x.rn)])
        #plot(g)

        #find the leaf nodes
        leaf <- sapply(V(g), function(x) length(neighbors(g,x))==0L)

        #from the first date (i.e. node = V(g)["1"]), find the path starting from this date.
        path <- get.all.shortest.paths(g, V(g)["1"], leaf)$res

        #return all dates (i.e. nodes) in this path
        .(dates=i.dates[i.rn %in% na.omit(V(g)[path[[1L]]]$name)])
    },
    by=.(id=i.id)]

Output:

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01

Or a recursive approach similar to Uwe's solution:

dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
unique(dataIhave[,
    .(dates=as.IDate(Reduce(
        function(x, y) if (y >= seq(x, by="6 months", length.out=2L)[2L]) y else x,
        dates,
        accumulate=TRUE))),
    .(id)])

Output:

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01
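To show what Reduce(..., accumulate = TRUE) carries along, here is a base R sketch on the four dates of id 1 from the question. The names d, starts, and kept are illustrative only. Reduce simplifies its result to plain numbers, which is why the values are converted back to dates afterwards.

```r
# dates of id 1 from the question, already sorted
d <- as.Date(c("2018-01-01", "2018-03-01", "2018-07-01", "2019-01-01"))

# x is the current period start, y the next date; y replaces x only
# when it lies six months or more after x
starts <- Reduce(
  function(x, y) if (y >= seq(x, by = "6 months", length.out = 2L)[2L]) y else x,
  d,
  accumulate = TRUE
)

# Reduce drops the Date class, so restore it before de-duplicating
starts <- as.Date(starts, origin = "1970-01-01")
starts
kept <- unique(starts)
kept
```

The accumulated vector repeats 2018-01-01 for the dropped 2018-03-01, so unique() leaves exactly the three period starts.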

Answer 5

library(lubridate)
library(data.table)

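dataIhave[, dates := ymd(dates)]
# Assumption: the original call lacked difftime()'s second argument;
# here each date is compared with the first date of its id
dataIhave[, difDates := as.numeric(difftime(dates, dates[1L], units = "weeks")), by = id]

# keep rows on each id's first date plus rows at least 24 weeks later
dataIhave[difDates == 0 | difDates >= 24, .(id, dates)]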

Does this produce the result you want?

The length of a month is irregular, so you have to stick to time units of fixed duration.
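As a quick illustration of that irregularity, the following base R sketch measures two consecutive "six-month" spans taken from the question's dates. The variable names are illustrative only.

```r
# two consecutive six-month spans, measured in days
first_half  <- as.numeric(difftime(as.Date("2018-07-01"), as.Date("2018-01-01"),
                                   units = "days"))
second_half <- as.numeric(difftime(as.Date("2019-01-01"), as.Date("2018-07-01"),
                                   units = "days"))

c(first_half, second_half)      # 181 and 184 days
c(first_half, second_half) / 7  # the same spans in weeks
```

Because the two spans differ by three days, any fixed threshold in weeks or days only approximates "six months", so results near the boundary can differ from the calendar-based %m+% months(6) used in the other answers.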

You can also check ?lubridate::interval, lubridate::as.duration, and the following question: Time difference in years with lubridate?

Filter observations based on a specific date condition using data.table
