Question

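I have a dataset of possibly overlapping time periods that shows whether someone was present (example_df). I would like to obtain a dataset that splits the larger period (from 2014-01-01 to 2014-10-31) into smaller periods in which someone was present (present = 1) and periods in which nobody was present (present = 0). The result should look like result_df.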

Example data frame

example_df <- data.frame(ID = 1, 
                     start = c(as.Date("2014-01-01"), as.Date("2014-03-05"), as.Date("2014-06-13"), as.Date("2014-08-15")), 
                     end = c(as.Date("2014-04-07"), as.Date("2014-04-12"), as.Date("2014-08-05"), as.Date("2014-10-02")), 
                     present = 1)

The result should look like this

result_df <- data.frame(ID = 1, 
                     start = c(as.Date("2014-01-01"), as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02")), 
                     end = c(as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02"), as.Date("2014-10-31")), 
                     present = c(1, 0, 1, 0, 1, 0))

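I don't know how to approach this, since it requires splitting time periods or adding rows (or something else?). Any help is greatly appreciated!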

Answer 1

I hope this helps too, since I was puzzled by the same problem.

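Like IceCreamToucan's example, this assumes that each person ID is handled independently. The approach uses dplyr to find overlapping date ranges and then flattens them into single ranges. Other examples of this approach, also using dplyr, have been described on stackoverflow. The result of this first step contains the time ranges during which the person was present.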

library(tidyr)
library(dplyr)

pres <- example_df %>%
  group_by(ID) %>%
  arrange(start) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end), present = 1) %>%
  select(-indx)
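
To make the indx step above easier to follow, here is a minimal base R sketch of the same grouping idea applied to the example data. The variable names (running_end, starts_new_group, merged) are my own and only for illustration; the intervals are assumed to be sorted by start date already.

```r
# Example intervals from the question (assumed already sorted by start date)
start <- as.Date(c("2014-01-01", "2014-03-05", "2014-06-13", "2014-08-15"))
end   <- as.Date(c("2014-04-07", "2014-04-12", "2014-08-05", "2014-10-02"))
n <- length(start)

# Latest end date seen so far, as plain numbers
running_end <- cummax(as.numeric(end))

# TRUE where the next interval starts after everything seen so far has ended
starts_new_group <- as.numeric(start[-1]) > running_end[-n]

# Group index: overlapping intervals share the same value
indx <- c(0, cumsum(starts_new_group))
print(indx)
# [1] 0 0 1 2

# Collapse each group to its earliest start and latest end
merged <- data.frame(
  start = as.Date(as.vector(tapply(as.numeric(start), indx, min)), origin = "1970-01-01"),
  end   = as.Date(as.vector(tapply(as.numeric(end), indx, max)), origin = "1970-01-01")
)
print(merged)
```

The first two intervals overlap, so they share index 0 and collapse into one range from 2014-01-01 to 2014-04-12.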

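Then additional rows can be added to mark the periods of absence. For a given ID, these are the gaps between each end date and the next (later) start date. Finally, the result is sorted by ID and start date.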

result <- pres

for (i in unique(pres$ID)) {
  pres_i <- subset(pres, ID == i)
  if (nrow(pres_i) > 1) {
    adding <- data.frame(ID = i, start = pres_i$end[-nrow(pres_i)]+1, end = pres_i$start[-1]-1, present = 0)
    adding <- adding[adding$start <= adding$end, ]
    result <- bind_rows(result, adding)
  }
}
result[order(result$ID, result$start), ]

# A tibble: 5 x 4
# Groups:   ID [1]
     ID start      end        present
  <dbl> <date>     <date>       <dbl>
1     1 2014-01-01 2014-04-12       1
2     1 2014-04-13 2014-06-12       0
3     1 2014-06-13 2014-08-05       1
4     1 2014-08-06 2014-08-14       0
5     1 2014-08-15 2014-10-02       1
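
Note that the output above stops at 2014-10-02, while the question also asks for the final absence up to 2014-10-31. Below is a rough base R sketch of one way to include that trailing gap. It starts from the merged present periods, and period_end is a name I made up for the overall end date taken from the question. It assumes the first present period begins at the start of the overall period, so no leading gap is handled.

```r
# Merged present periods (taken from the output above)
pres_start <- as.Date(c("2014-01-01", "2014-06-13", "2014-08-15"))
pres_end   <- as.Date(c("2014-04-12", "2014-08-05", "2014-10-02"))

# Assumption: overall period ends on 2014-10-31, as stated in the question
period_end <- as.Date("2014-10-31")

# Each gap starts the day after a present period ends and runs until the day
# before the next one starts; the last gap runs to period_end
gaps <- data.frame(
  ID      = 1,
  start   = pres_end + 1,
  end     = c(pres_start[-1] - 1, period_end),
  present = 0
)
# Drop empty gaps (e.g. when a present period ends exactly on period_end)
gaps <- gaps[gaps$start <= gaps$end, ]

present_rows <- data.frame(ID = 1, start = pres_start, end = pres_end, present = 1)
full_result <- rbind(present_rows, gaps)
full_result <- full_result[order(full_result$ID, full_result$start), ]
print(full_result)
```

This gives six rows alternating between present = 1 and present = 0, with the last row covering 2014-10-03 to 2014-10-31.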

Answer 2

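Assuming you want to do this separately for each ID, you can create a data table containing every date on which someone was present and join it with a table of all dates in the period. The result is not exactly the same as yours, because here the present and absent periods do not overlap.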

library(data.table)
setDT(example_df)


example_df[, {
  pres <- unique(unlist(Map(`:`, start, end)))
  class(pres) <- 'Date'
  all <- min(pres):max(pres)
  class(all) <- 'Date'
  pres <- data.table(day = pres)
  all <- data.table(day = all)
  out.full <- pres[all, on = .(day), .(day = i.day, present = +!is.na(x.day))]
  out.full[, .(start = min(day), end = max(day)), 
           by = .(present, rid = rleid(present))][, -'rid']
  }, by = ID]

#    ID present      start        end
# 1:  1       1 2014-01-01 2014-04-12
# 2:  1       0 2014-04-13 2014-06-12
# 3:  1       1 2014-06-13 2014-08-05
# 4:  1       0 2014-08-06 2014-08-14
# 5:  1       1 2014-08-15 2014-10-02
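
For anyone without data.table, the same day-by-day idea can be sketched in base R with rle(). This is only an illustration under my own assumptions: a single ID, and an overall period fixed to 2014-01-01 through 2014-10-31 as in the question (so the trailing absence is included, unlike the output above). The names period, present and day_result are hypothetical.

```r
# Example intervals from the question
starts <- as.Date(c("2014-01-01", "2014-03-05", "2014-06-13", "2014-08-15"))
ends   <- as.Date(c("2014-04-07", "2014-04-12", "2014-08-05", "2014-10-02"))

# Assumption: the overall period is the one stated in the question
period <- seq(as.Date("2014-01-01"), as.Date("2014-10-31"), by = "day")

# Mark every day covered by at least one interval
present <- integer(length(period))
for (k in seq_along(starts)) {
  present[period >= starts[k] & period <= ends[k]] <- 1L
}

# Run-length encode the 0/1 vector and turn each run into one row
runs  <- rle(present)
last  <- cumsum(runs$lengths)        # index of the last day of each run
first <- last - runs$lengths + 1     # index of the first day of each run
day_result <- data.frame(
  ID      = 1,
  start   = period[first],
  end     = period[last],
  present = runs$values
)
print(day_result)
```

Expanding to individual days is simple but costs one element per day, so for very long periods or many IDs the interval-based approach may be preferable.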

How do I add rows for the time periods that fall between given time periods?
