Spreading an observation across multiple rows, from its start to its end

Time: 2019-02-21 20:58:21

Tags: r time-series

I have time series data:

        start_date_time    ...   process_duration_in_hours           end_date_time  
    2019-01-01 05:37:19    ...                       28,78     2019-01-02 10:24:24 
    2019-01-01 03:15:01    ...                       12,00     2019-01-01 15:15:01

where ... stands for some other features (columns).

I need to get the data in the following form:

    start_date   ...   process_duration_in_hours
    2019-01-01   ...                       18,37
    2019-01-01   ...                       12,00
    2019-01-02   ...                       10,41

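If an observation has a process_duration_in_hours that is longer than the time left in its start day, I want to spread this observation over the next day, keeping all the ... features and changing the value of process_duration_in_hours, which for the next day must equal the remaining process duration. Also, a process may take more than one day.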

2 answers:

Answer 0 (score: 1)

You could do:

library(data.table)
library(lubridate)

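# parse the character timestamps
df$start_date_time <- as.POSIXct(df$start_date_time)
df$end_date_time <- as.POSIXct(df$end_date_time)

# reps = number of output rows per process, id = original row number
df <- setDT(df)[, `:=` (reps = pmax(1, floor(process_duration_in_hours / 24) + 1), id = .I)][
  # repeat each row reps times
  , df[df[, rep(.I, reps)]]][
    # for processes spread over several rows, recompute the hours per row
    reps > 1, process_duration_in_hours := {
      # last row: from midnight up to the end time
      process_duration_in_hours[.N] <- difftime(end_date_time[.N], floor_date(end_date_time[.N], "day"), units = "hours");
      # first row: from the start time up to the next midnight
      process_duration_in_hours[1] <- difftime(ceiling_date(start_date_time[1], "day", change_on_boundary = TRUE), start_date_time[1], units = "hours");
      # rows in between still hold the total duration; cap them at a full day
      process_duration_in_hours[process_duration_in_hours > 24] <- 24;
      round(process_duration_in_hours, 2)
    }, by = id][
      # replace the timestamp by consecutive dates, one per repeated row
      , start_date_time := as.Date(substr(start_date_time, 1, 10)) + (0:(.N - 1)), by = id][
        # drop the helper columns
        , c("reps", "id", "end_date_time") := NULL]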

I used slightly more complex data:

df <- data.frame(
  start_date_time = c(
    "2019-01-01 05:37:19",
    "2019-01-01 03:15:01",
    "2019-01-02 04:00:00",
    "2019-01-05 00:00:00"
  ),
  process_duration_in_hours = c(28.78, 12.00, 56.00, 50.00),
  end_date_time = c(
    "2019-01-02 10:24:24",
    "2019-01-01 15:15:01",
    "2019-01-04 12:00:00",
    "2019-01-07 02:00:00"
  ),
  random_col = c("blabla", "dddd", "dddd", "eeee")
)

df

      start_date_time process_duration_in_hours       end_date_time random_col
1 2019-01-01 05:37:19                     28.78 2019-01-02 10:24:24     blabla
2 2019-01-01 03:15:01                     12.00 2019-01-01 15:15:01       dddd
3 2019-01-02 04:00:00                     56.00 2019-01-04 12:00:00       dddd
4 2019-01-05 00:00:00                     50.00 2019-01-07 02:00:00       eeee

Output:

   start_date_time process_duration_in_hours random_col
1:      2019-01-01                     18.38     blabla
2:      2019-01-02                     10.41     blabla
3:      2019-01-01                     12.00       dddd
4:      2019-01-02                     20.00       dddd
5:      2019-01-03                     24.00       dddd
6:      2019-01-04                     12.00       dddd
7:      2019-01-05                     24.00       eeee
8:      2019-01-06                     24.00       eeee
9:      2019-01-07                      2.00       eeee
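
One thing to keep in mind: reps is derived from the duration alone, so a process that starts late in the day can touch more calendar days than floor(duration / 24) + 1 suggests (20 hours starting at 22:00 covers two dates, for instance). A minimal base R sketch of counting the days from the dates instead is shown below. days_touched() is a hypothetical helper that is not part of the answer above, and the timestamps are assumed to be in UTC.

```r
# Hypothetical helper (not part of the answer): number of calendar days an
# interval touches, computed from the dates rather than from the duration.
# Assumes POSIXct inputs; tz defaults to UTC for this sketch.
days_touched <- function(start, end, tz = "UTC") {
  start_day <- as.Date(start, tz = tz)
  # an end at exactly midnight belongs to the previous day: step back 1 second
  end_day <- as.Date(end - 1, tz = tz)
  pmax(1L, as.integer(end_day - start_day) + 1L)
}

# made-up example values, chosen to start late in the day
start <- as.POSIXct(c("2019-01-01 22:00:00",
                      "2019-01-01 23:00:00",
                      "2019-01-05 00:00:00"), tz = "UTC")
dur <- c(20, 47, 50)            # durations in hours
end <- start + dur * 3600       # POSIXct arithmetic works in seconds

by_duration <- floor(dur / 24) + 1       # 1 2 3
by_dates <- days_touched(start, end)     # 2 3 3
```

The two counts agree for the data used in this answer, which is why the output above is correct for those rows.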

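Answer 1 (score: 0)

Here is an alternative solution which uses foverlaps() to split the given time ranges into pieces of one day each and computes process_duration for each piece: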

library(data.table)
library(lubridate)
# create vector of start dates
start_date <- setDT(df)[, seq(floor_date(min(start_date_time), "day"), 
                              max(end_date_time),
                              by = "1 day")]
# create keyed data.table with start and end of each day
day_grid <- data.table(start_date, 
                       end = start_date + days(1), 
                       key = "start_date,end")
# find overlaps of ranges in df with day_grid
df2 <- foverlaps(df, day_grid, by.x = c("start_date_time", "end_date_time"))
# compute durations
df2[, process_duration := difftime(
  pmin(end, end_date_time),
  pmax(start_date, start_date_time),
  units = "hours")][
    # clean up
    process_duration > 0, .(start_date, process_duration, random_col)][
      # sort output
      order(start_date)]
   start_date process_duration random_col
1: 2019-01-01   18.37806 hours     blabla
2: 2019-01-01   12.00000 hours       dddd
3: 2019-01-02   10.40667 hours     blabla
4: 2019-01-02   20.00000 hours       dddd
5: 2019-01-03   24.00000 hours       dddd
6: 2019-01-04   12.00000 hours       dddd
7: 2019-01-05   24.00000 hours       eeee
8: 2019-01-06   24.00000 hours       eeee
9: 2019-01-07    2.00000 hours       eeee

The benefit of this approach is that it can easily be adapted to other time ranges, e.g., hours, weeks, or months.
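
To illustrate how the grid width can be swapped, here is a minimal sketch of the same clipping idea (pmin of the ends minus pmax of the starts) for a single interval in base R, without foverlaps(). split_interval() and its unit argument are hypothetical names introduced for this illustration only, and UTC timestamps are assumed.

```r
# Hypothetical helper: split one interval into pieces on a regular grid and
# return the hours that fall into each grid cell.
# unit is passed both to trunc() and to seq(), which accept "days" and "hours".
split_interval <- function(start, end, unit = c("days", "hours")) {
  unit <- match.arg(unit)
  # left edges of the grid cells, from the cell containing start up to end
  cell_start <- seq(as.POSIXct(trunc(start, unit)), end, by = unit)
  # right edges: the same sequence shifted by one cell
  cell_end <- seq(cell_start[1], by = unit,
                  length.out = length(cell_start) + 1)[-1]
  # clip the interval to each cell
  hours <- as.numeric(difftime(pmin(cell_end, end),
                               pmax(cell_start, start),
                               units = "hours"))
  # drop empty pieces (e.g. when end lies exactly on a cell boundary)
  data.frame(cell_start = cell_start, hours = hours)[hours > 0, ]
}

s <- as.POSIXct("2019-01-01 05:37:19", tz = "UTC")
e <- as.POSIXct("2019-01-02 10:24:24", tz = "UTC")

daily <- split_interval(s, e, "days")    # 2 pieces
hourly <- split_interval(s, e, "hours")  # 30 pieces
```

Both results add up to the same total duration; only the number of pieces changes with the grid.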

A difftime object carries a units attribute. Therefore, the column name has been shortened to process_duration.
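
If a plain numeric column is preferred, for example to round the values as in the other answer, the units attribute can be inspected and then dropped. A small base R sketch, using the second-day piece of the first process as the example value (UTC assumed):

```r
# duration of the piece that falls on 2019-01-02 for the first process
d <- difftime(as.POSIXct("2019-01-02 10:24:24", tz = "UTC"),
              as.POSIXct("2019-01-02 00:00:00", tz = "UTC"),
              units = "hours")

unit_name <- units(d)                            # "hours"
hours_num <- round(as.numeric(d), 2)             # plain number: 10.41
mins_num <- as.numeric(d, units = "mins")        # same span in minutes: 624.4
```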

Data

For comparison, the enhanced dataset from arg0naut's answer is used. The character date-times are coerced to POSIXct right away by ymd_hms().

df <- data.frame(
  start_date_time = ymd_hms(c(
    "2019-01-01 05:37:19",
    "2019-01-01 03:15:01",
    "2019-01-02 04:00:00",
    "2019-01-05 00:00:00"
  )),
  process_duration_in_hours = c(28.78, 12.00, 56.00, 50.00),
  end_date_time = ymd_hms(c(
    "2019-01-02 10:24:24",
    "2019-01-01 15:15:01",
    "2019-01-04 12:00:00",
    "2019-01-07 02:00:00"
  )),
  random_col = c("blabla", "dddd", "dddd", "eeee")
)