I have time series data:
start_date_time ... process_duration_in_hours end_date_time
2019-01-01 05:37:19 ... 28,78 2019-01-02 10:24:24
2019-01-01 03:15:01 ... 12,00 2019-01-01 15:15:01
...
where ... are some features.
I need to get the data in the following form:
start_date ... process_duration_in_hours
2019-01-01 ... 18,37
2019-01-01 ... 12,00
2019-01-02 ... 10,41
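If an observation has a process_duration_in_hours that is longer than the rest of the day, I want to extend this observation into the next day, keeping all the ... features and changing the value of process_duration_in_hours, which on the next day must equal the remaining process duration. Also, a process may span more than one day.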
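To make the expected split concrete, here is a minimal base R sketch for the first row only. It assumes the timestamps are in UTC (the time zone is not stated above), and the names first_day and second_day are just illustrative.

```r
# First observation from the table above; the UTC time zone is an assumption
start    <- as.POSIXct("2019-01-01 05:37:19", tz = "UTC")
end      <- as.POSIXct("2019-01-02 10:24:24", tz = "UTC")
midnight <- as.POSIXct("2019-01-02 00:00:00", tz = "UTC")

# Hours that fall on 2019-01-01 (from start until midnight)
first_day <- as.numeric(difftime(midnight, start, units = "hours"))
# Hours that fall on 2019-01-02 (from midnight until end)
second_day <- as.numeric(difftime(end, midnight, units = "hours"))

round(c(first_day, second_day), 2)  # 18.38 10.41
```

The two parts add up to the original duration of about 28.78 hours.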
Answer 0 (score: 1)
You could do:
library(data.table)
library(lubridate)
df$start_date_time <- as.POSIXct(df$start_date_time)
df$end_date_time <- as.POSIXct(df$end_date_time)
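# reps = number of output rows per observation, id = original row number
df <- setDT(df)[, `:=` (reps = pmax(1, floor(process_duration_in_hours / 24) + 1), id = .I)][
# repeat each row reps times
, df[df[, rep(.I, reps)]]][
# for rows that were repeated, recompute the hours falling on each day
reps > 1, process_duration_in_hours := {
# last day: from midnight up to end_date_time
process_duration_in_hours[.N] <- difftime(end_date_time[.N], floor_date(end_date_time[.N], "day"), units = "hours");
# first day: from start_date_time up to the next midnight
process_duration_in_hours[1] <- difftime(ceiling_date(start_date_time[1], "day", change_on_boundary = TRUE), start_date_time[1], units = "hours");
# any day in between is a full 24 hours
process_duration_in_hours[process_duration_in_hours > 24] <- 24;
round(process_duration_in_hours, 2)
# finally, turn start_date_time into consecutive dates per id and drop the helper columns
}, by = id][, start_date_time := as.Date(substr(start_date_time, 1, 10)) + (0:(.N - 1)), by = id][, c("reps", "id", "end_date_time") := NULL]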
I've used slightly more complex data:
df <- data.frame(
start_date_time = c(
"2019-01-01 05:37:19",
"2019-01-01 03:15:01",
"2019-01-02 04:00:00",
"2019-01-05 00:00:00"
),
process_duration_in_hours = c(28.78, 12.00, 56.00, 50.00),
end_date_time = c(
"2019-01-02 10:24:24",
"2019-01-01 15:15:01",
"2019-01-04 12:00:00",
"2019-01-07 02:00:00"
),
random_col = c("blabla", "dddd", "dddd", "eeee")
)
df
start_date_time process_duration_in_hours end_date_time random_col
1 2019-01-01 05:37:19 28.78 2019-01-02 10:24:24 blabla
2 2019-01-01 03:15:01 12.00 2019-01-01 15:15:01 dddd
3 2019-01-02 04:00:00 56.00 2019-01-04 12:00:00 dddd
4 2019-01-05 00:00:00 50.00 2019-01-07 02:00:00 eeee
Output:
start_date_time process_duration_in_hours random_col
1: 2019-01-01 18.38 blabla
2: 2019-01-02 10.41 blabla
3: 2019-01-01 12.00 dddd
4: 2019-01-02 20.00 dddd
5: 2019-01-03 24.00 dddd
6: 2019-01-04 12.00 dddd
7: 2019-01-05 24.00 eeee
8: 2019-01-06 24.00 eeee
9: 2019-01-07 2.00 eeee
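One thing to be aware of: reps is derived from the duration alone, so a process shorter than 24 hours that crosses midnight would, as far as I can tell, stay on a single row. A possible alternative is to count the calendar days touched instead. The sketch below uses two made-up observations and assumes UTC; reps_by_duration and reps_by_date are illustrative names, not part of the code above.

```r
# Two hypothetical observations: one crosses midnight, one does not
start <- as.POSIXct(c("2019-01-01 20:00:00", "2019-01-01 03:15:01"), tz = "UTC")
end   <- as.POSIXct(c("2019-01-02 06:00:00", "2019-01-01 15:15:01"), tz = "UTC")
duration <- as.numeric(difftime(end, start, units = "hours"))

# Row count as computed above, from the duration alone
reps_by_duration <- pmax(1, floor(duration / 24) + 1)

# Alternative: number of calendar days between start and end, inclusive
reps_by_date <- as.integer(as.Date(end, tz = "UTC") - as.Date(start, tz = "UTC")) + 1L

reps_by_duration  # 1 1
reps_by_date      # 2 1
```

Note that counting by date would give one row too many for a process that ends exactly at midnight, so that case would need separate handling.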
Answer 1 (score: 0)
Here is another solution which uses foverlaps() to split the given time ranges into day-long pieces and computes process_duration for each piece.
library(data.table)
library(lubridate)
# create vector of start dates
start_date <- setDT(df)[, seq(floor_date(min(start_date_time), "day"),
max(end_date_time),
by = "1 day")]
# create keyed data.table with start and end of each day
day_grid <- data.table(start_date,
end = start_date + days(1),
key = "start_date,end")
# find overlaps of ranges in df with day_grid
df2 <- foverlaps(df, day_grid, by.x = c("start_date_time", "end_date_time"))
# compute durations
df2[, process_duration := difftime(
pmin(end, end_date_time),
pmax(start_date, start_date_time),
units = "hours")][
# clean up
process_duration > 0, .(start_date, process_duration, random_col)][
# sort output
order(start_date)]
   start_date process_duration random_col
1: 2019-01-01   18.37806 hours     blabla
2: 2019-01-01   12.00000 hours       dddd
3: 2019-01-02   10.40667 hours     blabla
4: 2019-01-02   20.00000 hours       dddd
5: 2019-01-03   24.00000 hours       dddd
6: 2019-01-04   12.00000 hours       dddd
7: 2019-01-05   24.00000 hours       eeee
8: 2019-01-06   24.00000 hours       eeee
9: 2019-01-07    2.00000 hours       eeee
The advantage of this approach is that it can easily be adapted to different time frames, e.g., hours, weeks, or months.
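As a rough illustration of such an adaptation, the sketch below splits a single made-up process into hourly pieces. The data, the UTC time zone, and the names proc, hour_grid, start_hour, end_hour, and pieces are all assumptions for this example; only foverlaps() and the overall pattern come from the code above.

```r
library(data.table)

# Hypothetical process running from 05:37:19 to 08:10:00 (UTC assumed)
proc <- data.table(
  start_date_time = as.POSIXct("2019-01-01 05:37:19", tz = "UTC"),
  end_date_time   = as.POSIXct("2019-01-01 08:10:00", tz = "UTC")
)

# Keyed grid with the start and end of each hour covering the process
grid_start <- seq(as.POSIXct("2019-01-01 05:00:00", tz = "UTC"),
                  as.POSIXct("2019-01-01 08:00:00", tz = "UTC"),
                  by = "1 hour")
hour_grid <- data.table(start_hour = grid_start, end_hour = grid_start + 3600)
setkey(hour_grid, start_hour, end_hour)

# Find the hours each process overlaps, then measure each overlap in minutes
pieces <- foverlaps(proc, hour_grid, by.x = c("start_date_time", "end_date_time"))
pieces[, minutes := as.numeric(difftime(pmin(end_hour, end_date_time),
                                        pmax(start_hour, start_date_time),
                                        units = "mins"))]
# Drop zero-length overlaps that only touch a grid boundary
pieces <- pieces[minutes > 0, .(start_hour, minutes)]
pieces
```

Only the grid construction and the units change; the overlap join and the pmin()/pmax() clipping stay the same as in the daily version.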
difftime objects carry a units attribute. Therefore, the column name has been shortened to process_duration.
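If a plain numeric column is preferred, the unit can be read off and then dropped explicitly. A minimal base R sketch, assuming UTC timestamps; d and hours are illustrative names:

```r
# Time from midnight to the end of the first process on 2019-01-02 (UTC assumed)
d <- difftime(as.POSIXct("2019-01-02 10:24:24", tz = "UTC"),
              as.POSIXct("2019-01-02 00:00:00", tz = "UTC"),
              units = "hours")

units(d)  # "hours"

# Convert to a bare number of hours and round to two decimals
hours <- round(as.numeric(d, units = "hours"), 2)
hours     # 10.41
```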
For comparison, the enhanced dataset from arg0naut's answer is used. The character date-times are immediately coerced to POSIXct by ymd_hms().
df <- data.frame(
start_date_time = ymd_hms(c(
"2019-01-01 05:37:19",
"2019-01-01 03:15:01",
"2019-01-02 04:00:00",
"2019-01-05 00:00:00"
)),
process_duration_in_hours = c(28.78, 12.00, 56.00, 50.00),
end_date_time = ymd_hms(c(
"2019-01-02 10:24:24",
"2019-01-01 15:15:01",
"2019-01-04 12:00:00",
"2019-01-07 02:00:00"
)),
random_col = c("blabla", "dddd", "dddd", "eeee")
)