R: date-range data frame to hourly duration sums

Asked: 2018-04-05 14:33:04

Tags: r datetime dataframe

I have an R data frame containing the start and end times of events, like this:

             timestamp        endtimestamp 
1  2018-03-27 10:00:27 2018-03-27 10:07:27 
2  2018-03-27 10:27:28 2018-03-27 10:37:58 
3  2018-03-27 10:52:59 2018-03-27 11:01:29 
4  2018-03-27 11:17:59 2018-03-27 11:27:00 
5  2018-03-27 12:03:29 2018-03-27 12:15:59 
6  2018-03-27 12:51:00 2018-03-27 13:01:30 
7  2018-03-27 13:18:31 2018-03-27 13:26:01 
8  2018-03-27 13:42:56 2018-03-27 13:50:56 
9  2018-03-27 14:08:26 2018-03-27 14:21:27 
10 2018-03-27 14:36:02 2018-03-27 14:43:58 

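I want to transform the data into hourly ranges, each holding the total duration of the events that occurred within that hour (for example, an event that starts in one hour and ends in the next is counted only for the part that falls inside each hourly range), resulting in: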

        starttimestamp        endtimestamp    duration
1  2018-03-27 10:00:00 2018-03-27 11:00:00   1471 secs
2  2018-03-27 11:00:00 2018-03-27 12:00:00    630 secs
3  2018-03-27 12:00:00 2018-03-27 13:00:00   1290 secs
4  2018-03-27 13:00:00 2018-03-27 14:00:00   1020 secs
5  2018-03-27 14:00:00 2018-03-27 15:00:00   1257 secs

I think I could do this with a loop, although that feels clunky, and none of the dplyr / magrittr solutions I have tried seem to work.

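For example, the 1471 secs value in the result is calculated as follows:

2018-03-27 10:00:27 to 2018-03-27 10:07:27 = 420 secs

2018-03-27 10:27:28 to 2018-03-27 10:37:58 = 630 secs

2018-03-27 10:52:59 to 2018-03-27 11:00:00 = 421 secs

420 + 630 + 421 = 1471 secs

Note that the final range stops at the hour rather than running on to 11:01:29. The remaining 01:29 (89 secs) is added to the next hour's value.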
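The arithmetic above can be reproduced with a few lines of base R. This is only a sketch to make the clipping rule concrete: the variable names (`bin_end`, `clipped_end`, `carry`) are made up for illustration, and the times are converted to epoch seconds so that plain `pmin()` on numbers does the clipping.

```r
# The three events that touch the 10:00-11:00 range (copied from the question).
start <- as.POSIXct(c("2018-03-27 10:00:27", "2018-03-27 10:27:28",
                      "2018-03-27 10:52:59"), tz = "UTC")
end   <- as.POSIXct(c("2018-03-27 10:07:27", "2018-03-27 10:37:58",
                      "2018-03-27 11:01:29"), tz = "UTC")
bin_end <- as.POSIXct("2018-03-27 11:00:00", tz = "UTC")

# Work in seconds since the epoch so ordinary numeric functions apply.
s <- as.numeric(start)
e <- as.numeric(end)
b <- as.numeric(bin_end)

# Clip each end time at the hour boundary, then take the difference.
clipped_end <- pmin(e, b)
secs <- clipped_end - s
secs        # 420 630 421
sum(secs)   # 1471

# Whatever lies past the boundary is carried into the 11:00-12:00 range.
carry <- e[3] - b
carry       # 89
```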

Any help would be greatly appreciated.

Code to reproduce the data frame:

test <- data.frame(IDX = c(1:10),
           timestamp = c(as.POSIXct("2018-03-27T10:00:27Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                         as.POSIXct("2018-03-27T10:27:28Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                         as.POSIXct("2018-03-27T10:52:59Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                         as.POSIXct("2018-03-27T11:17:59Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                         as.POSIXct("2018-03-27T12:03:29Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                         as.POSIXct("2018-03-27T12:51:00Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                         as.POSIXct("2018-03-27T13:18:31Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                         as.POSIXct("2018-03-27T13:42:56Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                         as.POSIXct("2018-03-27T14:08:26Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                         as.POSIXct("2018-03-27T14:36:02Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")
           ),
           endtimestamp = c(as.POSIXct("2018-03-27T10:07:27Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                            as.POSIXct("2018-03-27T10:37:58Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                            as.POSIXct("2018-03-27T11:01:29Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                            as.POSIXct("2018-03-27T11:27:00Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                            as.POSIXct("2018-03-27T12:15:59Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                            as.POSIXct("2018-03-27T13:01:30Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                            as.POSIXct("2018-03-27T13:26:01Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                            as.POSIXct("2018-03-27T13:50:56Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                            as.POSIXct("2018-03-27T14:21:27Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
                            as.POSIXct("2018-03-27T14:43:58Z", format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")
           ))

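2 Answers:

Answer 0 (score: 1)

This seems to work. The idea is to set a base_time from which you can deduct any excess time. You then take the lag of the excess column so that it lines up with the next row of the duration1 column. The sum of excess and duration1 is duration. Finally, you sum duration grouped by timestamp_hour and endtimestamp_hour to get the final result.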

df %>%
  # Hour range that each event starts in
  mutate(timestamp_hour = floor_date(timestamp, unit = 'hours'),
         endtimestamp_hour = timestamp_hour + hours(1)) %>%
  # Nearest hour boundary, used as the cut point for events that cross it
  mutate(base_time = round_date(timestamp, unit = 'hours')) %>%
  # Seconds after the boundary (0 when the event does not cross it)
  mutate(excess = ifelse((endtimestamp > base_time) & (timestamp < base_time), difftime(endtimestamp, base_time, units = 'secs'), 0)) %>%
  # Seconds before the boundary, or the full duration when nothing is cut
  mutate(duration1 = ifelse((endtimestamp > base_time) & (timestamp < base_time), difftime(base_time, timestamp, units = 'secs'), difftime(endtimestamp, timestamp, units = 'secs'))) %>%
  # Shift excess down one row so it is added to the following event
  mutate_at(vars(excess), lag, default = 0) %>%
  mutate(duration = excess + duration1) %>%
  group_by(timestamp_hour, endtimestamp_hour) %>%
  summarise(duration = sum(duration))

Data

library(tidyverse)
library(lubridate)

tt <- 'timestamp,        endtimestamp 
2018-03-27 10:00:27, 2018-03-27 10:07:27 
2018-03-27 10:27:28, 2018-03-27 10:37:58 
2018-03-27 10:52:59, 2018-03-27 11:01:29 
2018-03-27 11:17:59, 2018-03-27 11:27:00 
2018-03-27 12:03:29, 2018-03-27 12:15:59 
2018-03-27 12:51:00, 2018-03-27 13:01:30 
2018-03-27 13:18:31, 2018-03-27 13:26:01 
2018-03-27 13:42:56, 2018-03-27 13:50:56 
2018-03-27 14:08:26, 2018-03-27 14:21:27 
2018-03-27 14:36:02, 2018-03-27 14:43:58' 


df <- read.table(text = tt, header = T, sep = ',')

df <- df %>% mutate(
  timestamp = as.POSIXct(timestamp),
  endtimestamp = as.POSIXct(endtimestamp)
)

Output

# A tibble: 5 x 3
# Groups:   timestamp_hour [?]
  timestamp_hour      endtimestamp_hour   duration
  <dttm>              <dttm>                 <dbl>
1 2018-03-27 10:00:00.000 2018-03-27 11:00:00.000    1471.
2 2018-03-27 11:00:00.000 2018-03-27 12:00:00.000     630.
3 2018-03-27 12:00:00.000 2018-03-27 13:00:00.000    1290.
4 2018-03-27 13:00:00.000 2018-03-27 14:00:00.000    1020.
5 2018-03-27 14:00:00.000 2018-03-27 15:00:00.000    1257.
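One caveat worth keeping in mind: the lag-based pipeline appears to rely on a boundary-crossing event starting in the second half of its hour (because of round_date) and on the following row falling in the next hour. A more general alternative, sketched here in base R rather than dplyr, is to compute the overlap between every event and every hourly range directly. The helper name `hourly_overlap` is hypothetical, and the sketch assumes UTC times (or any zone with a whole-hour offset), since it finds hour boundaries by flooring epoch seconds.

```r
# Event start/end times copied from the question (UTC).
start <- as.POSIXct(c(
  "2018-03-27 10:00:27", "2018-03-27 10:27:28", "2018-03-27 10:52:59",
  "2018-03-27 11:17:59", "2018-03-27 12:03:29", "2018-03-27 12:51:00",
  "2018-03-27 13:18:31", "2018-03-27 13:42:56", "2018-03-27 14:08:26",
  "2018-03-27 14:36:02"), tz = "UTC")
end <- as.POSIXct(c(
  "2018-03-27 10:07:27", "2018-03-27 10:37:58", "2018-03-27 11:01:29",
  "2018-03-27 11:27:00", "2018-03-27 12:15:59", "2018-03-27 13:01:30",
  "2018-03-27 13:26:01", "2018-03-27 13:50:56", "2018-03-27 14:21:27",
  "2018-03-27 14:43:58"), tz = "UTC")

# Hypothetical helper: seconds of overlap between each event and each
# hourly range, summed per range. Assumes whole-hour UTC offsets.
hourly_overlap <- function(start, end) {
  s <- as.numeric(start)
  e <- as.numeric(end)
  # Left edges of the hourly ranges, in seconds since the epoch.
  bins <- seq(floor(min(s) / 3600) * 3600,
              floor(max(e) / 3600) * 3600,
              by = 3600)
  # Overlap of [s, e] with [b, b + 3600], floored at zero, summed over events.
  total <- vapply(bins, function(b) {
    sum(pmax(0, pmin(e, b + 3600) - pmax(s, b)))
  }, numeric(1))
  data.frame(
    starttimestamp = as.POSIXct(bins, origin = "1970-01-01", tz = "UTC"),
    endtimestamp   = as.POSIXct(bins + 3600, origin = "1970-01-01", tz = "UTC"),
    duration       = total
  )
}

res <- hourly_overlap(start, end)
res$duration  # 1471 630 1290 1020 1257
```

Hours with no events at all show up with a duration of 0 in this version, which may or may not be what you want. It also compares every event against every range, so it is fine for small data but not tuned for large tables.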

Answer 1 (score: 1)

I would probably do...

library(data.table)
setDT(test)

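durDT = test[, {
  # Hourly ranges touched by this event, from its first hour to its last
  hr  = seq(trunc(timestamp, "hour"), trunc(endtimestamp, "hour"), by="hour")
  # Start by assuming every range is fully covered (3600 seconds)
  dur = structure(rep(3600, length(hr)), units="secs", class="difftime")

  n = length(hr)
  if (n==1){
    # Event stays within one hour: use its full duration
    dur = difftime(endtimestamp, timestamp, units = "secs")
  } else {
    # Trim the first and last ranges to the part the event covers
    dur[1] <- difftime(hr[1] + 3600, timestamp, units = "secs")
    dur[n] <- difftime(endtimestamp, hr[n], units = "secs")
  }
  .(hr = hr, dur = dur)
}, by=IDX]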

durDT[, .(total_dur = sum(dur)), by=hr]

which gives

> durDT
    IDX                  hr      dur
 1:   1 2018-03-27 06:00:00 420 secs
 2:   2 2018-03-27 06:00:00 630 secs
 3:   3 2018-03-27 06:00:00 421 secs
 4:   3 2018-03-27 07:00:00  89 secs
 5:   4 2018-03-27 07:00:00 541 secs
 6:   5 2018-03-27 08:00:00 750 secs
 7:   6 2018-03-27 08:00:00 540 secs
 8:   6 2018-03-27 09:00:00  90 secs
 9:   7 2018-03-27 09:00:00 450 secs
10:   8 2018-03-27 09:00:00 480 secs
11:   9 2018-03-27 10:00:00 781 secs
12:  10 2018-03-27 10:00:00 476 secs

> durDT[, .(total_dur = sum(dur)), by=hr]
                    hr total_dur
1: 2018-03-27 06:00:00 1471 secs
2: 2018-03-27 07:00:00  630 secs
3: 2018-03-27 08:00:00 1290 secs
4: 2018-03-27 09:00:00 1020 secs
5: 2018-03-27 10:00:00 1257 secs

This code should also work for data where an event spans more than two hours (though the OP's example does not include that case).
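To illustrate the multi-hour case, here is a small base R sketch that mirrors the same per-event logic (full 3600-second middle hours, trimmed first and last hours) without data.table. The function name `split_event` and the 10:30 to 13:15 event are made up for the example, and hour boundaries are found by flooring epoch seconds, which assumes UTC.

```r
# Hypothetical helper: split one event into per-hour pieces (UTC assumed).
split_event <- function(start, end) {
  s <- as.numeric(start)
  e <- as.numeric(end)
  # Left edges of every hour the event touches.
  hr <- seq(floor(s / 3600) * 3600, floor(e / 3600) * 3600, by = 3600)
  n <- length(hr)
  dur <- rep(3600, n)            # middle hours are fully covered
  if (n == 1) {
    dur <- e - s                 # event stays inside a single hour
  } else {
    dur[1] <- hr[1] + 3600 - s   # from the start to the end of the first hour
    dur[n] <- e - hr[n]          # from the last hour boundary to the end
  }
  data.frame(hr = as.POSIXct(hr, origin = "1970-01-01", tz = "UTC"),
             dur = dur)
}

# An invented event that crosses three hour boundaries.
pieces <- split_event(as.POSIXct("2018-03-27 10:30:00", tz = "UTC"),
                      as.POSIXct("2018-03-27 13:15:00", tz = "UTC"))
pieces$dur       # 1800 3600 3600 900
sum(pieces$dur)  # 9900, the full 2 h 45 min
```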

The times are shifted relative to the OP's because I am in a different time zone or something.
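A likely explanation for the shift (my assumption, not something the answer confirms): in R versions before 4.1.0, combining POSIXct values with c() drops the "tzone" attribute, so the columns built in the question print in the local time zone of whoever runs the code. The underlying instants are unchanged, only the display differs. A minimal sketch of checking and pinning the display zone follows, with "America/New_York" used purely as an example of a UTC-4 zone on that date.

```r
x <- as.POSIXct("2018-03-27T10:00:27Z",
                format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

# The same instant rendered in two zones; only the clock time differs.
format(x, tz = "UTC")               # "2018-03-27 10:00:27"
format(x, tz = "America/New_York")  # "2018-03-27 06:00:27"

# Combining values with c() may or may not keep the zone, depending on
# the R version, so set the attribute explicitly afterwards.
ts <- c(x, x + 60)
attr(ts, "tzone") <- "UTC"
format(ts)  # "2018-03-27 10:00:27" "2018-03-27 10:01:27"
```

Applying the same `attr(..., "tzone") <- "UTC"` assignment to test$timestamp and test$endtimestamp before running either answer should make the hourly ranges print as 10:00 to 15:00, matching the question.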