Computing time differences by group with data.table mixes up the units

Time: 2019-07-10 19:01:21

Tags: r data.table lubridate

The goal is to compute the time between events, grouped by id. Here is an example:

library(data.table)
library(lubridate)

dt <- data.table(id = c(1,1:3), 
                 start = c("2015-01-01 12:00:00", "2015-12-01 12:00:00", "2019-01-01 12:00:00", NA),
                 end = c("2016-01-01 12:00:01", "2016-01-01 12:00:01", "2019-01-01 12:00:01", "2019-01-01 12:00:02"))

dt[, start := ymd_hms(start)]
dt[, end := ymd_hms(end)]

dt[, time_diff_1 := min(end) - max(start), by = .(id)]
dt[, time_diff_2 := end - start]

The result is:

   id               start                 end   time_diff_1   time_diff_2
1:  1 2015-01-01 12:00:00 2016-01-01 12:00:01 31.00001 secs 31536001 secs
2:  1 2015-12-01 12:00:00 2016-01-01 12:00:01 31.00001 secs  2678401 secs
3:  2 2019-01-01 12:00:00 2019-01-01 12:00:01  1.00000 secs        1 secs
4:  3                <NA> 2019-01-01 12:00:02       NA secs       NA secs

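Both time_diff_1 and time_diff_2 are displayed with secs as the unit. However, time_diff_1, which is computed by group, mixes units: for id == 1 the value 31.00001 is really 31 days and one second, yet it is labeled as seconds. It seems the unit is chosen automatically for each group and then overwritten.

Any hints on how to solve this?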
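The per-group unit selection can be reproduced outside data.table. The following is a minimal base R sketch (the timestamps and the names t0, short and long are made up for illustration): subtracting POSIXct values picks the unit automatically from the size of the difference, so two results can carry different units.

```r
# Base R only: `-` on POSIXct values returns a difftime whose unit is
# chosen automatically from the magnitude of the difference.
t0    <- as.POSIXct("2019-01-01 12:00:00", tz = "UTC")
short <- as.POSIXct("2019-01-01 12:00:01", tz = "UTC") - t0  # 1 second apart
long  <- as.POSIXct("2019-02-01 12:00:01", tz = "UTC") - t0  # 31 days + 1 second apart

units(short)  # "secs"
units(long)   # "days"

# The bare numbers are only comparable once both are expressed in one unit.
as.numeric(short)                 # 1
as.numeric(long, units = "secs")  # 2678401, up to floating-point rounding
```

If values like these are combined into one column under a single units attribute, a number measured in days ends up labeled as seconds, which matches what time_diff_1 shows.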

1 answer:

Answer 0 (score: 0)

With the difftime() function, the units can be set explicitly, e.g.

dt[, time_diff_3 := difftime(min(end), max(start), units = "secs"), by = .(id)]

which gives

   id               start                 end   time_diff_1   time_diff_2  time_diff_3
1:  1 2015-01-01 12:00:00 2016-01-01 12:00:01 31.00001 secs 31536001 secs 2678401 secs
2:  1 2015-12-01 12:00:00 2016-01-01 12:00:01 31.00001 secs  2678401 secs 2678401 secs
3:  2 2019-01-01 12:00:00 2019-01-01 12:00:01  1.00000 secs        1 secs       1 secs
4:  3                <NA> 2019-01-01 12:00:02       NA secs       NA secs      NA secs

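with the expected result in column time_diff_3.

However, there may still be room for improvement in how data.table silently overwrites the units after a grouped computation. The result had me scratching my head for a while before I realized that the units had been mixed up.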
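A possible variation, offered as a sketch rather than as part of the original answer: if the difftime class is not needed later, the grouped result can be stored as plain numeric seconds, so the column carries no units attribute at all. The column name diff_secs is hypothetical, and the data is a reduced copy of the question's example built with base as.POSIXct() instead of lubridate.

```r
library(data.table)

# Reduced copy of the example data (the row with the NA start is left out).
dt <- data.table(
  id    = c(1, 1, 2),
  start = as.POSIXct(c("2015-01-01 12:00:00", "2015-12-01 12:00:00",
                       "2019-01-01 12:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2016-01-01 12:00:01", "2016-01-01 12:00:01",
                       "2019-01-01 12:00:01"), tz = "UTC")
)

# Fix the unit inside difftime(), then drop the difftime class with
# as.numeric() so the grouped assignment only ever sees plain numbers.
# `diff_secs` is a hypothetical column name.
dt[, diff_secs := as.numeric(difftime(min(end), max(start), units = "secs")),
   by = id]
```

The trade-off is that the unit is no longer stored with the data, so it has to be recorded in the column name or in documentation.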