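I have a large datetime column from continuous monitoring that I need to split into half-hour periods.
I tried some R `data.table` code to split it, but there is still a problem at the transitions from one period to the next.
The `df` data table below is a minimal toy example of this data.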
library(data.table)
library(lubridate)
driver = rep(c("foo", "bar"), each = 10L)
dt = ymd_hm(c(
"2015-05-27 07:11", "2015-05-27 07:25", "2015-05-27 07:35",
"2015-05-27 07:42", "2015-05-27 07:53",
"2015-05-27 08:09", "2015-05-27 08:23", "2015-05-27 08:39",
"2015-05-27 08:52", "2015-05-27 09:12",
"2015-05-27 16:12", "2015-05-27 16:31", "2015-05-27 16:39",
"2015-05-27 16:53", "2015-05-27 17:29",
"2015-05-27 17:41", "2015-05-27 17:58", "2015-05-27 18:09",
"2015-05-27 18:23", "2015-05-27 18:42")
)
df = data.table(driver, dt)
I tried the following code to split them:
df[,diff := as.integer(difftime(dt, shift(dt, 1), units = "mins")),
by = driver]
df[, diff := {diff[1] = 0L; diff}, driver]
df[,cum_mins := cumsum(diff), driver]
df[,cum_halfhour := round(cum_mins/30, 3), driver]
df[,flag := floor(cum_halfhour), driver]
The resulting table is:
> df
driver dt diff cum_mins cum_halfhour flag
1: foo 2015-05-27 07:11:00 0 0 0.000 0
2: foo 2015-05-27 07:25:00 14 14 0.467 0
3: foo 2015-05-27 07:35:00 10 24 0.800 0
4: foo 2015-05-27 07:42:00 7 31 1.033 1
5: foo 2015-05-27 07:53:00 11 42 1.400 1
6: foo 2015-05-27 08:09:00 16 58 1.933 1
7: foo 2015-05-27 08:23:00 14 72 2.400 2
8: foo 2015-05-27 08:39:00 16 88 2.933 2
9: foo 2015-05-27 08:52:00 13 101 3.367 3
10: foo 2015-05-27 09:12:00 20 121 4.033 4
11: bar 2015-05-27 16:12:00 0 0 0.000 0
12: bar 2015-05-27 16:31:00 19 19 0.633 0
13: bar 2015-05-27 16:39:00 8 27 0.900 0
14: bar 2015-05-27 16:53:00 14 41 1.367 1
15: bar 2015-05-27 17:29:00 36 77 2.567 2
16: bar 2015-05-27 17:41:00 12 89 2.967 2
17: bar 2015-05-27 17:58:00 17 106 3.533 3
18: bar 2015-05-27 18:09:00 11 117 3.900 3
19: bar 2015-05-27 18:23:00 14 131 4.367 4
20: bar 2015-05-27 18:42:00 19 150 5.000 5
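I want something like the `flag` column, but not exactly. The problem occurs at the rows where `flag` changes. For example, at rows 3 and 4, I would like the algorithm to flag row 4 as 0, because row 4 is closer to the half-hour point than row 3 (`cum_mins` is 31 for row 4 versus 24 for row 3). The same problem occurs at rows 9 and 10.
The problem with the current algorithm is that it always floors, so a new period starts only once a full 30 minutes has elapsed. In practice the time intervals are irregular, so it is more reasonable to place the cut at the row nearest to each 30-minute point, as in the rows 3 and 4 example above.
The solution may be simple, but I could not come up with it. Any suggestions on how to implement this algorithm? Thanks!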
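To make "nearest" concrete, here is a minimal base R sketch using the `cum_mins` values for driver foo from the table above. The names `nearest_point` and `dist_to_point` are made up for illustration; the sketch only measures how far each row is from its closest multiple of 30 minutes and does not assign flags.

```r
# cum_mins for driver "foo", copied from the table above
cum_mins <- c(0, 14, 24, 31, 42, 58, 72, 88, 101, 121)

# Closest multiple of 30 minutes for each row (illustrative name)
nearest_point <- 30 * round(cum_mins / 30)

# Absolute distance, in minutes, from each row to that point
dist_to_point <- abs(cum_mins - nearest_point)

data.frame(row = seq_along(cum_mins), cum_mins, nearest_point, dist_to_point)
```

Row 4 is 1 minute from the 30-minute point while row 3 is 6 minutes away, which is why row 4 looks like the more natural place to close the first period.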
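Answer 0 (score: 1)
On second thought, a rolling join really is not needed here.
First, generate the data (there is no real need for `lubridate` here; `as.POSIXct` with the correct format string works fine).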
library(data.table)
driver = rep(c("foo", "bar"), each = 10L)
dt = as.POSIXct(c(
"2015-05-27 07:11", "2015-05-27 07:25", "2015-05-27 07:35",
"2015-05-27 07:42", "2015-05-27 07:53",
"2015-05-27 08:09", "2015-05-27 08:23", "2015-05-27 08:39",
"2015-05-27 08:52", "2015-05-27 09:12",
"2015-05-27 16:12", "2015-05-27 16:31", "2015-05-27 16:39",
"2015-05-27 16:53", "2015-05-27 17:29",
"2015-05-27 17:41", "2015-05-27 17:58", "2015-05-27 18:09",
"2015-05-27 18:23", "2015-05-27 18:42")
, format = "%F %H:%M", tz = "America/Chicago")
df = data.table(driver, dt)
The following should do what you want:
## Create a column with epoch time so we don't have to worry about
## some of the idiosyncrasies of the R `difftime` class
df[,dt_epoch := as.integer(dt)]
## Create a cum_halfhour column based on epoch time
df[,cum_halfhour := round((dt_epoch - min(dt_epoch))/1800,3), by = .(driver)]
## Create a rounded version (note: R's `round` sends exact .5 ties to the even integer)
df[,nearest_half := round((dt_epoch - min(dt_epoch))/1800,0), by = .(driver)]
## Create a flag for changes using `data.table::rleid` for each driver
df[,flag := rleid(nearest_half) - 1L, by = .(driver)]
df
# driver dt dt_epoch cum_halfhour nearest_half flag
# 1: foo 2015-05-27 07:11:00 1432728660 0.000 0 0
# 2: foo 2015-05-27 07:25:00 1432729500 0.467 0 0
# 3: foo 2015-05-27 07:35:00 1432730100 0.800 1 1
# 4: foo 2015-05-27 07:42:00 1432730520 1.033 1 1
# 5: foo 2015-05-27 07:53:00 1432731180 1.400 1 1
# 6: foo 2015-05-27 08:09:00 1432732140 1.933 2 2
# 7: foo 2015-05-27 08:23:00 1432732980 2.400 2 2
# 8: foo 2015-05-27 08:39:00 1432733940 2.933 3 3
# 9: foo 2015-05-27 08:52:00 1432734720 3.367 3 3
# 10: foo 2015-05-27 09:12:00 1432735920 4.033 4 4
# 11: bar 2015-05-27 16:12:00 1432761120 0.000 0 0
# 12: bar 2015-05-27 16:31:00 1432762260 0.633 1 1
# 13: bar 2015-05-27 16:39:00 1432762740 0.900 1 1
# 14: bar 2015-05-27 16:53:00 1432763580 1.367 1 1
# 15: bar 2015-05-27 17:29:00 1432765740 2.567 3 2
# 16: bar 2015-05-27 17:41:00 1432766460 2.967 3 2
# 17: bar 2015-05-27 17:58:00 1432767480 3.533 4 3
# 18: bar 2015-05-27 18:09:00 1432768140 3.900 4 3
# 19: bar 2015-05-27 18:23:00 1432768980 4.367 4 3
# 20: bar 2015-05-27 18:42:00 1432770120 5.000 5 4
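Note that rounding assigns each row to its nearest half-hour point, so row 3 (0.800) lands in period 1. The question can also be read as asking that the row nearest each half-hour point close the current period, so that row 4 stays in period 0. Below is a minimal base R sketch under that reading. `flag_nearest_close` and its `width` argument are hypothetical names, not part of `data.table`, and the sketch assumes `cum_mins` is sorted ascending and starts at 0.

```r
# Hypothetical helper: the row nearest each multiple of `width` closes its period,
# and the next row opens a new one. Assumes cum_mins is sorted and starts at 0.
flag_nearest_close <- function(cum_mins, width = 30) {
  n <- length(cum_mins)
  # No boundary is reached: everything stays in period 0
  if (max(cum_mins) < width) return(rep(0L, n))
  # Half-hour boundaries covered by the data: width, 2 * width, ...
  boundaries <- seq(width, max(cum_mins), by = width)
  # Index of the row closest to each boundary (first one wins on ties)
  closing <- vapply(boundaries,
                    function(b) which.min(abs(cum_mins - b)),
                    integer(1))
  # A row's flag is the number of closing rows strictly before it
  findInterval(seq_len(n) - 1L, closing)
}

foo_mins <- c(0, 14, 24, 31, 42, 58, 72, 88, 101, 121)
bar_mins <- c(0, 19, 27, 41, 77, 89, 106, 117, 131, 150)
foo_flag <- flag_nearest_close(foo_mins)
bar_flag <- flag_nearest_close(bar_mins)
```

With the table above, this could be applied per driver along the lines of `df[, flag_close := flag_nearest_close((dt_epoch - min(dt_epoch)) / 60), by = .(driver)]`; whether this or the rounding approach is preferable depends on which behavior is actually wanted at the transitions.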
Previously posted (overly complicated) steps:
## Create a column with epoch time so we don't have to worry about
## some of the idiosyncrasies of the R `difftime` class
df[,dt_epoch := as.integer(dt)]
## Create a cum_halfhour column based on epoch time
df[,cum_halfhour := round((dt_epoch - min(dt_epoch))/1800,3), by = .(driver)]
## Create a lookup table with all the possible half hour increments for each driver
Lookup <- df[,.(half_points = seq(from = 0,
to = max(cum_halfhour),
by = 1)), by = .(driver)]
## Create a copy of the target half_points column since the join process
## treats the keys in a way that makes the join columns complicated to access
Lookup[,join_half_points := half_points]
## Set keys on our original table and the Lookup table
setkey(df,driver,cum_halfhour)
setkey(Lookup,driver,join_half_points)
## This one is a doozy. To get an idea of what we're assigning to the
## `half_point` column, run `Lookup[df, roll = "nearest"]`
## to see the table generated by the rolling join. We then pull
## the column `half_points` out of the joined result and assign it to the
## original `df` as a new column,
df[,half_point := Lookup[df,half_points, roll = "nearest"]]
## Create a flag using `data.table::rleid` for each driver
df[,flag := rleid(half_point) - 1L, by = .(driver)]
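For readers unfamiliar with `roll = "nearest"`, the toy sketch below may help show what the rolling join in the steps above does. The tables `grid` and `obs` and their columns are made up for illustration; only `data.table` and its `roll` and `on` arguments are real.

```r
library(data.table)

# Lookup table of whole half-hour points (illustrative data)
grid <- data.table(x = c(0, 1, 2, 3), grid_point = c(0, 1, 2, 3))

# Observed positions on the same scale, e.g. values of cum_halfhour
obs <- data.table(x = c(0.8, 1.033, 2.567))

# For each row of `obs`, pick the row of `grid` whose `x` is nearest
res <- grid[obs, on = "x", roll = "nearest"]
res
```

In the result, `x` holds the observed values and `grid_point` holds the matched half-hour point, which is the same value that rounding produces directly; that is why the join turned out to be unnecessary.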