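I have a dataset like this (covering the whole of January 2016). An analysis is run every three hours, and each run produces forecast values up to 54 hours ahead.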
new_data_frame <-
analysis_time forecast_time value
2016-01-01 00:00:00 2016-01-01 00:00:00 26918.20
2016-01-01 00:00:00 2016-01-01 01:00:00 28769.50
2016-01-01 00:00:00 2016-01-01 02:00:00 46015.40
2016-01-01 00:00:00 2016-01-01 03:00:00 50000.00
2016-01-01 00:00:00 2016-01-01 04:00:00 45705.50
2016-01-01 00:00:00 2016-01-01 05:00:00 23097.60
..lot of values..
2016-01-01 03:00:00 2016-01-01 03:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 04:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 05:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 06:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 07:00:00 50000.00
..lot of values..
2016-01-01 06:00:00 2016-01-01 06:00:00 14594.500
2016-01-01 06:00:00 2016-01-01 07:00:00 50000.00
2016-01-01 06:00:00 2016-01-01 08:00:00 50000.00
2016-01-01 06:00:00 2016-01-01 09:00:00 50000.00
......
> dput(head(new_data_frame))
structure(list(station_id = c(8L, 8L, 8L, 8L, 8L, 8L), analysis_time = structure(c(1451606400,
1451606400, 1451606400, 1451606400, 1451606400, 1451606400), class = c("POSIXct",
"POSIXt"), tzone = "GMT"), forecast_time = structure(c(1451606400,
1451610000, 1451613600, 1451617200, 1451620800, 1451624400), class = c("POSIXct",
"POSIXt"), tzone = "GMT"), model_id = c(5L, 5L, 5L, 5L, 5L, 5L
), parameter_id = c(407L, 407L, 407L, 407L, 407L, 407L), value = c(26918.2,
28769.5, 46015.4, 50000, 45705.5, 23097.6)), .Names = c("station_id",
"analysis_time", "forecast_time", "model_id", "parameter_id",
"value"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
Because some values are missing from the dataset, I have to build a time vector from 2016-01-01 to 2016-02-01 so that the missing values can be marked as NA.
table <-
analysis_time
2016-01-01 00:00:00
2016-01-01 00:00:00
2016-01-01 00:00:00
2016-01-01 03:00:00
2016-01-01 03:00:00
2016-01-01 03:00:00
2016-01-01 06:00:00
2016-01-01 06:00:00
2016-01-01 06:00:00
2016-01-01 09:00:00
2016-01-01 09:00:00
2016-01-01 09:00:00
.....
Then, for each new analysis time, I always want to pick up the first three values.
So the result should look like this:
analysis_time forecast_time value
2016-01-01 00:00:00 2016-01-01 00:00:00 26918.20
2016-01-01 00:00:00 2016-01-01 01:00:00 28769.50
2016-01-01 00:00:00 2016-01-01 02:00:00 46015.40
2016-01-01 03:00:00 2016-01-01 03:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 04:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 05:00:00 50000.00
2016-01-01 06:00:00 2016-01-01 06:00:00 14594.500
2016-01-01 06:00:00 2016-01-01 07:00:00 50000.00
2016-01-01 06:00:00 2016-01-01 08:00:00 50000.00
My code currently looks like this:
table[,3] <- new_data_frame$value[match(table$analysis_time,new_data_frame$analysis_time)]
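But it now gives me the same value for every row that shares an analysis time, which is not what I want. How can I pick up the successive values, as in the table above?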
2016-01-01 00:00:00 2016-01-01 00:00:00 26918.200
2016-01-01 00:00:00 2016-01-01 01:00:00 26918.200
2016-01-01 00:00:00 2016-01-01 02:00:00 26918.200
2016-01-01 03:00:00 2016-01-01 03:00:00 50000.000
2016-01-01 03:00:00 2016-01-01 04:00:00 50000.000
2016-01-01 03:00:00 2016-01-01 05:00:00 50000.000
2016-01-01 06:00:00 2016-01-01 06:00:00 14594.500
2016-01-01 06:00:00 2016-01-01 07:00:00 14594.500
2016-01-01 06:00:00 2016-01-01 08:00:00 14594.500
Answer 0 (score: 0)
If I understand the OP's intention correctly, the request is to return the first 3 rows of forecast values belonging to each analysis_time.
This can be solved by ordering new_data_frame by analysis_time and forecast_time, grouping by analysis_time, and then picking the first three rows of each group.
This can be done in base R, dplyr, or data.table, which I prefer:
library(data.table)
setDT(new_data_frame)[order(analysis_time, forecast_time)][
, head(.SD, 3L), by = analysis_time]
         analysis_time       forecast_time   value
1: 2016-01-01 00:00:00 2016-01-01 00:00:00 26918.2
2: 2016-01-01 00:00:00 2016-01-01 01:00:00 28769.5
3: 2016-01-01 00:00:00 2016-01-01 02:00:00 46015.4
4: 2016-01-01 03:00:00 2016-01-01 03:00:00 50000.0
5: 2016-01-01 03:00:00 2016-01-01 04:00:00 50000.0
6: 2016-01-01 03:00:00 2016-01-01 05:00:00 50000.0
7: 2016-01-01 06:00:00 2016-01-01 06:00:00 14594.5
8: 2016-01-01 06:00:00 2016-01-01 07:00:00 50000.0
9: 2016-01-01 06:00:00 2016-01-01 08:00:00 50000.0
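Since base R is mentioned as an alternative, here is a minimal sketch of the same order / group / take-first-three idea without data.table. The small data frame df is a made-up stand-in for new_data_frame (two analysis times with four forecasts each, values borrowed from the question), and the names rank_in_group and first3 are only illustrative.

```r
# Hypothetical stand-in for new_data_frame: 2 analysis times x 4 forecasts.
analysis_time <- rep(as.POSIXct(c("2016-01-01 00:00:00",
                                  "2016-01-01 03:00:00"), tz = "GMT"),
                     each = 4)
df <- data.frame(
  analysis_time = analysis_time,
  forecast_time = analysis_time + 3600 * rep(0:3, times = 2),
  value = c(26918.2, 28769.5, 46015.4, 50000, 50000, 50000, 50000, 50000)
)

# Order by analysis time, then by forecast time.
df <- df[order(df$analysis_time, df$forecast_time), ]

# Number the rows within each analysis_time group (1, 2, 3, ...).
# The grouping key is converted to numeric only to keep the factor simple.
rank_in_group <- ave(seq_len(nrow(df)), as.numeric(df$analysis_time),
                     FUN = seq_along)

# Keep the first three rows of every group.
first3 <- df[rank_in_group <= 3, ]
print(first3)
```

Unlike match(), which only ever returns the position of the first matching row, the within-group row number keeps the successive forecasts of each analysis time apart.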
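Now, the OP has mentioned that some values are missing from the dataset and that these gaps need to be made visible.
This can be achieved by joining with a sequence of timestamps that covers the whole period: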
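library(lubridate)
# use the same time zone as the data (GMT) so that the join keys line up
start_time <- as.POSIXct("2016-01-01", tz = "GMT")
time_seq <- seq(start_time, start_time + months(1L), by = "3 hours")
# first three forecasts per analysis time, then join onto the full sequence
result <- setDT(new_data_frame)[order(analysis_time, forecast_time)][
  , head(.SD, 3L), by = analysis_time][
    .(time_seq), on = .(analysis_time = V1)]
head(result, 12L)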
          analysis_time       forecast_time   value
 1: 2016-01-01 00:00:00 2016-01-01 00:00:00 26918.2
 2: 2016-01-01 00:00:00 2016-01-01 01:00:00 28769.5
 3: 2016-01-01 00:00:00 2016-01-01 02:00:00 46015.4
 4: 2016-01-01 03:00:00 2016-01-01 03:00:00 50000.0
 5: 2016-01-01 03:00:00 2016-01-01 04:00:00 50000.0
 6: 2016-01-01 03:00:00 2016-01-01 05:00:00 50000.0
 7: 2016-01-01 06:00:00 2016-01-01 06:00:00 14594.5
 8: 2016-01-01 06:00:00 2016-01-01 07:00:00 50000.0
 9: 2016-01-01 06:00:00 2016-01-01 08:00:00 50000.0
10: 2016-01-01 09:00:00                <NA>      NA
11: 2016-01-01 12:00:00                <NA>      NA
12: 2016-01-01 15:00:00                <NA>      NA
Analysis times that do not appear in new_data_frame are now each represented by a single NA row.
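For readers without data.table or lubridate, the gap-filling join can also be sketched in base R with merge(). The data below are invented for illustration: first3 stands for the already reduced table, the 03:00 and 09:00 analysis runs are assumed to be missing, and grid and filled are hypothetical names.

```r
# Hypothetical reduced table: analysis times 00:00 and 06:00 are present,
# 03:00 and 09:00 are assumed to be missing.
first3 <- data.frame(
  analysis_time = rep(as.POSIXct(c("2016-01-01 00:00:00",
                                   "2016-01-01 06:00:00"), tz = "GMT"),
                      each = 3),
  value = c(26918.2, 28769.5, 46015.4, 14594.5, 50000, 50000)
)

# Complete 3-hourly grid of analysis times for the (shortened) period.
grid <- data.frame(
  analysis_time = seq(as.POSIXct("2016-01-01 00:00:00", tz = "GMT"),
                      as.POSIXct("2016-01-01 09:00:00", tz = "GMT"),
                      by = "3 hours")
)

# Left join: keep every grid time; unmatched times get NA in value.
filled <- merge(grid, first3, by = "analysis_time", all.x = TRUE)
print(filled)
```

As with the data.table join, each missing analysis time shows up as one NA row, not three.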