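Match on time and pick up different values from a data set

Time: 2017-11-13 11:29:07

Tags: r dataframe

I have a data set like this (covering the whole of January 2016). An analysis is run every three hours, and each run produces forecast values up to 54 hours ahead.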

new_data_frame <-
analysis_time       forecast_time       value
2016-01-01 00:00:00 2016-01-01 00:00:00 26918.20
2016-01-01 00:00:00 2016-01-01 01:00:00 28769.50
2016-01-01 00:00:00 2016-01-01 02:00:00 46015.40
2016-01-01 00:00:00 2016-01-01 03:00:00 50000.00
2016-01-01 00:00:00 2016-01-01 04:00:00 45705.50
2016-01-01 00:00:00 2016-01-01 05:00:00 23097.60
..lot of values..
2016-01-01 03:00:00 2016-01-01 03:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 04:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 05:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 06:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 07:00:00 50000.00
..lot of values..
2016-01-01 06:00:00 2016-01-01 06:00:00 14594.500
2016-01-01 06:00:00 2016-01-01 07:00:00 50000.00
2016-01-01 06:00:00 2016-01-01 08:00:00 50000.00
2016-01-01 06:00:00 2016-01-01 09:00:00 50000.00
......

> dput(head(new_data_frame))
structure(list(station_id = c(8L, 8L, 8L, 8L, 8L, 8L), analysis_time = structure(c(1451606400, 
1451606400, 1451606400, 1451606400, 1451606400, 1451606400), class = c("POSIXct", 
"POSIXt"), tzone = "GMT"), forecast_time = structure(c(1451606400, 
1451610000, 1451613600, 1451617200, 1451620800, 1451624400), class = c("POSIXct", 
"POSIXt"), tzone = "GMT"), model_id = c(5L, 5L, 5L, 5L, 5L, 5L
), parameter_id = c(407L, 407L, 407L, 407L, 407L, 407L), value = c(26918.2, 
28769.5, 46015.4, 50000, 45705.5, 23097.6)), .Names = c("station_id", 
"analysis_time", "forecast_time", "model_id", "parameter_id", 
"value"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")

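Because some values are missing from the data set, I had to build a vector of times (so that the missing values can be marked as NA) running from 2016-01-01 to 2016-02-01.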

table <-
analysis_time
2016-01-01 00:00:00
2016-01-01 00:00:00
2016-01-01 00:00:00
2016-01-01 03:00:00
2016-01-01 03:00:00
2016-01-01 03:00:00
2016-01-01 06:00:00
2016-01-01 06:00:00
2016-01-01 06:00:00
2016-01-01 09:00:00
2016-01-01 09:00:00
2016-01-01 09:00:00
.....

Then, for each new analysis time, I always want to pick up the first three values.

So it should look like this:

analysis_time       forecast_time       value
2016-01-01 00:00:00 2016-01-01 00:00:00 26918.20
2016-01-01 00:00:00 2016-01-01 01:00:00 28769.50
2016-01-01 00:00:00 2016-01-01 02:00:00 46015.40
2016-01-01 03:00:00 2016-01-01 03:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 04:00:00 50000.00
2016-01-01 03:00:00 2016-01-01 05:00:00 50000.00
2016-01-01 06:00:00 2016-01-01 06:00:00 14594.500
2016-01-01 06:00:00 2016-01-01 07:00:00 50000.00
2016-01-01 06:00:00 2016-01-01 08:00:00 50000.00

My code currently looks like this:

table[,3] <- new_data_frame$value[match(table$analysis_time,new_data_frame$analysis_time)]

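But it now gives me the same value for every row that shares an analysis time, which is not what I want. How can I pick up the successive values as shown above?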

2016-01-01 00:00:00 2016-01-01 00:00:00 26918.200
2016-01-01 00:00:00 2016-01-01 01:00:00 26918.200
2016-01-01 00:00:00 2016-01-01 02:00:00 26918.200
2016-01-01 03:00:00 2016-01-01 03:00:00 50000.000
2016-01-01 03:00:00 2016-01-01 04:00:00 50000.000
2016-01-01 03:00:00 2016-01-01 05:00:00 50000.000
2016-01-01 06:00:00 2016-01-01 06:00:00 14594.500
2016-01-01 06:00:00 2016-01-01 07:00:00 14594.500
2016-01-01 06:00:00 2016-01-01 08:00:00 14594.500
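
The repeated values come from match() itself: it returns only the position of the first matching element, so every row of table with the same analysis_time is mapped to the same row of new_data_frame. One possible fix, sketched here under the assumption that table also carries the forecast_time column shown in the output above, is to match on both columns at once. The stand-in data (new_df, tbl) and the helper make_key are hypothetical and only illustrate the idea.

```r
# Stand-in for new_data_frame: two analysis times, three hourly forecasts each
t0 <- as.POSIXct("2016-01-01 00:00:00", tz = "GMT")
new_df <- data.frame(
  analysis_time = t0 + 3600 * c(0, 0, 0, 3, 3, 3),
  forecast_time = t0 + 3600 * c(0, 1, 2, 3, 4, 5),
  value         = c(26918.2, 28769.5, 46015.4, 50000, 50000, 50000)
)

# Stand-in for the padded table; 06:00 has no rows in new_df on purpose
tbl <- data.frame(
  analysis_time = t0 + 3600 * c(0, 0, 0, 3, 3, 3, 6, 6, 6),
  forecast_time = t0 + 3600 * c(0, 1, 2, 3, 4, 5, 6, 7, 8)
)

# match() on analysis_time alone always hits the first row of each group
first_only <- match(tbl$analysis_time, new_df$analysis_time)
print(first_only)  # 1 1 1 4 4 4 NA NA NA

# Hypothetical helper: build one key from both time columns
make_key <- function(a, f) paste(as.numeric(a), as.numeric(f))

idx <- match(make_key(tbl$analysis_time, tbl$forecast_time),
             make_key(new_df$analysis_time, new_df$forecast_time))
tbl$value <- new_df$value[idx]  # rows without a counterpart become NA
print(tbl)
```

With the composite key, each row of tbl finds its own forecast, and the rows for the absent 06:00 analysis time are left as NA.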

1 Answer:

Answer 0: (score: 0)

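If I understand the OP's intention correctly, the request is to return the first 3 forecast rows belonging to each analysis_time.

This can be solved by ordering new_data_frame by analysis_time and forecast_time, grouping by analysis_time, and then picking the first three rows of each group.

This can be done in base R, dplyr, or data.table, which I prefer: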

library(data.table)
setDT(new_data_frame)[order(analysis_time, forecast_time)][
  , head(.SD, 3L), by = analysis_time]
         analysis_time       forecast_time   value
1: 2016-01-01 00:00:00 2016-01-01 00:00:00 26918.2
2: 2016-01-01 00:00:00 2016-01-01 01:00:00 28769.5
3: 2016-01-01 00:00:00 2016-01-01 02:00:00 46015.4
4: 2016-01-01 03:00:00 2016-01-01 03:00:00 50000.0
5: 2016-01-01 03:00:00 2016-01-01 04:00:00 50000.0
6: 2016-01-01 03:00:00 2016-01-01 05:00:00 50000.0
7: 2016-01-01 06:00:00 2016-01-01 06:00:00 14594.5
8: 2016-01-01 06:00:00 2016-01-01 07:00:00 50000.0
9: 2016-01-01 06:00:00 2016-01-01 08:00:00 50000.0
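
For readers without data.table, the same order, group, and pick-first-three steps can be sketched in base R. This is only an illustration, not the code used above: the small data frame df is a hypothetical stand-in for new_data_frame with deliberately shuffled rows, and rank_in_group is a name introduced here.

```r
# Hypothetical stand-in for new_data_frame, rows deliberately out of order
t0 <- as.POSIXct("2016-01-01 00:00:00", tz = "GMT")
df <- data.frame(
  analysis_time = t0 + 3600 * c(3, 0, 0, 3, 0, 3, 0, 3),
  forecast_time = t0 + 3600 * c(6, 3, 0, 3, 2, 5, 1, 4),
  value         = c(50000, 50000, 26918.2, 50000, 46015.4, 50000, 28769.5, 50000)
)

# 1. Order by analysis_time, then forecast_time
df <- df[order(df$analysis_time, df$forecast_time), ]

# 2. Number the rows within each analysis_time group (1, 2, 3, ...)
rank_in_group <- ave(seq_len(nrow(df)), df$analysis_time, FUN = seq_along)

# 3. Keep the first three rows of every group
first_three <- df[rank_in_group <= 3, ]
print(first_three)
```

Sorting first matters: without it, "first three" would mean the first three rows in storage order rather than the three earliest forecasts.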

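Now, the OP has mentioned that some values are missing from the data set and that these gaps need to be made visible.

This can be achieved by joining with a sequence of timestamps that covers the whole period: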

library(lubridate)
# use the same time zone as the data (tzone = "GMT") so the join keys line up
start_time <- as.POSIXct("2016-01-01", tz = "GMT")
time_seq <- seq(start_time, start_time + months(1L), by = "3 hours")
result <- setDT(new_data_frame)[order(analysis_time, forecast_time)][
  , head(.SD, 3L), by = analysis_time][
    .(time_seq), on = .(analysis_time = V1)]
head(result, 12L)
          analysis_time       forecast_time   value
 1: 2016-01-01 00:00:00 2016-01-01 00:00:00 26918.2
 2: 2016-01-01 00:00:00 2016-01-01 01:00:00 28769.5
 3: 2016-01-01 00:00:00 2016-01-01 02:00:00 46015.4
 4: 2016-01-01 03:00:00 2016-01-01 03:00:00 50000.0
 5: 2016-01-01 03:00:00 2016-01-01 04:00:00 50000.0
 6: 2016-01-01 03:00:00 2016-01-01 05:00:00 50000.0
 7: 2016-01-01 06:00:00 2016-01-01 06:00:00 14594.5
 8: 2016-01-01 06:00:00 2016-01-01 07:00:00 50000.0
 9: 2016-01-01 06:00:00 2016-01-01 08:00:00 50000.0
10: 2016-01-01 09:00:00                <NA>      NA
11: 2016-01-01 12:00:00                <NA>      NA
12: 2016-01-01 15:00:00                <NA>      NA

Analysis times that do not appear in new_data_frame are now each represented by a single NA row.
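
The same padding step can also be approximated in base R with a left join via merge(). The sketch below is an assumption-laden illustration rather than the method used above: first_three is a hypothetical stand-in for the per-group result, and the period is shortened to the first nine hours so the example stays small (for the full month, the end of the sequence would be 2016-02-01).

```r
# Hypothetical stand-in for the "first three rows per analysis_time" result
t0 <- as.POSIXct("2016-01-01 00:00:00", tz = "GMT")
first_three <- data.frame(
  analysis_time = t0 + 3600 * c(0, 0, 0, 3, 3, 3),
  forecast_time = t0 + 3600 * c(0, 1, 2, 3, 4, 5),
  value         = c(26918.2, 28769.5, 46015.4, 50000, 50000, 50000)
)

# All expected analysis times: 00:00, 03:00, 06:00, 09:00 (shortened period)
time_seq <- seq(t0, t0 + 9 * 3600, by = "3 hours")

# Left join: keep every expected analysis time, fill the gaps with NA
padded <- merge(data.frame(analysis_time = time_seq), first_three,
                by = "analysis_time", all.x = TRUE)
padded <- padded[order(padded$analysis_time, padded$forecast_time), ]
print(padded)
```

Here 06:00 and 09:00 have no data in the stand-in, so each appears once with NA in forecast_time and value, mirroring the single NA row per missing analysis time in the data.table result.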