How to resample a data frame

Time: 2019-09-05 12:27:12

Tags: python r

Here is the data frame:

       time            value
0   01-01-2015 00:00    72
1   01-01-2015 01:00    74
2   01-01-2015 02:00    75
3   01-01-2015 03:00    77
4   01-01-2015 06:00    72

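If I pass this data frame through pandas resampling, it gives me 24 entries, and the output (value) for the missing hours is zero, which is what I want.

Syntax: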

resample_factor="H"

data_frame = data_frame.resample(resample_factor).mean()
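
To make the target output concrete, here is a minimal pandas sketch built on the sample rows above. The column names `time` and `value` are taken from the table; the variable names, and the use of `pd.Timedelta` instead of a frequency string, are assumptions made for illustration. In this sketch `resample` on its own only spans the first to the last observation and leaves gaps as NaN, so the 24 zero-filled rows come from an explicit `reindex` plus `fillna`.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "time": ["01-01-2015 00:00", "01-01-2015 01:00", "01-01-2015 02:00",
             "01-01-2015 03:00", "01-01-2015 06:00"],
    "value": [72, 74, 75, 77, 72],
})
df["time"] = pd.to_datetime(df["time"], format="%d-%m-%Y %H:%M")
series = df.set_index("time")["value"]

# A Timedelta is used as the frequency to avoid depending on alias spelling.
one_hour = pd.Timedelta(hours=1)

# Hourly means: covers 00:00 to 06:00 only, with NaN for 04:00 and 05:00.
hourly = series.resample(one_hour).mean()

# 24 hourly stamps starting at midnight of the first observation's day.
day_start = series.index.min().normalize()
grid = pd.date_range(day_start, periods=24, freq=one_hour)

# Align to the full day and turn every missing hour into zero.
full_day = hourly.reindex(grid).fillna(0)
print(full_day)
```

This 24-row, zero-filled series is the shape of result the R answers below aim to reproduce.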

First of all, I found a couple of related links, but they were not helpful.

Can we do this in R?

If it is possible, please suggest how I can do it.

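2 answers:

Answer 0: (score: 1)

Maybe you are looking for tidyr::complete to fill in the missing times. This creates an hourly sequence covering 24 hours, starting from the first time value.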

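library(dplyr)

df %>%
  # Parse the day-month-year strings into date-times
  mutate(V2 = as.POSIXct(V2, format = "%d-%m-%Y %H:%M")) %>%
  arrange(V2) %>%
  # Add a row for every hour in the 24 hours starting at the first time value;
  # rows that were missing get 0 in V1 and V3
  tidyr::complete(V2 = seq(first(V2), first(V2) + 86400 - (60 * 60), by = "1 hour"),
                  fill = list(V1 = 0, V3 = 0))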


#   V2                     V1    V3
#   <dttm>              <dbl> <dbl>
# 1 2015-01-01 00:00:00     0    72
# 2 2015-01-01 01:00:00     1    74
# 3 2015-01-01 02:00:00     2    75
# 4 2015-01-01 03:00:00     3    77
# 5 2015-01-01 04:00:00     0     0
# 6 2015-01-01 05:00:00     0     0
# 7 2015-01-01 06:00:00     4    72
# 8 2015-01-01 07:00:00     0     0
# 9 2015-01-01 08:00:00     0     0
#10 2015-01-01 09:00:00     0     0
# … with 14 more rows

If the times do not start at 00:00, we can extract the date from the date-time and create a 24-hour sequence from it.

df %>%
  mutate(V2 = as.POSIXct(V2, format = "%d-%m-%Y %H:%M", tz = "GMT")) %>%
  # Start the sequence at midnight of the first observation's date
  tidyr::complete(V2 = seq(as.POSIXct(as.Date(first(V2))), by = "1 hour",
                           length.out = 24),
                  fill = list(V1 = 0, V3 = 0))
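
For readers coming from the Python side of the question, the same "anchor the grid at midnight" idea can be sketched with the standard library only. The helper name `hourly_grid_for_day` and the two readings are hypothetical, chosen so that the first observation is not at 00:00.

```python
from datetime import datetime, time, timedelta


def hourly_grid_for_day(first_stamp):
    """Return 24 hourly datetimes starting at midnight of first_stamp's date."""
    midnight = datetime.combine(first_stamp.date(), time.min)
    return [midnight + timedelta(hours=h) for h in range(24)]


# Hypothetical readings that start at 03:00 rather than 00:00.
readings = {
    datetime(2015, 1, 1, 3, 0): 77,
    datetime(2015, 1, 1, 6, 0): 72,
}

# Build the grid from the earliest reading, then fill missing hours with 0.
grid = hourly_grid_for_day(min(readings))
filled = [(stamp, readings.get(stamp, 0)) for stamp in grid]

for stamp, value in filled:
    print(stamp, value)
```

Because the grid is derived from the date rather than from the first timestamp, the hours before the first reading are also present and filled with zero.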

Data

df <- structure(list(V1 = 0:4, V2 = structure(1:5, .Label = c("01-01-201500:00", 
"01-01-201501:00", "01-01-201502:00", "01-01-201503:00", "01-01-201506:00"
), class = "factor"), V3 = c(72L, 74L, 75L, 77L, 72L)), class = 
"data.frame", row.names = c(NA, -5L))

Answer 1: (score: 1)

Here is a base R idea,

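# Midnight of the first observation's date. Only the date part is parsed
# (trailing characters are ignored), so this also works when the first
# reading is not at 00:00.
day_start <- as.POSIXct(dd$V2[1], format = '%d-%m-%Y')

# 24 hourly time stamps: midnight plus 23 further hours
dates1 <- seq(day_start, day_start + 23 * 3600, by = '1 hour')

# Outer join: hours with no observation get NA in V1 and V3
merge(transform(dd, V2 = as.POSIXct(V2, format = '%d-%m-%Y %H:%M')),
      data.frame(V2 = dates1),
      by = 'V2', all = TRUE)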

which gives,

                    V2 V1 V3
1  2015-01-01 00:00:00  0 72
2  2015-01-01 01:00:00  1 74
3  2015-01-01 02:00:00  2 75
4  2015-01-01 03:00:00  3 77
5  2015-01-01 04:00:00 NA NA
6  2015-01-01 05:00:00 NA NA
7  2015-01-01 06:00:00  4 72
8  2015-01-01 07:00:00 NA NA
9  2015-01-01 08:00:00 NA NA
10 2015-01-01 09:00:00 NA NA
11 2015-01-01 10:00:00 NA NA
12 2015-01-01 11:00:00 NA NA
13 2015-01-01 12:00:00 NA NA
14 2015-01-01 13:00:00 NA NA
15 2015-01-01 14:00:00 NA NA
16 2015-01-01 15:00:00 NA NA
17 2015-01-01 16:00:00 NA NA
18 2015-01-01 17:00:00 NA NA
19 2015-01-01 18:00:00 NA NA
20 2015-01-01 19:00:00 NA NA
21 2015-01-01 20:00:00 NA NA
22 2015-01-01 21:00:00 NA NA
23 2015-01-01 22:00:00 NA NA
24 2015-01-01 23:00:00 NA NA

Note: you can replace the NA values as usual.

Data

dput(dd)
structure(list(V1 = 0:4, V2 = c("01-01-2015 00:00", "01-01-2015 01:00", 
"01-01-2015 02:00", "01-01-2015 03:00", "01-01-2015 06:00"), 
    V3 = c(72L, 74L, 75L, 77L, 72L)), row.names = c(NA, -5L), class = "data.frame")