Insert and fill rows for missing dates in a list of data frames in R

Time: 2016-04-01 03:41:55

Tags: r datetime merge time-series missing-data

Currently I have several data frames in a list, each in the following format:

             datetime precip code
1 2015-04-15 00:00:00     NA    M
2 2015-04-15 01:00:00     NA    M
3 2015-04-15 02:00:00     NA    M
4 2015-04-15 03:00:00     NA    M
5 2015-04-15 04:00:00     NA    M
6 2015-04-15 05:00:00     NA    M

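Each data frame has different start and end dates, but I want every one of them to run from 2015-04-01 0:00:00 to 2015-11-30 23:59:59. I would like to generate rows for the datetime values missing from each data frame and fill the precip column with NA in those rows, so that each data frame becomes a continuous hourly time series with nrow=5856.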

Ignore the code column. If precip already has values, do not change them; only the rows added for the missing datetime values should be filled with NA.

My attempt so far produces an error:

library(dplyr)
dates <- seq.POSIXt(as.POSIXlt("2015-04-01 0:00:00"), as.POSIXlt("2015-11-30 23:59:59"), by="hour",tz="GMT")
ts <- format.POSIXct(dates,"%Y/%m/%d %H:%M")
df <- data.frame(datetime=ts)
dat=mylist
final_list <- lapply(dat, function(x) full_join(df,dat$precip))

Error in UseMethod("tbl_vars") : 
  no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"

link to sample file in case it is needed

Thanks for any suggestions.

1 answer:

Answer 0: (score: 1)

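As vitor pointed out above, you can only join two data.frames, not a data.frame and a vector. dplyr also works with POSIXct but not POSIXlt (Hadley has preferences), so the join is easier if you store your data as actual times rather than as formatted strings.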
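
To make the type distinction concrete, here is a minimal base R sketch. The variable names (`lt`, `ct`, `txt`, `precip_vec`, `wrapped`) and the precip values are made up for illustration and are not part of the question's data.

```r
# as.POSIXlt stores a time as a list of fields (sec, min, hour, ...)
lt <- as.POSIXlt("2015-04-15 00:00:00", tz = "GMT")
class(lt)         # "POSIXlt" "POSIXt"

# as.POSIXct stores the same instant as seconds since 1970-01-01 UTC
ct <- as.POSIXct(lt)
class(ct)         # "POSIXct" "POSIXt"
as.numeric(ct)    # 1429056000

# format() returns a character string, which will not match a POSIXct column
txt <- format(ct, "%Y/%m/%d %H:%M")
class(txt)        # "character"

# A bare vector cannot be joined; wrap it in a data frame with a key column first
precip_vec <- c(NA, 0.2)    # hypothetical values
wrapped <- data.frame(datetime = ct + c(0, 3600), precip = precip_vec)
is.data.frame(wrapped)      # TRUE
```

In the question, `ts` is built with `format.POSIXct`, so the `datetime` column of `df` is character rather than POSIXct, which is one more reason the keys would not line up.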

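Also, inside lapply you need to use the argument of the function you created (here x); otherwise you are just repeating the same operation on the whole list each time. And if you want to join data.frames, do not subset them down to a single column: both sides need a column with the same name and the same data type.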
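
A small sketch of that point, assuming a made-up list `dfs` of two tiny data frames and a three-hour `grid`. It uses base R's `merge()` instead of dplyr's `full_join()` only to stay dependency-free; the idea is the same.

```r
# Hypothetical list of data frames covering different hours
dfs <- list(
  a = data.frame(
    datetime = as.POSIXct(c("2015-04-01 00:00:00", "2015-04-01 02:00:00"), tz = "GMT"),
    precip = c(0.1, 0.3)
  ),
  b = data.frame(
    datetime = as.POSIXct("2015-04-01 01:00:00", tz = "GMT"),
    precip = 0.2
  )
)

# Every hour that should be present in the result
grid <- data.frame(
  datetime = seq(as.POSIXct("2015-04-01 00:00:00", tz = "GMT"),
                 by = "hour", length.out = 3)
)

# x is the current list element; the whole data frame is merged, not x$precip
filled <- lapply(dfs, function(x) merge(grid, x, by = "datetime", all.x = TRUE))

filled$a$precip   # 0.1  NA 0.3
filled$b$precip   #  NA 0.2  NA
```

Referring to `dat$precip` inside the function, as in the question, ignores `x` entirely and passes a numeric vector (or NULL) to the join, which is what triggers the `tbl_vars` error.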

All together, you need something like this:

library(dplyr)

df$datetime <- as.POSIXct(df$datetime, tz = "GMT")
df <- as_tibble(df)    # not necessary, but prints nicely

list_df <- list(df, df)    # fake list of data.frames
# make a data.frame of sequence to join on
seq_df <- tibble(datetime = seq.POSIXt(as.POSIXct("2015-04-01 0:00:00", tz = 'GMT'), 
                                       as.POSIXct("2015-11-30 23:59:59", tz = 'GMT'), 
                                       by = "hour"))

lapply(list_df, function(x){full_join(x, seq_df)})
# Joining by: "datetime"
# Joining by: "datetime"
# [[1]]
# Source: local data frame [5,857 x 3]
# 
#               datetime precip   code
#                 (POSI)  (lgl) (fctr)
# 1  2015-04-15 00:00:00     NA      M
# 2  2015-04-15 01:00:00     NA      M
# 3  2015-04-15 02:00:00     NA      M
# 4  2015-04-15 03:00:00     NA      M
# 5  2015-04-15 04:00:00     NA      M
# 6  2015-04-15 05:00:00     NA      M
# 7  2015-04-01 04:00:00     NA     NA
# 8  2015-04-01 05:00:00     NA     NA
# 9  2015-04-01 06:00:00     NA     NA
# 10 2015-04-01 07:00:00     NA     NA
# ..                 ...    ...    ...
# 
# [[2]]
# Source: local data frame [5,857 x 3]
# 
#               datetime precip   code
#                 (POSI)  (lgl) (fctr)
# 1  2015-04-15 00:00:00     NA      M
# 2  2015-04-15 01:00:00     NA      M
# 3  2015-04-15 02:00:00     NA      M
# 4  2015-04-15 03:00:00     NA      M
# 5  2015-04-15 04:00:00     NA      M
# 6  2015-04-15 05:00:00     NA      M
# 7  2015-04-01 04:00:00     NA     NA
# 8  2015-04-01 05:00:00     NA     NA
# 9  2015-04-01 06:00:00     NA     NA
# 10 2015-04-01 07:00:00     NA     NA
# ..                 ...    ...    ...
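
The `full_join` output above lists the original rows first and the added hours after them. If a chronologically ordered series is wanted, one option (a sketch, not part of the original answer) is to start from the hourly sequence and left-join each data frame onto it. The names `hourly` and `fill_hours` are hypothetical, and the six-row `df` is rebuilt inline as a plain data.frame.

```r
library(dplyr)

# Hourly grid for the whole target period, in GMT
hourly <- data.frame(
  datetime = seq(as.POSIXct("2015-04-01 00:00:00", tz = "GMT"),
                 as.POSIXct("2015-11-30 23:00:00", tz = "GMT"),
                 by = "hour")
)

# Stand-in for one element of the list: six hours on 2015-04-15
df <- data.frame(
  datetime = seq(as.POSIXct("2015-04-15 00:00:00", tz = "GMT"),
                 by = "hour", length.out = 6),
  precip = NA_real_,
  code = "M"
)
list_df <- list(df, df)

# left_join keeps the row order of its first argument, so the result
# follows the grid; hours missing from x get NA in precip and code
fill_hours <- function(x) left_join(hourly, x, by = "datetime")

filled <- lapply(list_df, fill_hours)
sapply(filled, nrow)   # 5856 5856
```

Existing precip values are carried over unchanged because the join only adds rows for hours that are absent from `x`. This assumes each data frame has at most one row per hour and no timestamps outside the grid; duplicates would add rows, and out-of-range timestamps would be dropped.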

Data:

df <- structure(list(datetime = structure(c(1429056000, 1429059600, 1429063200, 1429066800, 
    1429070400, 1429074000), class = c("POSIXct", "POSIXt"), tzone = "GMT"), precip = c(NA, 
    NA, NA, NA, NA, NA), code = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "M", 
    class = "factor")), .Names = c("datetime", "precip", "code"), row.names = c("1", 
    "2", "3", "4", "5", "6"), class = c("tbl_df", "tbl", "data.frame"))