How to create a continuous time series within groups in data.table?

Time: 2017-09-13 16:45:56

Tags: r data.table

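I have a data.table containing hourly time series of observations from different locations (sites). Each series has gaps, that is, missing hours. I want to fill in the hourly sequence for each site so that every series has one row per hour (although the data in the new rows will be missing, NA).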

Example data:

library(data.table)
library(lubridate)

DT <- data.table(site = rep(LETTERS[1:2], each = 3),
                 date = ymd_h(c("2017080101", "2017080103", "2017080105",
                                "2017080103", "2017080105", "2017080107")),
                 # x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3), 
                 x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 
                 key = c("site", "date"))
DT
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 03:00:00 1.2
# 3:    A 2017-08-01 05:00:00 1.3
# 4:    B 2017-08-01 03:00:00 2.1
# 5:    B 2017-08-01 05:00:00 2.2
# 6:    B 2017-08-01 07:00:00 2.3

The desired result DT2 would contain every hour between each site's first (minimum) and last (maximum) date, with x missing in the newly inserted rows:

#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3

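I tried joining DT with a sequence of dates constructed from min(date) and max(date). This is a step in the right direction, but the date range spans all sites rather than each site, the site is missing in the filled rows, and the sort order (key) is wrong: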

DT[.(seq(from = min(date), to = max(date), by = "hour")),
    .SD, on="date"]
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:   NA 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    B 2017-08-01 03:00:00 2.1
# 5:   NA 2017-08-01 04:00:00  NA
# 6:    A 2017-08-01 05:00:00 1.3
# 7:    B 2017-08-01 05:00:00 2.2
# 8:   NA 2017-08-01 06:00:00  NA
# 9:    B 2017-08-01 07:00:00 2.3

So naturally I tried adding by = site:

DT[.(seq(from = min(date), to = max(date), by = "hour")),
   .SD, on="date", by=.(site)]
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 03:00:00 1.2
# 3:    A 2017-08-01 05:00:00 1.3
# 4:   NA                <NA>  NA
# 5:    B 2017-08-01 03:00:00 2.1
# 6:    B 2017-08-01 05:00:00 2.2
# 7:    B 2017-08-01 07:00:00 2.3

But this doesn't work either. Can anyone suggest the right data.table formulation to produce the desired filled-in DT2 shown above?

2 answers:

Answer 0: (score: 2)

library(data.table)
library(lubridate)  
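setDT(DT)
# Build the full hourly sequence between each site's first and last timestamp
test <- DT[, .(date = seq(min(date), max(date), by = 'hour')), by = 'site']
# Left join the original data onto that sequence; hours with no observation get x = NA
DT <- merge(test, DT, by = c('site', 'date'), all.x = TRUE)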


DT
    site                date   x
 1:    A 2017-08-01 01:00:00 1.1
 2:    A 2017-08-01 02:00:00  NA
 3:    A 2017-08-01 03:00:00 1.2
 4:    A 2017-08-01 04:00:00  NA
 5:    A 2017-08-01 05:00:00 1.3
 6:    B 2017-08-01 03:00:00 2.1
 7:    B 2017-08-01 04:00:00  NA
 8:    B 2017-08-01 05:00:00 2.2
 9:    B 2017-08-01 06:00:00  NA
10:    B 2017-08-01 07:00:00 2.3
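
The two steps in the code above (build a per-site hourly grid, then left join the data onto it) can be sketched in a self-contained form with a quick check that no gaps remain. This is only an illustrative sketch: the names grid, filled and check are hypothetical, and the example input is rebuilt with base R's as.POSIXct() and an assumed UTC time zone so that lubridate is not needed.

```r
library(data.table)

# Rebuild the example input (tz = "UTC" is an assumption made for this sketch)
DT <- data.table(
  site = rep(LETTERS[1:2], each = 3),
  date = as.POSIXct(c("2017-08-01 01:00", "2017-08-01 03:00", "2017-08-01 05:00",
                      "2017-08-01 03:00", "2017-08-01 05:00", "2017-08-01 07:00"),
                    tz = "UTC"),
  x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)
)

# Step 1: one row per site and hour, from that site's first to last timestamp
grid <- DT[, .(date = seq(min(date), max(date), by = "hour")), by = site]

# Step 2: left join the observations onto the grid; unmatched hours get x = NA
filled <- merge(grid, DT, by = c("site", "date"), all.x = TRUE)

# Check per site: row count, number of inserted (NA) rows, and the largest
# step between consecutive timestamps in hours (1 means no remaining gaps)
check <- filled[, .(n_rows = .N,
                    n_missing = sum(is.na(x)),
                    max_step_hours = max(as.numeric(diff(date), units = "hours"))),
                by = site]
print(check)
```

Keeping the grid in its own variable, rather than overwriting DT, leaves the original table available for comparison.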

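Answer 1: (score: 1)

Thanks to Frank and Wen for putting me on the right track. I found a compact data.table solution. The result DT2 is also keyed on site and date, as in the input table (I didn't request this in the OP, but it is desirable). This is a reformulation of Wen's solution in data.table syntax, which I assume would be slightly more efficient on large datasets.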

DT2 <- DT[setkey(DT[, .(date = seq(from = min(date), to = max(date), 
                         by = "hour")), by = site], site, date), ]
DT2
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
key(DT2)
# [1] "site" "date"

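EDIT1: As Frank mentioned, the on= syntax can also be used. The following DT3 expression gives the correct answer, but DT3 is not keyed, whereas the DT2 result is keyed. This means an 'extra' setkey() is needed if a keyed result is desired.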

DT3 <- DT[DT[, .(date = seq(from = min(date), to = max(date), 
                  by = "hour")), by = site], on = c("site", "date"), ]
DT3
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
key(DT3)
# NULL
all.equal(DT2, DT3)
# [1] "Datasets has different keys. 'target': site, date. 'current' has no key."
all.equal(DT2, DT3, check.attributes = FALSE)
# [1] TRUE
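
For reference, the 'extra' setkey() step could look like the sketch below. It rebuilds the example with base R date parsing and an assumed UTC time zone; setkey() sorts DT3 by reference and marks it as keyed, so no reassignment is needed.

```r
library(data.table)

# Rebuild the keyed example input (tz = "UTC" is an assumption made for this sketch)
DT <- data.table(
  site = rep(LETTERS[1:2], each = 3),
  date = as.POSIXct(c("2017-08-01 01:00", "2017-08-01 03:00", "2017-08-01 05:00",
                      "2017-08-01 03:00", "2017-08-01 05:00", "2017-08-01 07:00"),
                    tz = "UTC"),
  x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3),
  key = c("site", "date")
)

# Same join as the DT3 expression, using on= instead of relying on keys
DT3 <- DT[DT[, .(date = seq(from = min(date), to = max(date), by = "hour")),
             by = site],
          on = c("site", "date")]

# Explicitly key the result on site and date (modifies DT3 in place)
setkey(DT3, site, date)
key(DT3)
```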

Is there a way to write the DT3 expression so that it gives a keyed result, other than explicitly using setkey()?

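EDIT2: Frank's comment suggests an additional formulation, DT4, using keyby = .EACHI. In this case .SD is inserted as j, which is required when using by or keyby. This gives the correct answer, and the result is keyed like the DT2 expression.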

DT4 <- DT[DT[, .(date = seq(from = min(date), to = max(date), by = "hour")), 
             by = site], on = c("site", "date"), .SD, keyby = .EACHI]
DT4
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
key(DT4)
# [1] "site" "date"
identical(DT2, DT4)
# [1] TRUE