R - Merge unequal regular (15-minute interval) and irregular time series within 10 minutes

Time: 2015-01-12 22:12:57

Tags: r time-series data.table na

Revised and updated post

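I want to merge two time series of unequal length. x contains 20,000+ rows, while y contains fewer than 5,000 rows. x has a regular 15-minute interval, while y is an irregular time series.

I want to combine x and y based on whether a date-time in y is within 10 minutes of a date-time in x. For date-times that are the same in x and y, I want to add the rows together using the function shown below.

I previously wrote x incorrectly, so I am correcting it here: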

library(data.table)

dput(x)
x <- structure(list(Date = structure(c(1078077600, 1080028800, 1080029700, 
1080030600, 1080031500, 1091220300, 1091221200, 1091222100, 1091223000, 
1091224800, 1091225700, 1091226600, 1091227500, 1091228400), class = 
c("POSIXct", "POSIXt"), tzone = "Etc/GMT-6"), V1 = c(1.6, 1.9, 1.9, 2, 2, 
1.4, 1.4, 1.5, 1.5, 1.6, 2.6, 2.8, 3.4, 3.8), V2 = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), V3 = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), V4 = c(1.5, 2.3, 2.3, 2.4, 2.4, 7.8, 3.2, 
4.9, 4.7, 3.4, 17.1, 25.4, 16.9, 30.6)), .Names = c("Date", "V1", 
"V2", "V3", "V4"), row.names = c(NA, -14L), class = "data.frame")


#      Date               V1 V2 V3   V4
# 1  2004-03-01 00:00:00 1.6  0  0  1.5
# 2  2004-03-23 14:00:00 1.9  0  0  2.3
# 3  2004-03-23 14:15:00 1.9  0  0  2.3
# 4  2004-03-23 14:30:00 2.0  0  0  2.4
# 5  2004-03-23 14:45:00 2.0  0  0  2.4
# 6  2004-07-31 02:45:00 1.4  0  0  7.8
# 7  2004-07-31 03:00:00 1.4  0  0  3.2
# 8  2004-07-31 03:15:00 1.5  0  0  4.9
# 9  2004-07-31 03:30:00 1.5  0  0  4.7
# 10 2004-07-31 04:00:00 1.6  0  0  3.4
# 11 2004-07-31 04:15:00 2.6  0  0 17.1
# 12 2004-07-31 04:30:00 2.8  0  0 25.4
# 13 2004-07-31 04:45:00 3.4  0  0 16.9
# 14 2004-07-31 05:00:00 3.8  0  0 30.6

dput(y)
y <- structure(list(Date = structure(c(1076902200, 1080029700, 1091221800, 
1091224800, 1091226600), class = c("POSIXct", "POSIXt"), 
tzone = "Etc/GMT-6"), V1 = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_), V2 = c(40, 42, 0, 0, 0), V3 = c(0, 0, 0, 0, 0), V4 = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Date", 
"V1", "V2", "V3", "V4"), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"), sorted = "Date")
# .internal.selfref (a session-specific pointer printed by dput) is omitted so that the call parses

#                   Date V1 V2 V3 V4
# 1: 2004-02-16 09:30:00 NA 40  0 NA
# 2: 2004-03-23 14:15:00 NA 42  0 NA
# 3: 2004-07-31 03:10:00 NA  0  0 NA
# 4: 2004-07-31 04:00:00 NA  0  0 NA
# 5: 2004-07-31 04:30:00 NA  0  0 NA

Based on the answer given by BondedDust in "Using `:=` in data.table to sum the values of two columns in R, ignoring NAs", I wrote a binary operator that replaces "+" when the date-times in x and y are the same.

`%+na%` <- function(x, y) {ifelse(x == 0 & is.na(y) == TRUE, NA, 
 ifelse(x != 0 & is.na(y) == TRUE, x, x+y))}
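As a quick check of what the operator does, here is a minimal sketch that applies it to a few made-up values (the inputs are illustrative and not taken from x or y). It suggests that the operator is not symmetric: an NA on the right-hand side is tolerated when the left-hand side is nonzero, but an NA on the left-hand side always yields NA, so the order of the operands matters.

```r
# Same definition as in the post, repeated so the snippet runs on its own
`%+na%` <- function(x, y) {
  ifelse(x == 0 & is.na(y), NA, ifelse(x != 0 & is.na(y), x, x + y))
}

# Left operand plays the role of the x row, right operand the y row
res <- c(0, 2, 3) %+na% c(NA, NA, 4)
# 0 with NA gives NA, 2 with NA keeps 2, 3 with 4 gives 7
print(res)

# An NA on the left is not rescued by a value on the right
flipped <- NA_real_ %+na% 2
print(flipped)
```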

Based on Arun's answer, I have the following code:

setkey(setDT(x), Date)
setkey(setDT(y), Date)
xidx = x[y, roll = "nearest", which = TRUE]
yidx = which(abs(x$Date[xidx] - y$Date) <= 600)
xy <- rbind(x, y[yidx, ])

This is what xy looks like at this point:

dput(xy)
xy <- structure(list(Date = structure(c(1078077600, 1080028800, 1080029700, 
1080030600, 1080031500, 1091220300, 1091221200, 1091222100, 1091223000, 
1091224800, 1091225700, 1091226600, 1091227500, 1091228400, 1080029700, 
1091221800, 1091224800, 1091226600), class = c("POSIXct", "POSIXt"
), tzone = "Etc/GMT-6"), V1 = c(1.6, 1.9, 1.9, 2, 2, 1.4, 1.4, 
1.5, 1.5, 1.6, 2.6, 2.8, 3.4, 3.8, NA, NA, NA, NA), V2 = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 42, 0, 0, 0), V3 = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), V4 = c(1.5, 
2.3, 2.3, 2.4, 2.4, 7.8, 3.2, 4.9, 4.7, 3.4, 17.1, 25.4, 16.9, 
30.6, NA, NA, NA, NA)), .Names = c("Date", "V1", "V2", "V3", 
"V4"), row.names = c(NA, -18L), class = c("data.table", "data.frame"
))
# .internal.selfref (a session-specific pointer printed by dput) is omitted so that the call parses

#                   Date  V1 V2 V3   V4
# 1: 2004-03-01 00:00:00 1.6  0  0  1.5
# 2: 2004-03-23 14:00:00 1.9  0  0  2.3
# 3: 2004-03-23 14:15:00 1.9  0  0  2.3
# 4: 2004-03-23 14:30:00 2.0  0  0  2.4
# 5: 2004-03-23 14:45:00 2.0  0  0  2.4
# 6: 2004-07-31 02:45:00 1.4  0  0  7.8
# 7: 2004-07-31 03:00:00 1.4  0  0  3.2
# 8: 2004-07-31 03:15:00 1.5  0  0  4.9
# 9: 2004-07-31 03:30:00 1.5  0  0  4.7
# 10: 2004-07-31 04:00:00 1.6  0  0  3.4
# 11: 2004-07-31 04:15:00 2.6  0  0 17.1
# 12: 2004-07-31 04:30:00 2.8  0  0 25.4
# 13: 2004-07-31 04:45:00 3.4  0  0 16.9
# 14: 2004-07-31 05:00:00 3.8  0  0 30.6
# 15: 2004-03-23 14:15:00  NA 42  0   NA
# 16: 2004-07-31 03:10:00  NA  0  0   NA
# 17: 2004-07-31 04:00:00  NA  0  0   NA
# 18: 2004-07-31 04:30:00  NA  0  0   NA

The next line is modified from the solution provided by akrun in "Identifying duplicated rows":

xy[, lapply(.SD, xy[which(duplicated(xy))] %+na% xy[which(duplicated(xy, 
fromLast = TRUE))]), keyby = Date]

Can someone suggest a modification to the last line so that I do not receive this error message?

# Note the new error message that I am receiving:
# Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), 
# nrow = nr, : length of 'dimnames' [2] not equal to array extent 

Thank you in advance.

Below is the final expected result (it has changed; see row 8 with the NAs):

dput(xy)
xy <- structure(list(Date = structure(c(1078077600, 1080028800, 1080029700, 
1080030600, 1080031500, 1091220300, 1091221200, 1091221800, 1091222100, 
1091223000, 1091224800, 1091225700, 1091226600, 1091227500, 1091228400
), class = c("POSIXct", "POSIXt"), tzone = "Etc/GMT-6"), V1 = c(1.6, 
1.9, 1.9, 2, 2, 1.4, 1.4, NA, 1.5, 1.5, 1.6, 2.6, 2.8, 3.4, 3.8
), V2 = c(0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), V3 = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), V4 = c(1.5, 2.3, 2.3, 
2.4, 2.4, 7.8, 3.2, NA, 4.9, 4.7, 3.4, 17.1, 25.4, 16.9, 30.6
)), .Names = c("Date", "V1", "V2", "V3", "V4"), row.names = c(NA, 
-15L), class = "data.frame")


#                  Date  V1 V2 V3   V4
# 1  2004-03-01 00:00:00 1.6  0  0  1.5
# 2  2004-03-23 14:00:00 1.9  0  0  2.3
# 3  2004-03-23 14:15:00 1.9 42  0  2.3
# 4  2004-03-23 14:30:00 2.0  0  0  2.4
# 5  2004-03-23 14:45:00 2.0  0  0  2.4
# 6  2004-07-31 02:45:00 1.4  0  0  7.8
# 7  2004-07-31 03:00:00 1.4  0  0  3.2
# 8  2004-07-31 03:10:00  NA  0  0   NA  <-- Notice: the change here
# 9  2004-07-31 03:15:00 1.5  0  0  4.9
# 10 2004-07-31 03:30:00 1.5  0  0  4.7
# 11 2004-07-31 04:00:00 1.6  0  0  3.4
# 12 2004-07-31 04:15:00 2.6  0  0 17.1
# 13 2004-07-31 04:30:00 2.8  0  0 25.4
# 14 2004-07-31 04:45:00 3.4  0  0 16.9
# 15 2004-07-31 05:00:00 3.8  0  0 30.6

The NA values in the columns of xy will be interpolated with na.approx, which is what changes the original question posted here.
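For illustration, a minimal sketch of that interpolation step, assuming na.approx refers to the function in the zoo package. It uses the V1 values from rows 7 to 9 of the expected result; passing the times as the x argument weights the interpolation by the actual spacing (10 minutes out of 15) rather than by row position. The variable names are made up for this sketch.

```r
library(zoo)

# Rows 7-9 of the expected result: 03:00, 03:10 (NA), 03:15
t0 <- as.POSIXct("2004-07-31 03:00:00", tz = "UTC")
times <- t0 + c(0, 600, 900)
v1 <- c(1.4, NA, 1.5)

# Linear interpolation over time; na.rm = FALSE keeps the vector length unchanged
filled <- na.approx(v1, x = as.numeric(times), na.rm = FALSE)
print(filled)
```

Without the x argument, na.approx would treat the three rows as equally spaced, which is probably not what is wanted for an irregular series.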


Update: possible solution

The following code is partially taken from Arun's answer:

setkey(setDT(x), Date)
setkey(setDT(y), Date)
xidx = x[y, roll = "nearest", which = TRUE]
yidx = which(abs(x$Date[xidx] - y$Date) <= 600)
xy <- rbind(x, y[yidx, ])
setkey(xy, Date)

Both a and b come from "Identifying duplicated rows":

a <- which(duplicated(xy, fromLast = TRUE))
b <- which(duplicated(xy))

Can someone suggest a better way to carry out the following steps?

xyadd <- vector("list", length(a)) # pre-allocate the list since it is
# being used in a for loop / Sources:
# "Add a Column to a Dataframe From a List of Values"
# "Why does is.vector() return TRUE for list?"

for(u in seq(a))
{
xyadd[[u]] <- xy[a[u], .SD, .SDcols = 2:5] %+na% xy[b[u], .SD, .SDcols = 2:5]
}

xyadd2 <- data.frame(unlist(xyadd))
xyadd2 <- ifelse(nrow(xyadd2) > prod(length(a)*4), xyadd2 <- 
data.frame(xyadd2[-nrow(xyadd2), ]), xyadd2) 
# 4 comes from the 4 columns that I have
xyadd2 <- xyadd2[1][[1]]
xyadd2 <- matrix(data = xyadd2, nrow = length(a), ncol = 4, byrow = TRUE)
xyadd2 <- as.data.frame(xyadd2)
xyadd2 <- setDT(xyadd2)

xy[a, `:=` (V1 = xyadd2[, V1], V2 = xyadd2[, V2], V3 = xyadd2[, V3], 
V4 = xyadd2[, V4])]
xy <- xy[-b, ]

With this, I get the same xy as shown above.
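One possible way to avoid the loop is a grouped aggregation with an NA-aware sum. The following is only a sketch on toy data: `sum_na` and `xy_toy` are hypothetical names, and the helper is not identical to `%+na%` (it returns 0 rather than NA for a 0 combined with NA), so it is an assumption that this difference is acceptable.

```r
library(data.table)

# Hypothetical helper: ignore NAs when summing, but stay NA if the whole group is NA
sum_na <- function(v) if (all(is.na(v))) NA_real_ else sum(v, na.rm = TRUE)

t0 <- as.POSIXct("2004-07-31 03:00:00", tz = "UTC")
# Toy stand-in for the row-bound xy: 03:15 occurs twice (once from x, once from y),
# while 03:10 comes from y only
xy_toy <- data.table(
  Date = t0 + c(0, 900, 600, 900),
  V1   = c(1.4, 1.5, NA, NA),
  V2   = c(0, 0, 0, 42)
)

# One row per Date; duplicated dates are collapsed column by column
res <- xy_toy[, lapply(.SD, sum_na), keyby = Date]
print(res)
```

Rows that exist only in y keep their NAs, so they remain available for a later interpolation step.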

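I have not posted the potential solution above as an answer because I would like feedback on making the code more efficient. Dataset x contains 20,000+ rows and dataset y contains fewer than 5,000 rows. The solution needs to be applied to about 20 sets of files.

Any help is greatly appreciated.

Thank you in advance.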

3 Answers:

Answer 0: (score: 2)

A dplyr solution

A <- expand.grid(y$Date, x$Date) #all possible combination of dates
#indices of y, of which the time diff to x is less than 10 min
ind <- which(abs(A$Var1-A$Var2)<10*60) %% nrow(y)
ind[ind==0] <- nrow(y) 
y1 <- y[ind, ] #dump the obsolete values

library(dplyr)
bind_rows(x, y1) %>%  #alternative to rbind(x,y1)
  group_by(Date) %>%
  summarise_each(funs(sum))     
#Source: local data frame [15 x 5]
#
#                  Date  V1 V2 V3   V4
#1  2004-03-01 00:00:00 1.6  0  0  1.5
#2  2004-03-23 14:00:00 1.9  0  0  2.3
#3  2004-03-23 14:15:00 1.9 42  0  2.3
#4  2004-03-23 14:30:00 2.0  0  0  2.4
#5  2004-03-23 14:45:00 2.0  0  0  2.4
#6  2004-07-31 02:45:00 1.4  0  0  7.8
#7  2004-07-31 03:00:00 1.4  0  0  3.2
#8  2004-07-31 03:10:00 0.0  0  0  0.0
#9  2004-07-31 03:15:00 1.5  0  0  4.9
#10 2004-07-31 03:30:00 1.5  0  0  4.7
#11 2004-07-31 04:00:00 1.6  0  0  3.4
#12 2004-07-31 04:15:00 2.6  0  0 17.1
#13 2004-07-31 04:30:00 2.8  0  0 25.4
#14 2004-07-31 04:45:00 3.4  0  0 16.9
#15 2004-07-31 05:00:00 3.8  0  0 30.6
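Note that expand.grid builds every pair of dates, which for the sizes mentioned in the question (20,000+ rows against fewer than 5,000) is on the order of 100 million pairs, and that this answer uses a strict `<` where the other answers use `<=`. As a lighter alternative, here is a base R sketch that finds the distance to the nearest x time with findInterval; `nearest_gap` is a hypothetical helper, not part of this answer, and the inputs are made-up numbers of seconds.

```r
# Hypothetical helper: for each y time, the distance (in the same units as the
# input) to the closest x time
nearest_gap <- function(xt, yt) {
  xt <- sort(as.numeric(xt))
  yt <- as.numeric(yt)
  n <- length(xt)
  i <- findInterval(yt, xt)                        # index of the last xt <= yt, 0 if none
  lo <- ifelse(i >= 1, xt[pmax(i, 1)], -Inf)       # neighbour at or below
  hi <- ifelse(i < n, xt[pmin(i + 1, n)], Inf)     # neighbour above
  pmin(yt - lo, hi - yt)
}

# Toy inputs in seconds: a 15-minute grid and four irregular stamps
gaps <- nearest_gap(c(0, 900, 1800), c(600, 900, 3600, -100))
keep <- which(gaps <= 600)   # y rows within 10 minutes of some x time
print(gaps)
print(keep)
```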

Answer 1: (score: 2)

Using data.table's rolling joins:

require(data.table)
setkey(setDT(x), Date)
setkey(setDT(y), Date)
xidx = x[y, roll="nearest", which=TRUE]
yidx = which(abs(x$Date[xidx] - y$Date) <= 600)
ans = rbind(x, y[yidx]) ## uses faster 'rbind.data.table'
ans[, lapply(.SD, sum), keyby=Date]

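The way I approached the problem is as follows:

Join y to x on the Date column using roll="nearest" and obtain, for every row of y, the index of the nearest row in x. Then find those indices in y for which the nearest matched date is at most 10 minutes away. Use them to subset y, bind the subset to x, and then aggregate.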
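The steps above can be traced on a small toy example (the data below are made up, not the poster's). One detail in this sketch differs from the answer's code: the gap is computed with difftime and units = "secs", because subtracting two POSIXct vectors picks its units automatically, and comparing the result to 600 is only meaningful when those units happen to be seconds.

```r
library(data.table)

t0 <- as.POSIXct("2004-07-31 03:00:00", tz = "UTC")
# Toy data: x on a 15-minute grid, y irregular
x <- data.table(Date = t0 + c(0, 900, 1800), V2 = c(0, 0, 0))
y <- data.table(Date = t0 + c(600, 900, 3600), V2 = c(5, 7, 9))
setkey(x, Date)
setkey(y, Date)

# Step 1: for each row of y, the row number of the nearest Date in x
xidx <- x[y, roll = "nearest", which = TRUE]

# Step 2: keep the y rows whose nearest x time is at most 600 seconds away
gap <- abs(as.numeric(difftime(x$Date[xidx], y$Date, units = "secs")))
yidx <- which(gap <= 600)

# Step 3: bind the kept y rows to x and aggregate rows that share a Date
ans <- rbind(x, y[yidx])[, lapply(.SD, sum), keyby = Date]
print(ans)
```

In this toy case the 04:00 row of y is dropped because its nearest x time (03:30) is 30 minutes away, the 03:15 row is added to the matching x row, and the 03:10 row is kept as a new row.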

Answer 2: (score: 1)

Based mainly on Arun's answer, I created the following data.table solution.

setkey(setDT(x), Date)
setkey(setDT(y), Date)
xidx = x[y, roll = "nearest", which = TRUE]
yidx = which(abs(x$Date[xidx] - y$Date) <= 600)
xy <- rbind(x, y[yidx, ]) # In this line, I have added the comma after yidx, 
# which is the only difference from Arun's solution.
xy <- xy[, lapply(.SD, sum), keyby = Date]

Update

Based on Arun's comment, I have revised this solution below. Thank you, Arun.

# xy
#                  Date  V1 V2 V3   V4
# 1: 2004-03-01 00:00:00 1.6  0  0  1.5
# 2: 2004-03-23 14:00:00 1.9  0  0  2.3
# 3: 2004-03-23 14:15:00 1.9 42  0  2.3
# 4: 2004-03-23 14:30:00 2.0  0  0  2.4
# 5: 2004-03-23 14:45:00 2.0  0  0  2.4
# 6: 2004-07-31 02:45:00 1.4  0  0  7.8
# 7: 2004-07-31 03:00:00 1.4  0  0  3.2
# 8: 2004-07-31 03:10:00 0.0  0  0  0.0
# 9: 2004-07-31 03:15:00 1.5  0  0  4.9
# 10: 2004-07-31 03:30:00 1.5  0  0  4.7
# 11: 2004-07-31 04:00:00 1.6  0  0  3.4
# 12: 2004-07-31 04:15:00 2.6  0  0 17.1
# 13: 2004-07-31 04:30:00 2.8  0  0 25.4
# 14: 2004-07-31 04:45:00 3.4  0  0 16.9
# 15: 2004-07-31 05:00:00 3.8  0  0 30.6