Revised and updated post
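I want to merge two unequal time series. x contains more than 20,000 rows, while y contains fewer than 5,000. x has a regular 15-minute interval, while y is an irregular time series.
I want to combine x and y based on whether the date-time in y is within 10 minutes of a date-time in x. For date-times that are identical in x and y, I want the rows added together using the function shown below.
I wrote x incorrectly earlier, so I am correcting it here: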
library(data.table)
dput(x)
x <- structure(list(Date = structure(c(1078077600, 1080028800, 1080029700,
1080030600, 1080031500, 1091220300, 1091221200, 1091222100, 1091223000,
1091224800, 1091225700, 1091226600, 1091227500, 1091228400), class =
c("POSIXct", "POSIXt"), tzone = "Etc/GMT-6"), V1 = c(1.6, 1.9, 1.9, 2, 2,
1.4, 1.4, 1.5, 1.5, 1.6, 2.6, 2.8, 3.4, 3.8), V2 = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), V3 = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), V4 = c(1.5, 2.3, 2.3, 2.4, 2.4, 7.8, 3.2,
4.9, 4.7, 3.4, 17.1, 25.4, 16.9, 30.6)), .Names = c("Date", "V1",
"V2", "V3", "V4"), row.names = c(NA, -14L), class = "data.frame")
# Date V1 V2 V3 V4
# 1 2004-03-01 00:00:00 1.6 0 0 1.5
# 2 2004-03-23 14:00:00 1.9 0 0 2.3
# 3 2004-03-23 14:15:00 1.9 0 0 2.3
# 4 2004-03-23 14:30:00 2.0 0 0 2.4
# 5 2004-03-23 14:45:00 2.0 0 0 2.4
# 6 2004-07-31 02:45:00 1.4 0 0 7.8
# 7 2004-07-31 03:00:00 1.4 0 0 3.2
# 8 2004-07-31 03:15:00 1.5 0 0 4.9
# 9 2004-07-31 03:30:00 1.5 0 0 4.7
# 10 2004-07-31 04:00:00 1.6 0 0 3.4
# 11 2004-07-31 04:15:00 2.6 0 0 17.1
# 12 2004-07-31 04:30:00 2.8 0 0 25.4
# 13 2004-07-31 04:45:00 3.4 0 0 16.9
# 14 2004-07-31 05:00:00 3.8 0 0 30.6
dput(y)
y <- structure(list(Date = structure(c(1076902200, 1080029700, 1091221800,
1091224800, 1091226600), class = c("POSIXct", "POSIXt"),
tzone = "Etc/GMT-6"), V1 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), V2 = c(40, 42, 0, 0, 0), V3 = c(0, 0, 0, 0, 0), V4 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Date",
"V1", "V2", "V3", "V4"), row.names = c(NA, -5L), class = c("data.table",
"data.frame"), sorted = "Date")
# (.internal.selfref pointer from the dput() output omitted so this parses)
# Date V1 V2 V3 V4
# 1: 2004-02-16 09:30:00 NA 40 0 NA
# 2: 2004-03-23 14:15:00 NA 42 0 NA
# 3: 2004-07-31 03:10:00 NA 0 0 NA
# 4: 2004-07-31 04:00:00 NA 0 0 NA
# 5: 2004-07-31 04:30:00 NA 0 0 NA
Based on the answer BondedDust gave in "Using `:=` in data.table to sum the values of two columns in R, ignoring NAs", I wrote a binary operator for "+" to use when the date-times in x and y are the same.
`%+na%` <- function(x, y) {ifelse(x == 0 & is.na(y) == TRUE, NA,
ifelse(x != 0 & is.na(y) == TRUE, x, x+y))}
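To make the operator's three branches concrete, here is a small check on hand-picked values. The input vectors are illustrative assumptions chosen to hit each branch; they are not taken from x or y.

```r
# Operator as defined in the post (is.na(y) == TRUE shortened to is.na(y)).
`%+na%` <- function(x, y) {
  ifelse(x == 0 & is.na(y), NA,
         ifelse(x != 0 & is.na(y), x, x + y))
}

# Illustrative inputs (assumed values, one per case of interest).
x_vals <- c(0, 1.9, 0, 2.3)
y_vals <- c(NA, NA, 42, 0.5)

res <- x_vals %+na% y_vals
print(res)  # elements: NA, 1.9, 42, 2.8
```

Note that a zero in x paired with NA in y yields NA rather than 0, which differs from a plain NA-ignoring sum.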
Based on Arun's answer, I have the following code:
setkey(setDT(x), Date)
setkey(setDT(y), Date)
xidx = x[y, roll = "nearest", which = TRUE]
yidx = which(abs(x$Date[xidx] - y$Date) <= 600)
xy <- rbind(x, y[yidx, ])
This is what xy looks like at this point:
dput(xy)
xy <- structure(list(Date = structure(c(1078077600, 1080028800, 1080029700,
1080030600, 1080031500, 1091220300, 1091221200, 1091222100, 1091223000,
1091224800, 1091225700, 1091226600, 1091227500, 1091228400, 1080029700,
1091221800, 1091224800, 1091226600), class = c("POSIXct", "POSIXt"
), tzone = "Etc/GMT-6"), V1 = c(1.6, 1.9, 1.9, 2, 2, 1.4, 1.4,
1.5, 1.5, 1.6, 2.6, 2.8, 3.4, 3.8, NA, NA, NA, NA), V2 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 42, 0, 0, 0), V3 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), V4 = c(1.5,
2.3, 2.3, 2.4, 2.4, 7.8, 3.2, 4.9, 4.7, 3.4, 17.1, 25.4, 16.9,
30.6, NA, NA, NA, NA)), .Names = c("Date", "V1", "V2", "V3",
"V4"), row.names = c(NA, -18L), class = c("data.table", "data.frame"
))
# (.internal.selfref pointer from the dput() output omitted so this parses)
# Date V1 V2 V3 V4
# 1: 2004-03-01 00:00:00 1.6 0 0 1.5
# 2: 2004-03-23 14:00:00 1.9 0 0 2.3
# 3: 2004-03-23 14:15:00 1.9 0 0 2.3
# 4: 2004-03-23 14:30:00 2.0 0 0 2.4
# 5: 2004-03-23 14:45:00 2.0 0 0 2.4
# 6: 2004-07-31 02:45:00 1.4 0 0 7.8
# 7: 2004-07-31 03:00:00 1.4 0 0 3.2
# 8: 2004-07-31 03:15:00 1.5 0 0 4.9
# 9: 2004-07-31 03:30:00 1.5 0 0 4.7
# 10: 2004-07-31 04:00:00 1.6 0 0 3.4
# 11: 2004-07-31 04:15:00 2.6 0 0 17.1
# 12: 2004-07-31 04:30:00 2.8 0 0 25.4
# 13: 2004-07-31 04:45:00 3.4 0 0 16.9
# 14: 2004-07-31 05:00:00 3.8 0 0 30.6
# 15: 2004-03-23 14:15:00 NA 42 0 NA
# 16: 2004-07-31 03:10:00 NA 0 0 NA
# 17: 2004-07-31 04:00:00 NA 0 0 NA
# 18: 2004-07-31 04:30:00 NA 0 0 NA
The next line is adapted from the solution akrun provided in "Identifying duplicated rows".
xy[, lapply(.SD, xy[which(duplicated(xy))] %+na% xy[which(duplicated(xy,
fromLast = TRUE))]), keyby = Date]
Can anyone suggest a modification to the last line so that I no longer receive this error message?
# Note the new error message that I am receiving:
# Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE),
# nrow = nr, : length of 'dimnames' [2] not equal to array extent
Thank you in advance.
Below is the final expected result (it has changed; see row 8 with NA):
dput(xy)
xy <- structure(list(Date = structure(c(1078077600, 1080028800, 1080029700,
1080030600, 1080031500, 1091220300, 1091221200, 1091221800, 1091222100,
1091223000, 1091224800, 1091225700, 1091226600, 1091227500, 1091228400
), class = c("POSIXct", "POSIXt"), tzone = "Etc/GMT-6"), V1 = c(1.6,
1.9, 1.9, 2, 2, 1.4, 1.4, NA, 1.5, 1.5, 1.6, 2.6, 2.8, 3.4, 3.8
), V2 = c(0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), V3 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), V4 = c(1.5, 2.3, 2.3,
2.4, 2.4, 7.8, 3.2, NA, 4.9, 4.7, 3.4, 17.1, 25.4, 16.9, 30.6
)), .Names = c("Date", "V1", "V2", "V3", "V4"), row.names = c(NA,
-15L), class = "data.frame")
# Date V1 V2 V3 V4
# 1 2004-03-01 00:00:00 1.6 0 0 1.5
# 2 2004-03-23 14:00:00 1.9 0 0 2.3
# 3 2004-03-23 14:15:00 1.9 42 0 2.3
# 4 2004-03-23 14:30:00 2.0 0 0 2.4
# 5 2004-03-23 14:45:00 2.0 0 0 2.4
# 6 2004-07-31 02:45:00 1.4 0 0 7.8
# 7 2004-07-31 03:00:00 1.4 0 0 3.2
# 8 2004-07-31 03:10:00 NA 0 0 NA <-- Notice: the change here
# 9 2004-07-31 03:15:00 1.5 0 0 4.9
# 10 2004-07-31 03:30:00 1.5 0 0 4.7
# 11 2004-07-31 04:00:00 1.6 0 0 3.4
# 12 2004-07-31 04:15:00 2.6 0 0 17.1
# 13 2004-07-31 04:30:00 2.8 0 0 25.4
# 14 2004-07-31 04:45:00 3.4 0 0 16.9
# 15 2004-07-31 05:00:00 3.8 0 0 30.6
The NA values in the columns of xy will be interpolated using na.approx, which changes the original question posted here.
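As a rough illustration of that interpolation step, the sketch below fills the NA in V1 at 03:10 using base R's approx() on rows 7 to 9 of the expected result. It is a stand-in, not the poster's code: na.approx comes from the zoo package, and whether it interpolates against the time stamps or against row position depends on how it is called, so the time-based weighting here is an assumption.

```r
# Times (seconds since the epoch) and V1 values for rows 7-9 of the expected xy.
t_sec <- c(1091221200, 1091221800, 1091222100)  # 03:00, 03:10, 03:15
v1    <- c(1.4, NA, 1.5)

# Linear interpolation at every time stamp, using only the known points.
ok     <- !is.na(v1)
filled <- approx(x = t_sec[ok], y = v1[ok], xout = t_sec)$y
print(filled)
```

Because 03:10 lies two thirds of the way from 03:00 to 03:15, the filled value sits two thirds of the way from 1.4 to 1.5.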
Update: a possible solution
The following code is taken in part from Arun's answer:
setkey(setDT(x), Date)
setkey(setDT(y), Date)
xidx = x[y, roll = "nearest", which = TRUE]
yidx = which(abs(x$Date[xidx] - y$Date) <= 600)
xy <- rbind(x, y[yidx, ])
setkey(xy, Date)
Both a and b come from "Identifying duplicated rows":
a <- which(duplicated(xy, fromLast = TRUE))
b <- which(duplicated(xy))
Can someone suggest a better way to carry out the following steps?
xyadd <- vector("list", length(a)) # pre-allocate the list since it is
# being used in a for loop / Sources:
# "Add a Column to a Dataframe From a List of Values" and
# "Why does is.vector() return TRUE for list?"
for(u in seq_along(a)) # seq_along() stays correct when a has length 0 or 1
{
xyadd[[u]] <- xy[a[u], .SD, .SDcols = 2:5] %+na% xy[b[u], .SD, .SDcols = 2:5]
}
xyadd2 <- data.frame(unlist(xyadd))
xyadd2 <- ifelse(nrow(xyadd2) > prod(length(a)*4), xyadd2 <-
data.frame(xyadd2[-nrow(xyadd2), ]), xyadd2)
# 4 comes from the 4 columns that I have
xyadd2 <- xyadd2[1][[1]]
xyadd2 <- matrix(data = xyadd2, nrow = length(a), ncol = 4, byrow = TRUE)
xyadd2 <- as.data.frame(xyadd2)
xyadd2 <- setDT(xyadd2)
xy[a, `:=` (V1 = xyadd2[, V1], V2 = xyadd2[, V2], V3 = xyadd2[, V3],
V4 = xyadd2[, V4])]
xy <- xy[-b, ]
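I obtain the same xy as shown above.
I have not posted the potential solution above as an answer because I would like feedback on making the code more efficient. Dataset x contains more than 20,000 rows and dataset y contains fewer than 5,000 rows. The solution needs to be applied to about 20 sets of files.
Any help is greatly appreciated.
Thank you in advance.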
Answer 0 (score: 2)
A dplyr solution:
A <- expand.grid(y$Date, x$Date) #all possible combination of dates
#indices of y, of which the time diff to x is less than 10 min
ind <- which(abs(A$Var1-A$Var2)<10*60) %% nrow(y)
ind[ind==0] <- nrow(y)
y1 <- y[ind, ] #dump the obsolete values
library(dplyr)
bind_rows(x, y1) %>% #alternative to rbind(x,y1)
group_by(Date) %>%
summarise_each(funs(sum))
#Source: local data frame [15 x 5]
#
# Date V1 V2 V3 V4
#1 2004-03-01 00:00:00 1.6 0 0 1.5
#2 2004-03-23 14:00:00 1.9 0 0 2.3
#3 2004-03-23 14:15:00 1.9 42 0 2.3
#4 2004-03-23 14:30:00 2.0 0 0 2.4
#5 2004-03-23 14:45:00 2.0 0 0 2.4
#6 2004-07-31 02:45:00 1.4 0 0 7.8
#7 2004-07-31 03:00:00 1.4 0 0 3.2
#8 2004-07-31 03:10:00 0.0 0 0 0.0
#9 2004-07-31 03:15:00 1.5 0 0 4.9
#10 2004-07-31 03:30:00 1.5 0 0 4.7
#11 2004-07-31 04:00:00 1.6 0 0 3.4
#12 2004-07-31 04:15:00 2.6 0 0 17.1
#13 2004-07-31 04:30:00 2.8 0 0 25.4
#14 2004-07-31 04:45:00 3.4 0 0 16.9
#15 2004-07-31 05:00:00 3.8 0 0 30.6
Answer 1 (score: 2)
Using data.table's rolling joins:
require(data.table)
setkey(setDT(x), Date)
setkey(setDT(y), Date)
xidx = x[y, roll="nearest", which=TRUE]
yidx = which(abs(x$Date[xidx] - y$Date) <= 600)
ans = rbind(x, y[yidx]) ## uses faster 'rbind.data.table'
ans[, lapply(.SD, sum), keyby=Date]
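The way I approached the problem is as follows:
For each row of y, get the index of the nearest row of x by joining on the Date column with roll="nearest". Then find those indices of y whose nearest matched date is at most 10 minutes away. Use them to subset y, bind the result to x, and then aggregate.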
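The matching step described above can be traced by hand in base R. This is only a sketch of what the rolling join computes, not data.table's implementation: the names x_t and y_t are introduced here for the Date columns of the sample x and y (as seconds since the epoch), and ties are broken by which.min(), which may not match data.table's tie handling.

```r
# Date columns of the sample x and y as seconds since the epoch.
x_t <- c(1078077600, 1080028800, 1080029700, 1080030600, 1080031500,
         1091220300, 1091221200, 1091222100, 1091223000, 1091224800,
         1091225700, 1091226600, 1091227500, 1091228400)
y_t <- c(1076902200, 1080029700, 1091221800, 1091224800, 1091226600)

# For each y time, the row of x with the smallest absolute time difference
# (a base R stand-in for x[y, roll = "nearest", which = TRUE]).
xidx <- vapply(y_t, function(t) which.min(abs(x_t - t)), integer(1))

# Keep only the y rows whose nearest x time is at most 600 seconds away.
yidx <- which(abs(x_t[xidx] - y_t) <= 600)

print(xidx)
print(yidx)
```

The first y row (2004-02-16) is weeks away from any x row and is dropped, while the 03:10 row is kept because it lies 5 minutes from 03:15. This loop compares every y time against all of x, so for 20,000 by 5,000 rows the rolling join is likely the faster choice.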
Answer 2 (score: 1)
Based mainly on Arun's answer, I created the following data.table solution.
setkey(setDT(x), Date)
setkey(setDT(y), Date)
xidx = x[y, roll = "nearest", which = TRUE]
yidx = which(abs(x$Date[xidx] - y$Date) <= 600)
xy <- rbind(x, y[yidx, ]) # In this line, I have added the comma after yidx,
# which is the only difference from Arun's solution.
xy <- xy[, lapply(.SD, sum), keyby = Date]
UPDATE
Following Arun's comment, I have revised this solution below. Thank you, Arun.
# xy
# Date V1 V2 V3 V4
# 1: 2004-03-01 00:00:00 1.6 0 0 1.5
# 2: 2004-03-23 14:00:00 1.9 0 0 2.3
# 3: 2004-03-23 14:15:00 1.9 42 0 2.3
# 4: 2004-03-23 14:30:00 2.0 0 0 2.4
# 5: 2004-03-23 14:45:00 2.0 0 0 2.4
# 6: 2004-07-31 02:45:00 1.4 0 0 7.8
# 7: 2004-07-31 03:00:00 1.4 0 0 3.2
# 8: 2004-07-31 03:10:00 0.0 0 0 0.0
# 9: 2004-07-31 03:15:00 1.5 0 0 4.9
# 10: 2004-07-31 03:30:00 1.5 0 0 4.7
# 11: 2004-07-31 04:00:00 1.6 0 0 3.4
# 12: 2004-07-31 04:15:00 2.6 0 0 17.1
# 13: 2004-07-31 04:30:00 2.8 0 0 25.4
# 14: 2004-07-31 04:45:00 3.4 0 0 16.9
# 15: 2004-07-31 05:00:00 3.8 0 0 30.6
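If y holds NA in V1 and V4, as in the revised question, plain sum() returns NA for every group that contains an NA. One possible workaround is sketched below in base R. The helper sum_na is hypothetical (it does not appear in the question or in any answer) and the small data frame d is a hand-made subset of the V1 column. Unlike `%+na%`, it returns 0 rather than NA when a zero is paired with NA.

```r
# Hypothetical helper: ignore NA when summing, but return NA
# when every value in the group is NA.
sum_na <- function(v) if (all(is.na(v))) NA_real_ else sum(v, na.rm = TRUE)

# Three date groups from the sample data (V1 only): a matched pair,
# a y-only row, and another matched pair.
d <- data.frame(
  Date = c(1080029700, 1080029700, 1091221800, 1091224800, 1091224800),
  V1   = c(1.9, NA, NA, 1.6, NA)
)

res <- aggregate(d["V1"], by = list(Date = d$Date), FUN = sum_na)
print(res)
```

Under the same assumption, the helper could presumably replace sum in the data.table call, as in xy[, lapply(.SD, sum_na), keyby = Date].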