Following on from my previous question, imagine I have a dataset:
Date rain code
2009-04-01 0.0 0
2009-04-02 0.0 0
2009-04-03 0.0 0
2009-04-04 0.7 1
2009-04-05 54.2 1
2009-04-06 0.0 0
2009-04-07 5.0 1
2009-04-08 9.0 0
2009-04-09 0.0 0
2009-04-10 0.0 0
2009-04-11 0.0 0
2009-04-12 5.3 1
2009-04-13 10.1 1
2009-04-14 6.0 1
2009-04-15 8.7 1
2009-04-16 0.0 0
2009-04-17 0.0 0
2009-04-18 0.0 0
2009-04-19 2.0 0
2009-04-20 3.0 0
2009-04-21 0.0 0
2009-04-22 0.0 0
2009-04-23 0.0 0
2009-04-24 0.0 0
2009-04-25 4.3 1
2009-04-26 42.2 1
2009-04-27 45.6 1
2009-04-28 12.6 1
2009-04-29 6.2 1
2009-04-30 1.0 1
DT = structure(list(Date = structure(c(14335, 14336, 14337, 14338,
14339, 14340, 14341, 14342, 14343, 14344, 14345, 14346, 14347,
14348, 14349, 14350, 14351, 14352, 14353, 14354, 14355, 14356,
14357, 14358, 14359, 14360, 14361, 14362, 14363, 14364), class = "Date"),
rain = c(0, 0, 0, 0.7, 54.2, 0, 5, 9, 0, 0, 0, 5.3, 10.1,
6, 8.7, 0, 0, 0, 2, 3, 0, 0, 0, 0, 4.3, 42.2, 45.6, 12.6,
6.2, 1), code = c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L,
0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L)), .Names = c("Date", "rain", "code"), row.names = c(NA,
-30L), class = "data.frame")
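When code is 1, I am trying to collapse the dataset to get the sum of the consecutive rain values. I need the sum to run through the day after the event, inclusive. For example, I want the rainfall totals from 2009-04-04 to 2009-04-06 and from 2009-04-07 to 2009-04-08, respectively. So I am looking for a way to identify where code equals 1 and to include the following day. The final product should look like: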
Date rain code
2009-04-01 0.0 0
2009-04-02 0.0 0
2009-04-03 0.0 0
2009-04-06 54.9 1
2009-04-08 14.0 1
2009-04-09 0.0 0
2009-04-10 0.0 0
2009-04-11 0.0 0
2009-04-16 30.1 1
2009-04-17 0.0 0
2009-04-18 0.0 0
2009-04-19 2.0 0
2009-04-20 3.0 0
2009-04-21 0.0 0
2009-04-22 0.0 0
2009-04-23 0.0 0
2009-04-24 0.0 0
2009-04-30 111.9 1 (if last entry of data frame)
Any help with the above would be greatly appreciated.
Answer 0 (score: 3)
Here is one way:
library(data.table)
setDT(DT)
res = DT[, .(
Date = Date[.N],
rain = sum(rain),
code = code[1L]
), by=.(g = cumsum(shift(!code, fill=FALSE)))]
res[, g := NULL]
Date rain code
1: 2009-04-01 0.0 0
2: 2009-04-02 0.0 0
3: 2009-04-03 0.0 0
4: 2009-04-06 54.9 1
5: 2009-04-08 14.0 1
6: 2009-04-09 0.0 0
7: 2009-04-10 0.0 0
8: 2009-04-11 0.0 0
9: 2009-04-16 30.1 1
10: 2009-04-17 0.0 0
11: 2009-04-18 0.0 0
12: 2009-04-19 2.0 0
13: 2009-04-20 3.0 0
14: 2009-04-21 0.0 0
15: 2009-04-22 0.0 0
16: 2009-04-23 0.0 0
17: 2009-04-24 0.0 0
18: 2009-04-30 111.9 1
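How it works:
shift takes the value of !code from the previous row.
cumsum takes the cumulative sum of those logical values, with TRUE / FALSE treated as 1 / 0.
.N is the number of rows in the by= group.
The general syntax is DT[, j, by], where j is computed on each subset of the data defined by by.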
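To make the grouping step concrete, here is a minimal sketch that builds the group index one piece at a time. It uses only the first nine code values from the question; the intermediate names not_code, prev_dry and g are illustrative and do not appear in the answer's code.

```r
library(data.table)

# First nine values of code from the question (2009-04-01 to 2009-04-09)
code <- c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)

not_code <- !code                          # TRUE on rows where code is 0
prev_dry <- shift(not_code, fill = FALSE)  # the same flag, taken from the previous row
g        <- cumsum(prev_dry)               # the counter steps up on the row after each code-0 row

# Rows 4 to 6 share g = 3 and rows 7 to 8 share g = 4, so each rainy run
# is grouped together with the day that follows it.
data.table(code, not_code, prev_dry, g)
```

Every row that follows a code-0 row starts a new group, which is why the dry days stay as single-row groups while a rainy run and its following day collapse into one.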
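Answer 1 (score: 0)
If you want to use base R, you can always use diff to work out when the rain starts and stops (df below is the data frame from the question).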
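start <- which(diff(df$code) == 1) + 1   # first row of each rainy run
# Row after each run ends; appending nrow(df) assumes the data ends in a rainy run, as it does here
end <- c(which(diff(df$code) == -1) + 1, nrow(df))
l <- mapply(":", start, end, SIMPLIFY = FALSE)   # one vector of row indices per event, always a list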
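As a quick check of what diff reports, here is a minimal sketch on the first nine code values from the question. The names d and stop_day are illustrative only.

```r
# First nine values of code from the question (2009-04-01 to 2009-04-09)
code <- c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)

d <- diff(code)                  # 0 0 1 0 -1 1 -1 0
start    <- which(d == 1) + 1    # rows 4 and 7: code switches from 0 to 1
stop_day <- which(d == -1) + 1   # rows 6 and 8: first code-0 row after each run
```

The + 1 is needed because diff returns one element fewer than its input, so position i of the result compares rows i and i + 1.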
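Collapsing the data then just means discarding every row of an event except the stop day and replacing that last day's value with the following, which puts the cumulative rainfall on the day the rain stops.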
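lapply(l, function(x) {
  # x holds the row indices of one event; df is only modified inside this function
  df[x, ][length(x), "rain"] <- sum(df[x, "rain"])   # write the event total onto its last row
  df[x, ][length(x), ]                               # return only that last row
})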
[[1]]
Date rain code
6 2009-04-06 54.9 0
[[2]]
Date rain code
8 2009-04-08 14 0
[[3]]
Date rain code
16 2009-04-16 30.1 0
[[4]]
Date rain code
30 2009-04-30 111.9 1
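The list above contains only the event rows. One possible way to get a single data frame in the shape the question asks for is sketched below. It is an assumption-laden variation, not part of the original answer: the names events, event_rows, keep and out are hypothetical, code is set to 1 on the collapsed rows to match the desired output in the question, and the data are assumed to start on a code-0 row and end in a rainy run.

```r
# Rebuild the question's data (same values as the dput output)
df <- data.frame(
  Date = seq(as.Date("2009-04-01"), by = "day", length.out = 30),
  rain = c(0, 0, 0, 0.7, 54.2, 0, 5, 9, 0, 0, 0, 5.3, 10.1, 6, 8.7,
           0, 0, 0, 2, 3, 0, 0, 0, 0, 4.3, 42.2, 45.6, 12.6, 6.2, 1),
  code = c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
           0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L)
)

start <- which(diff(df$code) == 1) + 1
end   <- c(which(diff(df$code) == -1) + 1, nrow(df))

# One summary row per event: last date, total rain, code forced to 1
events <- do.call(rbind, Map(function(s, e) {
  data.frame(Date = df$Date[e], rain = sum(df$rain[s:e]), code = 1L)
}, start, end))

# Keep the rows that belong to no event, add the summaries, restore date order
event_rows <- unlist(Map(":", start, end))
keep <- setdiff(seq_len(nrow(df)), event_rows)
out  <- rbind(df[keep, ], events)
out  <- out[order(out$Date), ]
rownames(out) <- NULL
out
```

With the question's data this gives 18 rows, with the four event totals on 2009-04-06, 2009-04-08, 2009-04-16 and 2009-04-30.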