I was asked here (R: Applying cumulative sum function and filling data gaps with NA for plotting) to break my question down a bit and post a smaller sample. You can find my sample data here: https://dl.dropboxusercontent.com/u/16277659/inputdata.csv
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1883; 1883; 0
SAMPLE1; 253; 1884; 1883; NA
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1890; 1889; 17
SAMPLE2; 261; 1991; 1991; 0
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40
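For readers who want to reproduce the problem without the download, the rows above can be read straight from text. This is only a sketch: the name dat is an assumption (it matches the name used in the answer), and strip.white = TRUE is there to drop the blanks that follow each semicolon.

```r
# Sample rows copied from the question; fields are separated by "; "
txt <- "NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1883; 1883; 0
SAMPLE1; 253; 1884; 1883; NA
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1890; 1889; 17
SAMPLE2; 261; 1991; 1991; 0
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40"

# sep = ";" splits the fields, strip.white = TRUE removes the padding blanks
dat <- read.table(text = txt, sep = ";", header = TRUE,
                  strip.white = TRUE, stringsAsFactors = FALSE)
str(dat)
```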
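I want to calculate the cumulative sum of the VALUE column and fill the data gaps with NA values (the structure of the data should stay the same, because I need the other columns for further processing).
When the gaps are filled, the new rows should contain NA, as shown for SAMPLE1. Note where the value sits when several NAs are filled in the CUMSUM column: the row with the last NA in VALUE should carry the last CUMSUM value instead of NA (for plotting reasons).
There is one exception: when the period between REFERENCE_YEAR and SURVEY_YEAR is longer than one year, the value should be written into the filled row as well, as for SAMPLE2 in the period 1992 to 1994.
This is only a sample dataset; my real dataset has several more columns and about 40,000 rows. A base R solution would be best. REFERENCE_YEAR and SURVEY_YEAR are equal in the first row of each SAMPLE because of the code I use to write a zero row for each group. The desired output: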
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE; CUMSUM
SAMPLE1; 253; 1883; 1883; 0; 0
SAMPLE1; 253; 1884; 1883; NA; NA
SAMPLE1; 253; 1885; 1884; 12; 12
SAMPLE1; 253; 1886; 1885; NA; NA
SAMPLE1; 253; 1887; 1886; NA; NA
SAMPLE1; 253; 1888; 1887; NA; NA
SAMPLE1; 253; 1889; 1888; NA; 12
SAMPLE1; 253; 1890; 1889; 17; 29
SAMPLE2; 261; 1991; 1991; 0; 0
SAMPLE2; 261; 1992; 1991; -19; -19
SAMPLE2; 261; 1993; 1992; -58; -77
SAMPLE2; 261; 1994; 1992; -58; -77
SAMPLE2; 261; 1995; 1994; -40; -117
Answer 0 (score: 3)
If dat is the dataset, one way would be:
Create a new dataset by expanding SURVEY_YEAR to the full range of years within each NAME
dat1 <- setNames(stack(
with(dat, tapply(SURVEY_YEAR, NAME,
FUN=function(x) seq(min(x), max(x)))))[2:1], c("NAME", "SURVEY_YEAR"))
Merge the new dataset dat1 with the old dat
datN <- merge(dat1, dat, all=TRUE)
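The expand-and-merge idea can also be written with split() and lapply(), which some readers may find easier to follow than the stack()/tapply() one-liner. This is a sketch, not the answer's code: the small dat below is a hypothetical miniature of the question's data, and only the column names NAME, SURVEY_YEAR and VALUE are taken from the question.

```r
# Hypothetical miniature of the question's data (two groups with gaps)
dat <- data.frame(
  NAME = c("SAMPLE1", "SAMPLE1", "SAMPLE1", "SAMPLE2", "SAMPLE2"),
  SURVEY_YEAR = c(1883, 1885, 1888, 1991, 1993),
  VALUE = c(0, 12, 17, 0, -19),
  stringsAsFactors = FALSE
)

# One entry per NAME: every year between that group's first and last survey
years <- lapply(split(dat$SURVEY_YEAR, dat$NAME),
                function(y) seq(min(y), max(y)))
dat1 <- data.frame(NAME = rep(names(years), lengths(years)),
                   SURVEY_YEAR = unlist(years, use.names = FALSE),
                   stringsAsFactors = FALSE)

# Full outer join on the shared columns NAME and SURVEY_YEAR;
# years without a survey get NA in VALUE
datN <- merge(dat1, dat, all = TRUE)
datN
```

With all = TRUE the result is sorted by NAME and then SURVEY_YEAR, which the later row-offset steps rely on.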
Replace the missing values in REFERENCE_YEAR with the SURVEY_YEAR of the previous row
datN$REFERENCE_YEAR[is.na(datN$REFERENCE_YEAR)] <- datN$SURVEY_YEAR[which(is.na(datN$REFERENCE_YEAR))-1]
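To see what the row offset does, here is the same replacement on two plain vectors. The vectors survey and reference are hypothetical stand-ins for the two columns. The sketch assumes the rows are sorted by year and that the first row is never missing, which holds after the merge because inserted years always lie strictly between a group's first and last survey.

```r
# Stand-ins for datN$SURVEY_YEAR and datN$REFERENCE_YEAR of one group
survey    <- c(1883, 1884, 1885, 1886, 1887)
reference <- c(1883, 1883, 1884, NA, NA)

# Positions of the inserted rows (no reference year yet)
miss <- which(is.na(reference))

# Each inserted row refers back to the survey year of the row before it
reference[miss] <- survey[miss - 1]
reference
# 1883 1883 1884 1885 1886
```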
Fill in ID using na.locf from zoo
library(zoo)
datN$ID <- na.locf(datN$ID)
datN$CUMSUM <- NA
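Since the question asks for base R, the ID fill above could be done without zoo. The helper below is a minimal sketch of "last observation carried forward"; locf is a hypothetical name, not a base R function. Like a whole-column na.locf call, it does not know about groups, so it relies on the first row of every NAME having an ID.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only)
locf <- function(x) {
  seen <- cumsum(!is.na(x))      # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]  # 0 seen -> NA, otherwise the latest value
}

# Stand-in for datN$ID after the merge
id <- c(253, NA, 253, NA, NA, 261, 261, NA)
locf(id)
# 253 253 253 253 253 261 261 261
```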
Compute the cumsum of VALUE for the non-NA rows
datN$CUMSUM[!is.na(datN$VALUE)] <- unlist(with(datN, tapply(VALUE, NAME, FUN=function(x) cumsum(x[!is.na(x)]))))
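The unlist(tapply(...)) assignment above lines up with the rows only because datN is sorted by NAME after the merge. A variant with ave() keeps each result at its original position regardless of row order. This is a sketch on hypothetical vectors grp and value, and cumsum_skip_na is an illustrative name.

```r
# Stand-ins for datN$NAME and datN$VALUE
grp   <- c("A", "A", "A", "A", "B", "B", "B")
value <- c(0, NA, 12, 17, 0, NA, -19)

# Cumulative sum over the non-NA entries; NA positions stay NA
cumsum_skip_na <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)
  out[ok] <- cumsum(x[ok])
  out
}

# ave() applies the function per group and returns a vector in row order
res <- ave(value, grp, FUN = cumsum_skip_na)
res
# 0 NA 12 29 0 NA -19
```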
Find the rows where the difference between SURVEY_YEAR and REFERENCE_YEAR is greater than 1
indx <- with(datN, SURVEY_YEAR-REFERENCE_YEAR)>1
Replace the VALUE and CUMSUM entries in the row before each of those rows with the values from the following row
datN[,c("VALUE", "CUMSUM")] <- lapply(datN[,c("VALUE", "CUMSUM")], function(x) {x[which(indx)-1] <- x[indx]; x})
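Run on the SAMPLE2 rows alone, this step looks as follows. The data frame d is a hand-built copy of those rows as they stand after the merge, fill and cumsum steps, so treat it as an illustration. As written, the step fills only the single row directly before each flagged row, so a gap of three or more years would leave the earlier inserted rows as NA.

```r
# SAMPLE2 after the merge and cumsum steps (values taken from the question)
d <- data.frame(
  SURVEY_YEAR    = 1991:1995,
  REFERENCE_YEAR = c(1991, 1991, 1992, 1992, 1994),
  VALUE          = c(0, -19, NA, -58, -40),
  CUMSUM         = c(0, -19, NA, -77, -117)
)

# Rows whose reference year lies more than one year before the survey year
indx <- (d$SURVEY_YEAR - d$REFERENCE_YEAR) > 1
prev <- which(indx) - 1   # the inserted row directly before each such row

# Copy VALUE and CUMSUM from the flagged row into the row before it
d[prev, c("VALUE", "CUMSUM")] <- d[indx, c("VALUE", "CUMSUM")]
d
```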
Change some of the NA values in CUMSUM to the previous non-NA value
datN$CUMSUM <- with(datN, ave(CUMSUM, NAME, FUN = function(x) {
    x1 <- is.na(x)
    rl <- rle(x1)
    indx <- which(!(!(abs(x1 - 1) * (cumsum(x1) != 0) * sequence(rl$lengths)))) - 1
    indx1 <- indx[indx - c(1, indx[-length(indx)]) > 1]
    indxn <- unlist(lapply(indx1, function(y) {
        indx2 <- which(!is.na(x))
        tail(indx2[which(indx2 < y)], 1)
    }))
    x[indx1] <- x[indxn]
    x
}))
datN
#      NAME SURVEY_YEAR  ID REFERENCE_YEAR VALUE CUMSUM
#1  SAMPLE1        1883 253           1883     0      0
#2  SAMPLE1        1884 253           1883    NA     NA
#3  SAMPLE1        1885 253           1884    12     12
#4  SAMPLE1        1886 253           1885    NA     NA
#5  SAMPLE1        1887 253           1886    NA     NA
#6  SAMPLE1        1888 253           1887    NA     NA
#7  SAMPLE1        1889 253           1888    NA     12
#8  SAMPLE1        1890 253           1889    17     29
#9  SAMPLE2        1991 261           1991     0      0
#10 SAMPLE2        1992 261           1991   -19    -19
#11 SAMPLE2        1993 261           1992   -58    -77
#12 SAMPLE2        1994 261           1992   -58    -77
#13 SAMPLE2        1995 261           1994   -40   -117
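The last fill step can be stated more directly if one reads the desired output as the rule "in every run of two or more NAs that sits between two values, the final NA takes the value before the run". That rule is an assumption drawn from the sample output, and the sketch below is not equivalent to the index arithmetic above in every case (for example, it never fills a single isolated NA). fill_last_na is a hypothetical helper name.

```r
# Hypothetical helper: in each NA run of length >= 2 that has a value on
# both sides, set the last NA to the value just before the run
fill_last_na <- function(x) {
  r <- rle(is.na(x))
  ends   <- cumsum(r$lengths)        # last index of each run
  starts <- ends - r$lengths + 1     # first index of each run
  pick <- r$values & r$lengths >= 2 & starts > 1 & ends < length(x)
  x[ends[pick]] <- x[starts[pick] - 1]
  x
}

# CUMSUM of SAMPLE1 before the fill (values taken from the question)
cs <- c(0, NA, 12, NA, NA, NA, NA, 29)
fill_last_na(cs)
# 0 NA 12 NA NA NA 12 29
```

On the full data it would be applied per group, for example with ave(datN$CUMSUM, datN$NAME, FUN = fill_last_na).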