R: Filling data gaps with NA and applying the cumsum function

Time: 2014-08-13 01:39:35

Tags: r dataframe na cumsum

I was asked here (R: Applying cumulative sum function and filling data gaps with NA for plotting) to break my question down a bit and post a smaller sample. You can find my sample data here: https://dl.dropboxusercontent.com/u/16277659/inputdata.csv

NAME;       ID;     SURVEY_YEAR;    REFERENCE_YEAR; VALUE
SAMPLE1;    253;    1883;           1883;           0
SAMPLE1;    253;    1884;           1883;           NA
SAMPLE1;    253;    1885;           1884;           12
SAMPLE1;    253;    1890;           1889;           17
SAMPLE2;    261;    1991;           1991;           0
SAMPLE2;    261;    1992;           1991;           -19
SAMPLE2;    261;    1994;           1992;           -58
SAMPLE2;    261;    1995;           1994;           -40
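
For anyone who wants to reproduce this without the download, the sample above can be read inline. A minimal sketch; the object name `dat` is only an assumption (it happens to be the name the answer below uses):

```r
# Sample rows copied from the table above; `dat` is an assumed name.
txt <- "NAME;       ID;     SURVEY_YEAR;    REFERENCE_YEAR; VALUE
SAMPLE1;    253;    1883;           1883;           0
SAMPLE1;    253;    1884;           1883;           NA
SAMPLE1;    253;    1885;           1884;           12
SAMPLE1;    253;    1890;           1889;           17
SAMPLE2;    261;    1991;           1991;           0
SAMPLE2;    261;    1992;           1991;           -19
SAMPLE2;    261;    1994;           1992;           -58
SAMPLE2;    261;    1995;           1994;           -40"

# sep = ";" splits the fields, strip.white = TRUE removes the padding
dat <- read.table(text = txt, header = TRUE, sep = ";",
                  strip.white = TRUE, stringsAsFactors = FALSE)
str(dat)
```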

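I want to calculate the cumulative sum of the VALUE column and fill the data gaps with NA values (the structure of the data should stay the same, because I need the other columns for further processing).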

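When the data gaps are filled, NAs should be inserted as shown for SAMPLE1. Note where the value goes when a gap is filled with several NAs: in the CUMSUM column, the last NA row of the gap (VALUE stays NA there) should get the last CUMSUM value, for plotting reasons.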

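An exception occurs when the period between REFERENCE_YEAR and SURVEY_YEAR is longer than one year: then the value should be written into the columns, as for SAMPLE2 in the period 1992 to 1994.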

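This is only a sample data set; my real data set has several columns and about 40000 rows. A base R solution would be best. REFERENCE_YEAR and SURVEY_YEAR being equal in the first row of each SAMPLE is a result of the code I used to write a zero row for each group.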

NAME;       ID;     SURVEY_YEAR;    REFERENCE_YEAR; VALUE;  CUMSUM
SAMPLE1;    253;    1883;           1883;           0;      0
SAMPLE1;    253;    1884;           1883;           NA;     NA
SAMPLE1;    253;    1885;           1884;           12;     12
SAMPLE1;    253;    1886;           1885;           NA;     NA
SAMPLE1;    253;    1887;           1886;           NA;     NA
SAMPLE1;    253;    1888;           1887;           NA;     NA
SAMPLE1;    253;    1889;           1888;           NA;     12
SAMPLE1;    253;    1890;           1889;           17;     29
SAMPLE2;    261;    1991;           1991;           0;      0
SAMPLE2;    261;    1992;           1991;           -19;    -19
SAMPLE2;    261;    1993;           1992;           -58;    -77
SAMPLE2;    261;    1994;           1992;           -58;    -77
SAMPLE2;    261;    1995;           1994;           -40;    -117

----------------------------------------------- ---------------------------------------------


1 answer:

Answer 0 (score: 3)

If dat is the dataset, one way is:

Create a new dataset by expanding between the minimum and maximum SURVEY_YEAR for each NAME:

 dat1 <- setNames(stack(
             with(dat, tapply(SURVEY_YEAR, NAME, 
                FUN=function(x) seq(min(x), max(x)))))[2:1], c("NAME", "SURVEY_YEAR"))
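
The same expansion can also be written without stack, which may be easier to follow. This is only a sketch on made-up data; `expand_years` is a hypothetical helper name, not part of the answer:

```r
# Toy input (assumed): two groups with gaps in SURVEY_YEAR
dat <- data.frame(NAME = c("A", "A", "B", "B"),
                  SURVEY_YEAR = c(1885, 1888, 1991, 1993),
                  stringsAsFactors = FALSE)

# Hypothetical helper: one row per NAME and year between min and max
expand_years <- function(d) {
  parts <- lapply(split(d$SURVEY_YEAR, d$NAME),
                  function(x) seq(min(x), max(x)))
  data.frame(NAME = rep(names(parts), lengths(parts)),
             SURVEY_YEAR = unlist(parts, use.names = FALSE),
             stringsAsFactors = FALSE)
}

dat1 <- expand_years(dat)
dat1
```

Group A expands to 1885 through 1888 and group B to 1991 through 1993, so dat1 has seven rows.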

Merge the new dataset dat1 with the old dat:

 datN <- merge(dat1, dat, all=TRUE)
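
To see what all=TRUE contributes, here is a small sketch on made-up data (the values are assumptions for illustration). Rows that exist only in the expanded frame get NA in the columns that come from dat:

```r
# Toy data (assumed): dat has two surveyed years, dat1 has every year
dat  <- data.frame(NAME = "A", SURVEY_YEAR = c(1885, 1888),
                   VALUE = c(12, 17), stringsAsFactors = FALSE)
dat1 <- data.frame(NAME = "A", SURVEY_YEAR = c(1885, 1886, 1887, 1888),
                   stringsAsFactors = FALSE)

# Naming the key columns explicitly; merge() would pick the same ones by default
datN <- merge(dat1, dat, by = c("NAME", "SURVEY_YEAR"), all = TRUE)
datN
```

The years 1886 and 1887 have no match in dat, so their VALUE is NA after the merge.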

Replace the missing values in REFERENCE_YEAR with the SURVEY_YEAR of the previous row:

 datN$REFERENCE_YEAR[is.na(datN$REFERENCE_YEAR)] <- datN$SURVEY_YEAR[which(is.na(datN$REFERENCE_YEAR))-1]
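
The `which(...) - 1` idiom simply points at the row above each missing one. A small sketch on made-up values, with the index stored in an assumed variable `miss`:

```r
# Toy data (assumed): two inserted rows with missing REFERENCE_YEAR
datN <- data.frame(SURVEY_YEAR    = c(1885, 1886, 1887, 1888),
                   REFERENCE_YEAR = c(1884, NA, NA, 1887))

miss <- which(is.na(datN$REFERENCE_YEAR))   # rows 2 and 3
# take SURVEY_YEAR from the row directly above each missing row
datN$REFERENCE_YEAR[miss] <- datN$SURVEY_YEAR[miss - 1]
datN
```

One thing to keep in mind: if the very first row were missing, `miss - 1` would contain 0, which R drops silently, and the lengths would no longer match. In this workflow the first row of each group comes from the original data, so that case should not occur.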

Fill the NAs in ID using na.locf from zoo:

 library(zoo)
 datN$ID <- na.locf(datN$ID)
 datN$CUMSUM <- NA
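
Since the question asks for base R, the carry-forward can also be done without zoo. A minimal sketch; `locf` is a hypothetical helper, not a function from any package:

```r
# Hypothetical base-R "last observation carried forward"
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # number of non-NA values seen so far
  # idx == 0 (leading NAs) maps to the NA placed in front
  c(NA, x[!is.na(x)])[idx + 1]
}

id <- c(253, NA, NA, 261, NA)
locf(id)
```

Unlike na.locf with its default na.rm = TRUE, this keeps leading NAs in place instead of dropping them, so the length never changes.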

Take the cumsum of VALUE over the non-NA rows:

 datN$CUMSUM[!is.na(datN$VALUE)] <-  unlist(with(datN, tapply(VALUE, NAME, FUN=function(x) cumsum(x[!is.na(x)]))))
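
As far as I can tell, the unlist(tapply(...)) form depends on the rows already being grouped by NAME in level order, which merge takes care of here. A variant with ave keeps each result aligned with its own row regardless of order. This is a sketch on made-up data, not the answer's code:

```r
# Toy data (assumed): two groups with NAs inside the first one
datN <- data.frame(NAME  = c("A", "A", "A", "A", "A", "B", "B", "B"),
                   VALUE = c(0, NA, 12, NA, 17, 0, -19, -58),
                   stringsAsFactors = FALSE)

# cumulative sum per NAME that skips NA rows and leaves them as NA
datN$CUMSUM <- ave(datN$VALUE, datN$NAME, FUN = function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)
  out[ok] <- cumsum(x[ok])
  out
})
datN
```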

Find the rows where the difference between SURVEY_YEAR and REFERENCE_YEAR is greater than 1:

 indx <- with(datN, SURVEY_YEAR-REFERENCE_YEAR)>1

In the VALUE and CUMSUM columns, replace the row just before each such row with the succeeding row's values:

 datN[,c("VALUE", "CUMSUM")] <- lapply(datN[,c("VALUE", "CUMSUM")], function(x) {x[which(indx)-1] <- x[indx]; x})
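
Here is the same back-fill on a three-row toy frame, so the effect is easy to check by eye. The data and the variable name `rows` are assumptions for illustration:

```r
# Toy data (assumed): 1994 refers back to 1992, so 1993 was an inserted row
datN <- data.frame(SURVEY_YEAR    = c(1992, 1993, 1994),
                   REFERENCE_YEAR = c(1991, 1992, 1992),
                   VALUE          = c(-19, NA, -58),
                   CUMSUM         = c(-19, NA, -77))

indx <- with(datN, SURVEY_YEAR - REFERENCE_YEAR) > 1
rows <- which(indx)               # here: row 3

# copy each flagged row's values into the row directly above it
for (col in c("VALUE", "CUMSUM")) {
  datN[[col]][rows - 1] <- datN[[col]][rows]
}
datN
```

Note that only the single row directly above is filled. If a survey covered three or more years, the earlier inserted rows would stay NA and the step would need to be extended.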

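Change some of the NA values in CUMSUM to the previous non-NA value:

datN$CUMSUM <- with(datN, ave(CUMSUM, NAME, FUN = function(x) {
    x1 <- is.na(x)   # TRUE where CUMSUM is missing
    rl <- rle(x1)    # runs of NA / non-NA
    # positions directly before each non-NA value that comes after the first NA
    indx <- which(!(!(abs(x1 - 1) * (cumsum(x1) != 0) * sequence(rl$lengths)))) - 1
    # keep positions more than one step past the previous candidate (or past row 1)
    indx1 <- indx[indx - c(1, indx[-length(indx)]) > 1]
    # for each kept position, the index of the last non-NA value before it
    indxn <- unlist(lapply(indx1, function(y) {
        indx2 <- which(!is.na(x))
        tail(indx2[which(indx2 < y)], 1)
    }))
    x[indx1] <- x[indxn]   # copy that value into the selected positions
    x
}))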

datN
#      NAME SURVEY_YEAR  ID REFERENCE_YEAR VALUE CUMSUM
#1  SAMPLE1        1883 253           1883     0      0
#2  SAMPLE1        1884 253           1883    NA     NA
#3  SAMPLE1        1885 253           1884    12     12
#4  SAMPLE1        1886 253           1885    NA     NA
#5  SAMPLE1        1887 253           1886    NA     NA
#6  SAMPLE1        1888 253           1887    NA     NA
#7  SAMPLE1        1889 253           1888    NA     12
#8  SAMPLE1        1890 253           1889    17     29
#9  SAMPLE2        1991 261           1991     0      0
#10 SAMPLE2        1992 261           1991   -19    -19
#11 SAMPLE2        1993 261           1992   -58    -77
#12 SAMPLE2        1994 261           1992   -58    -77
#13 SAMPLE2        1995 261           1994   -40   -117
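
The last step (changing part of the NA values in CUMSUM) is the hardest one to read. Below is a simplified sketch of the rule as I read it from the expected output: in a run of more than one NA that has a value on both sides, the last NA gets the previous non-NA value. `fill_last_na` is a hypothetical helper and is not guaranteed to behave like the answer's function on every input:

```r
# Hypothetical helper: fill only the last NA of each multi-NA gap
fill_last_na <- function(x) {
  r <- rle(is.na(x))
  ends   <- cumsum(r$lengths)         # last index of each run
  starts <- ends - r$lengths + 1      # first index of each run
  # NA runs longer than one element, with a value before and after
  pick <- r$values & r$lengths > 1 & starts > 1 & ends < length(x)
  x[ends[pick]] <- x[starts[pick] - 1]
  x
}

fill_last_na(c(0, NA, 12, NA, NA, NA, NA, 29))
```

Applied per group, for example with ave(CUMSUM, NAME, FUN = fill_last_na), this reproduces the SAMPLE1 pattern shown above: the single NA in 1884 stays NA, and only 1889 receives 12.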