I am running code in R; an example with a small dataset is shown below:
library(plyr)
Ex<-structure(list(X1 = c(-36.8598, -37.1726, -36.4343, -36.8644,
-37.0599, -34.8818, -31.9907, -37.8304,
-34.3367, -31.2984, -33.5731),
X2 = c(64.26, 63.085, 66.36, 61.08, 61.57, 65.04, 72.69, 63.83,
67.555, 76.06, 68.61),
Y1 = c(493.81544, 493.81544, 494.54173,
494.61364, 494.61381, 494.38717, 494.64122, 493.73265, 494.04246,
494.92989, 494.98384),
Y2 = c(489.704166, 489.704166, 490.710962,
490.653212, 490.710612, 489.822928,
488.160904, 489.747776, 490.600579,
488.946738, 490.398958),
Y3 = c(19L, 19L, 19L, 23L, 30L,43L,43L,2L, 58L, 47L, 61L),
date = c("2013-06-01","2013-06-02","2013-06-03","2013-06-04",
"2013-06-05","2013-06-06","2013-06-07","2013-06-08",
"2013-06-09","2013-06-10","2013-06-11")),
.Names = c("X1", "X2", "Y1", "Y2", "Y3", "date"),
row.names = c(NA, 11L), class = "data.frame")
Ex <- arrange(Ex, Y3)
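# Flag rows whose Y3 value already appeared earlier (Dup) or appears again later (Dup_rev)
Ex$Dup <- as.numeric(duplicated(Ex$Y3))
Ex$Dup_rev <- as.numeric(duplicated(Ex$Y3, fromLast = TRUE))
## Testing If Else
Ex$X5 <- 0
for (i in seq_along(Ex$Y3))
{
  if (Ex$Dup[i] == 0 && Ex$Dup_rev[i] == 0)
  {
    # Y3 value occurs only once: X5 is just Y2
    Ex$X5[i] <- Ex$Y2[i]
  } else if (Ex$Dup[i] == 0)
  {
    # First occurrence of a repeated Y3 value: start a new running sum
    Ex$X5[i] <- Ex$Y2[i]
  } else
  {
    # Later occurrence: add Y2 to the previous row's running sum
    Ex$X5[i] <- Ex$Y2[i] + Ex$X5[i - 1]
  }
}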
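The goal is to create a column X5 that holds the cumulative sum of Y2 over rows sharing the same Y3 value: when a Y3 value appears in the dataset for the first time, X5 is just Y2; otherwise X5 is Y2 plus the previous row's X5. My real data is large (about 110k rows), so this code takes a long time to execute. Is there a simpler way to do the same thing?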
X1 X2 Y1 Y2 Y3 date Dup Dup_rev X5
1 -37.8304 63.830 493.7326 489.7478 2 2013-06-08 0 0 489.7478
2 -36.8598 64.260 493.8154 489.7042 19 2013-06-01 0 1 489.7042
3 -37.1726 63.085 493.8154 489.7042 19 2013-06-02 1 1 1469.1125
4 -36.4343 66.360 494.5417 490.7110 19 2013-06-03 1 0 1470.1193
5 -36.8644 61.080 494.6136 490.6532 23 2013-06-04 0 0 490.6532
Answer 0 (score: 2)
Here is a solution using data.table, which is very fast for this kind of analysis when you split by a "factor" (Y3 in this case):
library(data.table)
DT <- data.table(Ex)[, X5:=cumsum(Y2), by=Y3]
DT
# X1 X2 Y1 Y2 Y3 date X5
# 1: -37.8304 63.830 493.7326 489.7478 2 2013-06-08 489.7478
# 2: -36.8598 64.260 493.8154 489.7042 19 2013-06-01 489.7042
# 3: -37.1726 63.085 493.8154 489.7042 19 2013-06-02 979.4083
# 4: -36.4343 66.360 494.5417 490.7110 19 2013-06-03 1470.1193
# 5: -36.8644 61.080 494.6136 490.6532 23 2013-06-04 490.6532
# 6: -37.0599 61.570 494.6138 490.7106 30 2013-06-05 490.7106
# 7: -34.8818 65.040 494.3872 489.8229 43 2013-06-06 489.8229
# 8: -31.9907 72.690 494.6412 488.1609 43 2013-06-07 977.9838
# 9: -31.2984 76.060 494.9299 488.9467 47 2013-06-10 488.9467
# 10: -34.3367 67.555 494.0425 490.6006 58 2013-06-09 490.6006
# 11: -33.5731 68.610 494.9838 490.3990 61 2013-06-11 490.3990
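Note that, like Jake, I don't understand how you got 1469 in row 3 rather than 979.4083. Also, I just ran your code and got the same answer as mine, so I'm guessing there is a typo in your sample results, or has the data changed?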
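The disputed value is easy to check by hand. As a small sketch, taking only the three Y2 values for Y3 = 19 from the sample data (the name `y2_group19` is just an illustrative choice), the running sum for the second row of that group comes out to about 979.4083, not 1469.1125:

```r
# Y2 values of the three rows where Y3 == 19, copied from the sample data
y2_group19 <- c(489.704166, 489.704166, 490.710962)

# Running total within the group
running <- cumsum(y2_group19)

# Rounded to four decimals: 489.7042 979.4083 1470.1193
print(round(running, 4))
```

The last value, 1470.1193, matches the question's sample output, which suggests only the middle value is off.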
Answer 1 (score: 1)
Here is a solution with dplyr. dplyr is the next iteration of plyr and is very fast.
library(dplyr)
# Uses %>%; the original %.% chaining operator was removed from later dplyr releases
Ex %>% group_by(Y3) %>% mutate(X5 = cumsum(Y2))
#> Source: local data frame [11 x 7]
#> Groups: Y3
#>
#> X1 X2 Y1 Y2 Y3 date X5
#> 1 -36.8598 64.260 493.8154 489.7042 19 2013-06-01 489.7042
#> 2 -37.1726 63.085 493.8154 489.7042 19 2013-06-02 979.4083
#> 3 -36.4343 66.360 494.5417 490.7110 19 2013-06-03 1470.1193
#> 4 -36.8644 61.080 494.6136 490.6532 23 2013-06-04 490.6532
#> 5 -37.0599 61.570 494.6138 490.7106 30 2013-06-05 490.7106
#> 6 -34.8818 65.040 494.3872 489.8229 43 2013-06-06 489.8229
#> 7 -31.9907 72.690 494.6412 488.1609 43 2013-06-07 977.9838
#> 8 -37.8304 63.830 493.7326 489.7478 2 2013-06-08 489.7478
#> 9 -34.3367 67.555 494.0425 490.6006 58 2013-06-09 490.6006
#> 10 -31.2984 76.060 494.9299 488.9467 47 2013-06-10 488.9467
#> 11 -33.5731 68.610 494.9838 490.3990 61 2013-06-11 490.3990
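For comparison, the same grouped cumulative sum can probably be done in base R with no extra packages, using `ave()`. This is only a minimal sketch: `Ex_small` is a hypothetical stand-in holding the first five sorted rows of the question's data (Y3 and Y2 only), and I have not timed it on 110k rows.

```r
# Hypothetical stand-in for the question's data: only the two columns that matter here
Ex_small <- data.frame(
  Y3 = c(2L, 19L, 19L, 19L, 23L),
  Y2 = c(489.747776, 489.704166, 489.704166, 490.710962, 490.653212)
)

# ave() applies FUN within each Y3 group and returns the results in the original row order
Ex_small$X5 <- ave(Ex_small$Y2, Ex_small$Y3, FUN = cumsum)
print(Ex_small)
```

Because `ave()` groups by value rather than by adjacency, it does not need the data to be sorted by Y3 first, which is the same behaviour as the grouped `cumsum` calls above.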