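I am trying to aggregate some data: for certain variables I want the proportion of rows taking each value, with each proportion in its own column, as shown below: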
library(data.table)
testDT <- data.table(z=sample(1:5, 2500000, replace=TRUE), a=sample(1:20, 2500000, replace=TRUE), b=sample(1:30, 2500000, replace=TRUE), c=sample(1:10, 2500000, replace=TRUE))
setkey(testDT, z)
testDT.AG=testDT[, list(
a_Mean=mean(as.numeric(a), na.rm = TRUE),
a_1_prop=length(which(a==1))/length(which(a>0)),
a_2_prop=length(which(a==2))/length(which(a>0)),
a_3_prop=length(which(a==3))/length(which(a>0)),
a_4_prop=length(which(a==4))/length(which(a>0)),
a_5_prop=length(which(a==5))/length(which(a>0)),
a_6_prop=length(which(a==6))/length(which(a>0)),
a_7_prop=length(which(a==7))/length(which(a>0)),
a_8_prop=length(which(a==8))/length(which(a>0)),
a_9_prop=length(which(a==9))/length(which(a>0)),
a_10_prop=length(which(a==10))/length(which(a>0))
), by=list(z)]
I would like to build this list with a loop like the following:
testDT.AG=testDT[, list(
a_Mean=mean(as.numeric(a), na.rm = TRUE),
for (i in c(1:10))
{
assign(paste("a_", i, "_prop"), length(which(a==i))/length(which(a>0))),
}
), by=list(z)]
But this does not work...
Is there any way to build such a list of expressions in a loop?
Thanks in advance!
Answer 0 (score: 1)
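I made your example somewhat smaller (fewer rows, and only the first three proportions), but you should be able to scale it back up without much trouble: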
testDT <- data.table(z=sample(1:5, 2500, replace=TRUE), a=sample(1:20, 2500, replace=TRUE), b=sample(1:10, 2500, replace=TRUE), c=sample(1:10, 2500, replace=TRUE))
setkey(testDT, z)
prct.i <- function(a,i) sum(a==i)/sum(a>0)
testDT[ , setNames( lapply(1:3, prct.i, a=a), paste0("a_", 1:3, "_prop") ), by=z]
z a_1_prop a_2_prop a_3_prop
1: 1 0.04373757 0.04970179 0.05964215
2: 2 0.04678363 0.01949318 0.04483431
3: 3 0.04158416 0.06534653 0.05742574
4: 4 0.05296610 0.04872881 0.05084746
5: 5 0.05128205 0.04142012 0.04930966
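Two "tricks": use lapply to return a list, and use setNames to name that otherwise unnamed list. Unfortunately, and somewhat ironically for a functional language, a for loop in R always returns NULL, so it cannot supply the columns. I later realized that I also needed to add the mean: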
testDT[ , c(a_Mean=mean(as.numeric(a), na.rm = TRUE),
setNames( lapply(1:3, prct.i, a=a), paste0("a_", 1:3, "_prop") )
), by=z]
z a_Mean a_1_prop a_2_prop a_3_prop
1: 1 10.62227 0.04373757 0.04970179 0.05964215
2: 2 10.93762 0.04678363 0.01949318 0.04483431
3: 3 10.50495 0.04158416 0.06534653 0.05742574
4: 4 10.64619 0.05296610 0.04872881 0.05084746
5: 5 10.75937 0.05128205 0.04142012 0.04930966
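To make the for-versus-lapply point concrete, here is a minimal base-R sketch. The toy vector `a` and the names `loop_value` and `props` are made up for illustration and are not part of the original example. It shows that a for loop evaluates to NULL, while lapply combined with setNames yields the kind of named list used above:

```r
# Hypothetical toy vector, for illustration only
a <- c(1, 1, 2, 3, 3, 3, 4, 5)

# Same helper as in the answer: share of positive values equal to i
prct.i <- function(a, i) sum(a == i) / sum(a > 0)

# A for loop is run for its side effects; its value is NULL
loop_value <- for (i in 1:3) prct.i(a, i)
is.null(loop_value)

# lapply returns an unnamed list; setNames attaches the column names
props <- setNames(lapply(1:3, prct.i, a = a), paste0("a_", 1:3, "_prop"))
names(props)
```

Because `a = a` is passed by name, lapply feeds each element of `1:3` into the remaining argument `i`.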
I checked these values against a shortened and more efficient version of the original code:
testDT[, list(
a_Mean=mean(as.numeric(a), na.rm = TRUE),
a_1_prop=sum(a==1)/sum(a>0),
a_2_prop=sum(a==2)/sum(a>0),
a_3_prop=sum(a==3)/sum(a>0)
), by=list(z)]
z a_Mean a_1_prop a_2_prop a_3_prop
1: 1 10.62227 0.04373757 0.04970179 0.05964215
2: 2 10.93762 0.04678363 0.01949318 0.04483431
3: 3 10.50495 0.04158416 0.06534653 0.05742574
4: 4 10.64619 0.05296610 0.04872881 0.05084746
5: 5 10.75937 0.05128205 0.04142012 0.04930966
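Scaling back up to the ten proportions from the question could look roughly like the sketch below. The names `vals` and `res`, the `set.seed` call, and the use of `keyby` (instead of `setkey` followed by `by`) are assumptions added for illustration. The mean is wrapped in `list()` so that `c()` is clearly concatenating two lists:

```r
library(data.table)

set.seed(1)  # assumption: seeded only so the sketch is reproducible
testDT <- data.table(z = sample(1:5, 2500, replace = TRUE),
                     a = sample(1:20, 2500, replace = TRUE))

# Share of positive values equal to i, as in the answer
prct.i <- function(a, i) sum(a == i) / sum(a > 0)

vals <- 1:10  # hypothetical name for the values to tabulate

# One mean column plus one proportion column per entry of vals
res <- testDT[, c(list(a_Mean = mean(as.numeric(a), na.rm = TRUE)),
                  setNames(lapply(vals, prct.i, a = a),
                           paste0("a_", vals, "_prop"))),
              keyby = z]
names(res)
```

Changing `vals` is then the only edit needed to add or drop proportion columns.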