Suppose I have the following data.table:
library(data.table)
dt <- data.table(x1 = c(1:12), x2=c(21:32))
Then I create bins at user-specified intervals with the following:
dt[,intx1:=cut(x1, breaks = c(-Inf, 4, 9, Inf))]
which returns:
x1 x2 intx1
1: 1 21 (-Inf,4]
2: 2 22 (-Inf,4]
3: 3 23 (-Inf,4]
4: 4 24 (-Inf,4]
5: 5 25 (4,9]
6: 6 26 (4,9]
7: 7 27 (4,9]
8: 8 28 (4,9]
9: 9 29 (4,9]
10: 10 30 (9, Inf]
11: 11 31 (9, Inf]
12: 12 32 (9, Inf]
I am trying to find the difference between each bin's mean and the variable's overall mean:
dt[, mux1_grp:=mean(x1), by = intx1][,mux1_pop:=mean(x1)][,mux1_diff:=mux1_grp-mux1_pop]
dt[,`:=`(intx1=NULL, mux1_grp=NULL, mux1_pop=NULL)]
The result is:
x1 x2 mux1_diff
1: 1 21 -4.0
2: 2 22 -4.0
3: 3 23 -4.0
4: 4 24 -4.0
5: 5 25 0.5
6: 6 26 0.5
7: 7 27 0.5
8: 8 28 0.5
9: 9 29 0.5
10: 10 30 4.5
11: 11 31 4.5
12: 12 32 4.5
However, my original data contains several variables (e.g., x1, x2, ..., x20), so I have to repeat the same procedure for x2 as follows:
dt[,intx2:=cut(x2, breaks = c(-Inf, 25, 28, Inf))]
dt[, mux2_grp:=mean(x2), by = intx2][,mux2_pop:=mean(x2)][,mux2_diff:=mux2_grp-mux2_pop]
dt[,`:=`(intx2=NULL, mux2_grp=NULL, mux2_pop=NULL)]
My final output would be:
x1 x2 mux1_diff mux2_diff
1: 1 21 -4.0 -3.5
2: 2 22 -4.0 -3.5
3: 3 23 -4.0 -3.5
4: 4 24 -4.0 -3.5
5: 5 25 0.5 -3.5
6: 6 26 0.5 0.5
7: 7 27 0.5 0.5
8: 8 28 0.5 0.5
9: 9 29 0.5 4.0
10: 10 30 4.5 4.0
11: 11 31 4.5 4.0
12: 12 32 4.5 4.0
How can I improve this code? Note that each variable has its own user-specified intervals.
Answer 0 (score: 2)
We can do this with a compact one-step option (although, per @Frank's comment, it may not be the best choice compared with the OP's approach):
dt[, mu_diff := mean(x) - mean(dt$x), by = .(cut(x, breaks = c(-Inf, 4, 9, Inf)))][]
# x mu_diff
#1: 1 -3.8636364
#2: 2 -3.8636364
#3: 3 -3.8636364
#4: 4 -3.8636364
#5: 5 0.3863636
#6: 6 0.3863636
#7: 7 0.3863636
#8: 9 0.3863636
#9: 10 4.6363636
#10: 11 4.6363636
#11: 12 4.6363636
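The grouping above depends on how cut assigns boundary values, which also explains why x2 = 25 falls into the first bin in the question's expected output. As a minimal sketch (the vector x here is illustrative only and not part of the original data), the intervals are right-closed by default, and right = FALSE flips them to left-closed:

```r
# Illustrative values only; 4 and 9 sit exactly on the break points
x <- c(1, 4, 5, 9, 10)
bins <- cut(x, breaks = c(-Inf, 4, 9, Inf))

# Integer codes of the bins: 4 lands in bin 1 and 9 in bin 2,
# because each interval includes its upper bound by default
as.integer(bins)

# With right = FALSE the intervals become [a, b), so 4 and 9 move up one bin
bins_left <- cut(x, breaks = c(-Inf, 4, 9, Inf), right = FALSE)
as.integer(bins_left)
```

If a break point should belong to the upper bin instead, passing right = FALSE to cut inside the by expression should be enough.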
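If there are many variables (it is not clear whether the same breaks are used in cut for every column or different ones; assume for now they are the same), we can loop over the columns (the reproducible example below uses two variables, 'x1' and 'x2'). Specify .SDcols by the numeric index of the column, group by cut applied to that column, and assign the new column as the difference between the mean of the group's values and the mean of the whole column.
nm1 <- paste0("mu_diff", seq_along(dt1))
for(j in seq_along(dt1)){
  dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
      by = .(cut(get(names(dt1)[j]), breaks = c(-Inf, 4, 9, Inf))) ,
      .SDcols = j][]
}
Update: if the breaks passed to cut differ for each variable, put them in a list and use the loop index to pull out the matching list element inside the for loop.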
brkLst <- list(c(-Inf, 4, 9, Inf), c(-Inf, 10, 14, Inf))
for(j in seq_along(dt1)){
dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
by = .(cut(get(names(dt1)[j]), breaks = brkLst[[j]])) ,
.SDcols = j][]
}
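The positional list above relies on the breaks being in the same order as the columns. One possible variation, sketched here under stated assumptions, keys the breaks by column name and computes the group means with base R's ave instead of a grouped := call. The names brks and v are hypothetical and do not appear in the original answer; the data is the OP's updated example.

```r
library(data.table)

# Same toy data as the OP's updated example
dt <- data.table(x1 = 1:12, x2 = 21:32)

# Hypothetical named list (not in the original answer):
# names must match the column names they apply to
brks <- list(x1 = c(-Inf, 4, 9, Inf),
             x2 = c(-Inf, 25, 28, Inf))

for (v in names(brks)) {
  x   <- as.numeric(dt[[v]])
  grp <- cut(x, breaks = brks[[v]])
  # ave() returns each row's group mean; subtract the overall mean
  set(dt, j = paste0("mu", v, "_diff"), value = ave(x, grp) - mean(x))
}

dt
```

Keying by name means a column can be skipped simply by leaving it out of the list, at the cost of swapping the grouped := idiom for ave and set.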
Checking the output with the OP's new data ('dt2'):
brkLst2 <- list(c(-Inf, 4, 9, Inf), c(-Inf, 25, 28, Inf))
nm1 <- paste0("mu", names(dt2), "_diff")
for(j in seq_along(dt2)){
dt2[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt2[[names(dt2)[j]]]),
by = .(cut(get(names(dt2)[j]), breaks = brkLst2[[j]])) ,
.SDcols = j][]
}
dt2
# x1 x2 mux1_diff mux2_diff
# 1: 1 21 -4.0 -3.5
# 2: 2 22 -4.0 -3.5
# 3: 3 23 -4.0 -3.5
# 4: 4 24 -4.0 -3.5
# 5: 5 25 0.5 -3.5
# 6: 6 26 0.5 0.5
# 7: 7 27 0.5 0.5
# 8: 8 28 0.5 0.5
# 9: 9 29 0.5 4.0
#10: 10 30 4.5 4.0
#11: 11 31 4.5 4.0
#12: 12 32 4.5 4.0
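A quick way to sanity-check a column like mux1_diff: every row holds (group mean - overall mean), so summing over all rows gives sum(x) - n * mean(x), which is zero. The sketch below rebuilds only the x1 column on the OP's data and checks that property; it is an illustrative check, not part of the original answer.

```r
library(data.table)

dt2 <- data.table(x1 = 1:12, x2 = 21:32)

# Group mean minus overall mean, grouped by the bins of x1
dt2[, mux1_diff := mean(x1) - mean(dt2$x1),
    by = .(cut(x1, breaks = c(-Inf, 4, 9, Inf)))]

# Bin sizes are 4, 5 and 3 rows, so the total is
# 4 * (-4) + 5 * 0.5 + 3 * 4.5 = 0
total <- sum(dt2$mux1_diff)
total
```

A total that is clearly non-zero would suggest the breaks and the column got mismatched somewhere in the loop.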
Data (define these before running the loops above):
dt1 <- data.table(x1 = c(1,2,3,4,5,6,7,9,10,11,12))[, x2 := x1 + 5][]
# OP's changed dataset
dt2 <- data.table(x1 = 1:12, x2=21:32)