I must be doing something obviously stupid here, but can someone explain why data.table
does not perform the following operation by group?
library(data.table)
set.seed(1)
DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))
DT[grp=='a',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]
10% 20% 30% 40% 50% 60% 70% 80% 90%
-Inf 0.1415 0.2555 0.3448 0.4108 0.4878 0.6442 0.7140 0.7842 0.8703 Inf
DT[grp=='b',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]
10% 20% 30% 40% 50% 60% 70% 80% 90%
-Inf -1.22751 -0.66000 -0.55036 -0.32170 -0.11762 0.06583 0.37427 0.69183 1.35196 Inf
DT[,interval:=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)),.(grp)][]
grp val interval
1: a 0.2655 (-0.66,-0.55] => this is a "b" interval ? I would expect (0.2555 0.3448]
2: a 0.3721 (-0.55,-0.322]
3: a 0.5729 (-0.118,0.0658]
4: a 0.9082 (1.35, Inf]
5: a 0.2017 (-1.23,-0.66]
---
196: b -0.7508 (-1.23,-0.66]
197: b 2.0872 (1.35, Inf]
198: b 0.0174 (-0.118,0.0658]
199: b -1.2863 (-Inf,-1.23]
200: b -1.6406 (-Inf,-1.23]
What I ultimately want to do is the following:
DT[,mean(val),keyby=.(grp,interval=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))]
grp interval V1
1: a (-0.321,0.0379] 0.01836077 => this is not a "a" interval
2: a (0.0379,0.21] 0.13190935
3: a (0.21,0.358] 0.29068707
4: a (0.358,0.477] 0.41647597
5: a (0.477,0.648] 0.55190648
6: a (0.648,0.777] 0.70883795
7: a (0.777,0.915] 0.84091210
8: a (0.915, Inf] 0.95797615
9: b (-Inf,-0.657] -1.23322909
10: b (-0.657,-0.321] -0.53243898
11: b (-0.321,0.0379] -0.13968720
12: b (0.0379,0.21] 0.11278201
13: b (0.21,0.358] 0.30783459
14: b (0.358,0.477] 0.40695489
15: b (0.477,0.648] 0.55976052
16: b (0.648,0.777] 0.70483170
17: b (0.777,0.915] 0.91017423
18: b (0.915, Inf] 1.57112705
This looks suspiciously as if the intervals were defined on the WHOLE dataset rather than per group:
DT[,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]
10% 20% 30% 40% 50% 60% 70% 80% 90%
-Inf -0.65729223 -0.32084835 0.03788176 0.20967534 0.35835115 0.47738589 0.64820328 0.77734560 0.91505885 Inf
Answer 0 (score: 3)
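It looks like you are expecting some fancy way of combining factor levels (which is what cut
creates) across groups. Instead, you have found weird behavior, which is typical of factors.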
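To make the factor issue concrete, here is a minimal base R sketch. It assumes the same two distributions as in the question; `decile_cut`, `fa`, `fb` and `mislabeled` are names made up for illustration. It shows that each group's cut has its own set of level labels while both store the same integer codes 1 to 10, so pairing one group's codes with the other group's labels produces the kind of mismatch seen in the question, where 0.2655 sits in a's third bin but is printed with b's third label.

```r
# Same two distributions as in the question
set.seed(1)
a <- runif(100)
b <- rnorm(100)

# Hypothetical helper: cut a vector at its own deciles
decile_cut <- function(x) {
  cut(x, c(-Inf, quantile(x, probs = seq(.1, .9, .1)), Inf))
}

fa <- decile_cut(a)
fb <- decile_cut(b)

# Both factors have 10 levels, but the labels are different
nlevels(fa)
nlevels(fb)
identical(levels(fa), levels(fb))

# A factor is integer codes plus a vector of labels.
# Pairing a's codes with b's labels mislabels every value of a.
mislabeled <- factor(levels(fb)[as.integer(fa)], levels = levels(fb))
head(data.frame(val = a, correct = fa, mislabeled = mislabeled))
```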
I guess you could use strings:
DT[,interval :=
as.character(cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))
, by=grp]
which gives
grp val interval
1: a 0.26550866 (0.256,0.345]
2: a 0.37212390 (0.345,0.411]
3: a 0.57285336 (0.488,0.644]
4: a 0.90820779 (0.87, Inf]
5: a 0.20168193 (0.142,0.256]
---
196: b -0.75081900 (-1.23,-0.66]
197: b 2.08716655 (1.35, Inf]
198: b 0.01739562 (-0.118,0.0658]
199: b -1.28630053 (-Inf,-1.23]
200: b -1.64060553 (-Inf,-1.23]
However, these intervals are not good for much. If you try to sort by them, e.g. with DT[, mean(val), keyby=.(grp, interval)],
you will find they are out of order.
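A small sketch of why the string version sorts badly, using made-up values and breaks rather than the data above: a factor keeps its levels in break order, but once converted to character the labels are ordered as plain text. The example uses `method = "radix"`, which sorts in C-locale; other locales may give a different, but still non-numeric, order.

```r
# Made-up values and breaks, purely for illustration
x <- c(-2, -0.5, 0.3, 5, 12)
f <- cut(x, c(-Inf, -1, 0, 1, 10, Inf))

# Factor levels follow the order of the breaks
in_break_order <- levels(f)
in_break_order

# As strings, the labels are compared character by character
in_text_order <- sort(as.character(f), method = "radix")
in_text_order

# The two orders disagree
identical(in_break_order, in_text_order)
```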
If you only want these cuts for a single calculation...
mycut = function(x) cut(x,c(-Inf,quantile(x,probs=seq(.1,.9,.1)),Inf))
DT[,{
.SD[, mean(val), keyby=.(interval=mycut(val))][, interval := as.character(interval)]
},keyby=grp]
which gives
grp interval V1
1: a (-Inf,0.142] 0.07670249
2: a (0.142,0.256] 0.20584852
3: a (0.256,0.345] 0.30715649
4: a (0.345,0.411] 0.38583465
5: a (0.411,0.488] 0.45901975
6: a (0.488,0.644] 0.56413855
7: a (0.644,0.714] 0.67442643
8: a (0.714,0.784] 0.75834958
9: a (0.784,0.87] 0.82747749
10: a (0.87, Inf] 0.91951669
11: b (-Inf,-1.23] -1.54198329
12: b (-1.23,-0.66] -0.92447488
13: b (-0.66,-0.55] -0.61458549
14: b (-0.55,-0.322] -0.45029247
15: b (-0.322,-0.118] -0.22533466
16: b (-0.118,0.0658] -0.01587467
17: b (0.0658,0.374] 0.24836075
18: b (0.374,0.692] 0.53061032
19: b (0.692,1.35] 1.01688411
20: b (1.35, Inf] 1.80089535
Yes, it is not very elegant, but I think this is a problem with R itself, and it is not obvious how it should change to solve your problem.
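As a possible alternative, assuming only the rank of the bin within each group matters and not its label, one could store an integer decile index instead of a factor. This is a base R sketch rather than a data.table idiom; `decile_index`, `df` and `bin` are hypothetical names. Integer bins mean the same thing in every group and sort numerically.

```r
# Same data as in the question, as a plain data.frame
set.seed(1)
df <- data.frame(grp = rep(c("a", "b"), each = 100),
                 val = c(runif(100), rnorm(100)))

# Hypothetical helper: integer decile index (1..10) of each value
# within its own vector; left.open = TRUE mimics cut()'s (lo, hi] bins
decile_index <- function(x) {
  q <- quantile(x, probs = seq(.1, .9, .1))
  findInterval(x, q, left.open = TRUE) + 1L
}

# ave() applies the helper separately within each group
df$bin <- ave(df$val, df$grp, FUN = decile_index)

# Mean of val per group and per within-group decile
res <- aggregate(val ~ grp + bin, data = df, FUN = mean)
res[order(res$grp, res$bin), ]
```

If the interval labels are still wanted for display, they could be looked up per group from the quantiles afterwards, at which point they are only text and no longer drive the grouping or the ordering.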