Can you explain this by-group data.table result?

Asked: 2016-08-17 14:56:56

Tags: r data.table

I must be doing something obviously stupid here, but can someone explain why data.table does not perform the following operation by group?

library(data.table)
set.seed(1)
DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))
DT[grp=='a',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

          10%    20%    30%    40%    50%    60%    70%    80%    90%        
  -Inf 0.1415 0.2555 0.3448 0.4108 0.4878 0.6442 0.7140 0.7842 0.8703    Inf 

DT[grp=='b',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

              10%      20%      30%      40%      50%      60%      70%      80%      90%          
    -Inf -1.22751 -0.66000 -0.55036 -0.32170 -0.11762  0.06583  0.37427  0.69183  1.35196      Inf 

DT[,interval:=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)),.(grp)][]

     grp     val        interval
  1:   a  0.2655   (-0.66,-0.55] => this is a "b" interval ? I would expect (0.2555 0.3448]
  2:   a  0.3721  (-0.55,-0.322]
  3:   a  0.5729 (-0.118,0.0658]
  4:   a  0.9082     (1.35, Inf]
  5:   a  0.2017   (-1.23,-0.66]
 ---                            
196:   b -0.7508   (-1.23,-0.66]
197:   b  2.0872     (1.35, Inf]
198:   b  0.0174 (-0.118,0.0658]
199:   b -1.2863    (-Inf,-1.23]
200:   b -1.6406    (-Inf,-1.23]

What I ultimately want to do is the following:

DT[,mean(val),keyby=.(grp,interval=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))]
    grp        interval          V1
 1:   a (-0.321,0.0379]  0.01836077  => this is not a "a" interval
 2:   a   (0.0379,0.21]  0.13190935
 3:   a    (0.21,0.358]  0.29068707
 4:   a   (0.358,0.477]  0.41647597
 5:   a   (0.477,0.648]  0.55190648
 6:   a   (0.648,0.777]  0.70883795
 7:   a   (0.777,0.915]  0.84091210
 8:   a    (0.915, Inf]  0.95797615
 9:   b   (-Inf,-0.657] -1.23322909
10:   b (-0.657,-0.321] -0.53243898
11:   b (-0.321,0.0379] -0.13968720
12:   b   (0.0379,0.21]  0.11278201
13:   b    (0.21,0.358]  0.30783459
14:   b   (0.358,0.477]  0.40695489
15:   b   (0.477,0.648]  0.55976052
16:   b   (0.648,0.777]  0.70483170
17:   b   (0.777,0.915]  0.91017423
18:   b    (0.915, Inf]  1.57112705

This looks suspiciously as if the intervals were defined on the WHOLE dataset rather than per group:
DT[,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]
                    10%         20%         30%         40%         50%         60%         70%         80%         90%             
       -Inf -0.65729223 -0.32084835  0.03788176  0.20967534  0.35835115  0.47738589  0.64820328  0.77734560  0.91505885         Inf 
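
The suspicion can be checked directly. The sketch below relies on the fact that expressions inside by/keyby are evaluated once against the full table, not within each group, so val there refers to the whole column. The helper name brk is hypothetical and is introduced only to shorten the example.

```r
library(data.table)

set.seed(1)
DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))

# Hypothetical helper: decile breakpoints padded with -Inf and Inf
brk = function(x) c(-Inf, quantile(x, probs=seq(.1,.9,.1)), Inf)

# cut() sits inside keyby, so it sees all 200 values of val at once
res = DT[, mean(val), keyby=.(grp, interval=cut(val, brk(val)))]

# Intervals computed explicitly on the whole column, ignoring grp
whole = cut(DT$val, brk(DT$val))

# Intervals computed on group "a" only, for comparison
a_only = cut(DT[grp=='a']$val, brk(DT[grp=='a']$val))

# Every label in the grouped result is a whole-dataset interval ...
all(as.character(res$interval) %in% levels(whole))
# ... whereas the group-"a" labels do not account for them
all(as.character(res[grp=='a']$interval) %in% levels(a_only))
```

If the first check is TRUE and the second is FALSE, the breakpoints came from the full column rather than from each group.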

1 Answer:

Answer 0 (score: 3)

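It looks like you are expecting some magical way of combining factor levels (which is what cut creates) across groups. Instead, you have run into the kind of odd behavior that is typical of factors.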
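
To illustrate why factors are awkward here, the following base R sketch shows that a factor is just integer codes plus a levels attribute. The vectors a and b and their breakpoints are made up for the illustration. Whether data.table did exactly this internally in the version used above is an assumption, and newer versions may combine levels differently.

```r
# Two made-up numeric vectors, each cut at its own breakpoints
a <- c(0.1, 0.5, 0.9)
b <- c(-2, 0, 2)
fa <- cut(a, c(-Inf, 0.3, 0.7, Inf))
fb <- cut(b, c(-Inf, -1, 1, Inf))

# Both factors store the same integer codes ...
as.integer(fa)
as.integer(fb)

# ... and only the levels attribute says what each code means
levels(fa)
levels(fb)

# Reading a's codes through b's levels gives labels that do not contain the values,
# which is the kind of mislabelling seen in the question
mislabelled <- factor(levels(fb)[as.integer(fa)], levels = levels(fb))
data.frame(a, correct = fa, mislabelled)
```

When per-group results carrying different levels are stacked without reconciling those levels, the codes of one group can end up displayed with the labels of another.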

I guess you could use strings instead:

DT[,interval := 
  as.character(cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))
, by=grp]

which gives

     grp         val        interval
  1:   a  0.26550866   (0.256,0.345]
  2:   a  0.37212390   (0.345,0.411]
  3:   a  0.57285336   (0.488,0.644]
  4:   a  0.90820779     (0.87, Inf]
  5:   a  0.20168193   (0.142,0.256]
 ---                                
196:   b -0.75081900   (-1.23,-0.66]
197:   b  2.08716655     (1.35, Inf]
198:   b  0.01739562 (-0.118,0.0658]
199:   b -1.28630053    (-Inf,-1.23]
200:   b -1.64060553    (-Inf,-1.23]

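However, these intervals are not good for much beyond display. If you try to sort by them, e.g. DT[, mean(val), keyby=.(grp, interval)], you will find they come out in the wrong order, because character strings sort alphabetically rather than by interval position.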
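
One possible workaround, sketched here rather than taken from the original answer, is to store the integer bin index next to the character label and key on the index. The column name bin is an assumption made for this example, and mycut is the same helper defined further down.

```r
library(data.table)

set.seed(1)
DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))

mycut = function(x) cut(x,c(-Inf,quantile(x,probs=seq(.1,.9,.1)),Inf))

# Per group, keep the bin position (integer) and the readable label (character)
DT[, c("bin", "interval") := {
    f = mycut(val)
    list(as.integer(f), as.character(f))
}, by=grp]

# Keying on the integer bin keeps the intervals in numeric order within each group
res = DT[, list(interval = interval[1L], V1 = mean(val)), keyby=list(grp, bin)]
res
```

The label is constant within each (grp, bin) pair, so taking its first element is enough.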

If you only want these cuts for a one-off calculation...

mycut = function(x) cut(x,c(-Inf,quantile(x,probs=seq(.1,.9,.1)),Inf))

DT[,{
    .SD[, mean(val), keyby=.(interval=mycut(val))][, interval := as.character(interval)]
},keyby=grp]

which gives

    grp        interval          V1
 1:   a    (-Inf,0.142]  0.07670249
 2:   a   (0.142,0.256]  0.20584852
 3:   a   (0.256,0.345]  0.30715649
 4:   a   (0.345,0.411]  0.38583465
 5:   a   (0.411,0.488]  0.45901975
 6:   a   (0.488,0.644]  0.56413855
 7:   a   (0.644,0.714]  0.67442643
 8:   a   (0.714,0.784]  0.75834958
 9:   a    (0.784,0.87]  0.82747749
10:   a     (0.87, Inf]  0.91951669
11:   b    (-Inf,-1.23] -1.54198329
12:   b   (-1.23,-0.66] -0.92447488
13:   b   (-0.66,-0.55] -0.61458549
14:   b  (-0.55,-0.322] -0.45029247
15:   b (-0.322,-0.118] -0.22533466
16:   b (-0.118,0.0658] -0.01587467
17:   b  (0.0658,0.374]  0.24836075
18:   b   (0.374,0.692]  0.53061032
19:   b    (0.692,1.35]  1.01688411
20:   b     (1.35, Inf]  1.80089535
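
The same table can probably be produced without the nested .SD call. The sketch below is an alternative formulation, not the answer's method: it aggregates with base R's tapply inside j and returns the level labels as plain character, so nothing has to be reconciled across groups.

```r
library(data.table)

set.seed(1)
DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))

mycut = function(x) cut(x,c(-Inf,quantile(x,probs=seq(.1,.9,.1)),Inf))

res = DT[, {
    f = mycut(val)                          # factor local to this group
    list(interval = levels(f),              # labels as character, in level order
         V1 = as.vector(tapply(val, f, mean)))
}, keyby=grp]
res
```

Because tapply returns one mean per level in level order, the rows within each group stay sorted by interval.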

Yes, it is not very elegant, but I think this is a problem with R itself, and it is not obvious how it should change to solve your problem.