Mean of subsets

Date: 2017-07-06 19:35:28

Tags: r data.table

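I'm still learning R, but I personally believe this is impossible, and I hope one of you can prove me wrong.

I want to find the mean of the values <= the 25th percentile and the mean of the values >= the 75th percentile, but not over the whole data set. I want these means for subsets of the data, with the percentiles computed within each subset.

This generates data similar to my own: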

library(data.table)
# Use `=` (not `<-`) to name the columns; `<-` would also create global variables V1 and V2
DT <- data.table(V1 = c('AR','AR','AR','AR','AR','AR','AD','AD','AD','AD','AD','AD','BD',
                        'BD','BD','BD','BX','CX','DX','DX','DD','DD','DD','DD','DR','DR',
                        'DR','DR','DR','DR'),
                 V2 = c(.12,.02,.03,.22,.44,.09,.11,.17,.15,.26,.29,.27,.16,.16,.02,.12,.02,
                        .03,.22,.44,.09,.11,.17,.15,.26,.29,.27,.16,.16,.02))

Which looks like:

    V1   V2
 1: AR 0.12
 2: AR 0.02
 3: AR 0.03
 4: AR 0.22
 5: AR 0.44
 6: AR 0.09
 7: AD 0.11
 8: AD 0.17
 9: AD 0.15
10: AD 0.26
11: AD 0.29
12: AD 0.27
13: BD 0.16
14: BD 0.16
15: BD 0.02
16: BD 0.12
17: BX 0.02
18: CX 0.03
19: DX 0.22
20: DX 0.44
21: DD 0.09
22: DD 0.11
23: DD 0.17
24: DD 0.15
25: DR 0.26
26: DR 0.29
27: DR 0.27
28: DR 0.16
29: DR 0.16
30: DR 0.02

Step one: compute the median, 25th percentile, 75th percentile, and number of appearances for each group (A_, B_, C_, D_). Got that:

dt.qtile <- DT[, list(Bottom = quantile(V2, .25), 
                      Middle = quantile(V2, .5),  
                         Top = quantile(V2, .75),
                 Appearances = .N), by = V1]

Produces:

   V1 Bottom Middle    Top Appearances
1: AR  0.045  0.105 0.1950           6
2: AD  0.155  0.215 0.2675           6
3: BD  0.095  0.140 0.1600           4
4: BX  0.020  0.020 0.0200           1
5: CX  0.030  0.030 0.0300           1
6: DX  0.275  0.330 0.3850           2
7: DD  0.105  0.130 0.1550           4
8: DR  0.160  0.210 0.2675           6
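The cutoffs in this table are not always values that occur in the data, because `quantile()` interpolates between neighbouring sorted values by default (type 7). A minimal sketch of that calculation for the AR group, using only the values from the question; the variable names are illustrative:

```r
# V2 values of the AR group, taken from the question's data
ar <- c(.12, .02, .03, .22, .44, .09)
x  <- sort(ar)                      # 0.02 0.03 0.09 0.12 0.22 0.44

# Type 7 places the p-th quantile at fractional position 1 + (n - 1) * p
pos  <- 1 + (length(x) - 1) * 0.25  # 2.25
lo   <- floor(pos)                  # 2
frac <- pos - lo                    # 0.25

# Linear interpolation between the two neighbouring order statistics
manual <- x[lo] + frac * (x[lo + 1] - x[lo])

manual                                         # 0.045, the Bottom value for AR
all.equal(manual, unname(quantile(ar, 0.25)))  # TRUE
```

This is why a value such as 0.15 in the AD group still counts as "<= the 25th percentile" when the cutoff is 0.155.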

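This is where I think it becomes impossible. For each letter combination in V1, I want to find the values in the original V2 (DT$V2) that are less than or equal to the 25th percentile, and then those greater than or equal to the 75th percentile: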

    V1   V2
 1: AR 0.12 -  Ignore   -
 2: AR 0.02 <= 0.045    \
 3: AR 0.03 <= 0.045    / mean = 0.025 (Bottom)
 4: AR 0.22 >= 0.1950   \
 5: AR 0.44 >= 0.1950   / mean = 0.33 (Top)
 6: AR 0.09 -  Ignore   -
    ------
 7: AD 0.11 <= 0.155    \
 8: AD 0.17 -  Ignore    |
 9: AD 0.15 <= 0.155    / mean = 0.13 (Bottom)
10: AD 0.26 -  Ignore   -
11: AD 0.29 >= 0.2675   \
12: AD 0.27 >= 0.2675   / mean = 0.28 (Top)
      ...
25: DR 0.26 -  Ignore   -
26: DR 0.29 >= 0.2675   \
27: DR 0.27 >= 0.2675   / mean = 0.28 (Top)
28: DR 0.16 <= 0.16     \
29: DR 0.16 <= 0.16      | mean = 0.1133 (Bottom)
30: DR 0.02 <= 0.16     /

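That is, average the values in V2 that are <= the 25th percentile, then average the values that are >= the 75th percentile.

The new output should look like this: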

   V1 Bottom Middle    Top Appearances
1: AR 0.0250  0.105 0.3300           6
2: AD 0.1300  0.215 0.2800           6
                   ...
8: DR 0.1133  0.210 0.2800           6

This gets me close:

DT[V2 < quantile(V2, .25), mean(V2), by = V1]

But it computes the quantile over the whole data set rather than for each letter combination.
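The reason is that the expression in `i` is evaluated once against the full table, before `by` splits it into groups. A small sketch that makes this visible by comparing the call above with an explicitly global cutoff; the names `global_q25`, `in_i`, `manual` and the column name `m` are illustrative, and the data is the question's, built compactly with `rep()`:

```r
library(data.table)

# Same 30 rows as in the question
DT <- data.table(
  V1 = rep(c('AR','AD','BD','BX','CX','DX','DD','DR'), times = c(6,6,4,1,1,2,4,6)),
  V2 = c(.12,.02,.03,.22,.44,.09, .11,.17,.15,.26,.29,.27, .16,.16,.02,.12,
         .02, .03, .22,.44, .09,.11,.17,.15, .26,.29,.27,.16,.16,.02))

# A single cutoff computed from all 30 rows
global_q25 <- quantile(DT$V2, 0.25)

# quantile() inside i sees the whole V2 column, not the per-group slice
in_i   <- DT[V2 < quantile(V2, .25), .(m = mean(V2)), by = V1]
manual <- DT[V2 < global_q25,        .(m = mean(V2)), by = V1]

all.equal(in_i$m, manual$m)  # TRUE: both filter with the same global cutoff
```

Only expressions in `j` are evaluated per group, which is why the per-group cutoff has to be computed there.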

So I tried:

 DT[V2 < DT[, quantile(V2, .25), by = V1], mean(V2), by = V1]

I get:

Error in `[.data.table`(DT, V2 < DT[, quantile(V2, 0.25), by = V1], mean(V2),  : 
  i is invalid type (matrix). 
Perhaps in future a 2 column matrix could return a list of elements of DT
 (in the spirit of A[B] in FAQ 2.14). 
Please let datatable-help know if you'd like this, or add your comments to FR #657.
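The inner query returns a table with one row per group (8 rows, 2 columns), so comparing V2 against it cannot produce a logical vector with one entry per row of DT. The exact error text may differ between data.table versions. One possible workaround, sketched here as an alternative rather than the only approach, is to attach each group's cutoff to its rows first and then filter in `i`. The names `cutoffs`, `tmp`, `q25` and `bottom` are illustrative:

```r
library(data.table)

# Same 30 rows as in the question
DT <- data.table(
  V1 = rep(c('AR','AD','BD','BX','CX','DX','DD','DR'), times = c(6,6,4,1,1,2,4,6)),
  V2 = c(.12,.02,.03,.22,.44,.09, .11,.17,.15,.26,.29,.27, .16,.16,.02,.12,
         .02, .03, .22,.44, .09,.11,.17,.15, .26,.29,.27,.16,.16,.02))

# The inner query yields one row per group, not one value per row of DT
cutoffs <- DT[, .(q25 = quantile(V2, .25)), by = V1]
dim(cutoffs)  # 8 2

# Attach each group's cutoff to its own rows (on a copy, so DT stays unchanged)
tmp <- copy(DT)[, q25 := quantile(V2, .25), by = V1]

# Now i compares every row against the cutoff of its own group
bottom <- tmp[V2 < q25, .(Bottom = mean(V2)), by = V1]
bottom
```

With the strict `<` used here, groups that have no value below their own cutoff (the single-row groups BX and CX) drop out of the result entirely.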

I know this must be simple, but I can't see it. What am I missing? Let me know where I can clarify.

Thanks in advance for your help!

Edit

DT[, list( Bottom = mean(V2[V2 <= quantile(V2, 0.25)]), 
           Middle = median(V2), 
              Top = mean(V2[V2 >= quantile(V2, 0.75)]), 
      Appearances = .N), by = V1]

I would never have found this on my own.

1 answer:

Answer 0: (score: 1)

DT[, mean(V2[V2 < quantile(V2, 0.25)]), by = V1]
   V1    V1
1: AR 0.025
2: AD 0.130
3: BD 0.020
4: BX   NaN
5: CX   NaN
6: DX 0.220
7: DD 0.090
8: DR 0.020

DT[, mean(V2[V2 > quantile(V2, 0.75)]), by = V1]
   V1   V1
1: AR 0.33
2: AD 0.28
3: BD  NaN
4: BX  NaN
5: CX  NaN
6: DX 0.44
7: DD 0.17
8: DR 0.28
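The NaN rows appear because the comparisons above are strict: in a group whose cutoff equals its smallest (or largest) value, nothing is strictly below (or above) it, and `mean()` of an empty vector is NaN. The question asked for <= and >=, which keeps those groups. A sketch that computes both variants side by side in one grouped call; the column names are illustrative:

```r
library(data.table)

# Same 30 rows as in the question
DT <- data.table(
  V1 = rep(c('AR','AD','BD','BX','CX','DX','DD','DR'), times = c(6,6,4,1,1,2,4,6)),
  V2 = c(.12,.02,.03,.22,.44,.09, .11,.17,.15,.26,.29,.27, .16,.16,.02,.12,
         .02, .03, .22,.44, .09,.11,.17,.15, .26,.29,.27,.16,.16,.02))

res <- DT[, {
  q <- quantile(V2, c(.25, .75))           # both cutoffs for the current group
  .(Bottom_strict = mean(V2[V2 <  q[1]]),  # NaN when nothing is strictly below
    Bottom_incl   = mean(V2[V2 <= q[1]]),  # includes values equal to the cutoff
    Top_strict    = mean(V2[V2 >  q[2]]),  # NaN when nothing is strictly above
    Top_incl      = mean(V2[V2 >= q[2]]))
}, by = V1]

# BX has a single row; DR has two values equal to its 25th percentile
res[V1 %in% c("BX", "DR")]
```

For DR the two variants differ noticeably: the strict version averages only 0.02, while the inclusive version also averages in the two 0.16 values that sit exactly on the cutoff.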