New column in a data.table holding the ID of each value's quantile bin

Time: 2013-10-18 05:16:18

Tags: r data.table

quantile(X, probs = seq(0, 1, length.out = 5), type = 5)

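How can I carry this over to a data.table operation that adds a new column with := and assigns a value per ID: if a value falls in a bin, it gets the matching ordinal, e.g. 1 for the 25% bin, 2 for the 50% bin, and so on?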

2 answers:

Answer 0: (score: 4)

You can use findInterval. That lets you keep using quantile and its various definitions (the type argument).

For example:

findInterval(x, quantile(x,type=5), rightmost.closed=TRUE)
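To see what that call returns, here is a small sketch on a made-up vector (the values 1 to 9 are my own illustration, not from the question). With the default probs, quantile gives the five quartile breakpoints, and findInterval reports which of the four intervals each value falls into. rightmost.closed=TRUE keeps the maximum in the last bin instead of opening a fifth one.

```r
# Toy input, assumed for illustration only
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Quartile breakpoints under quantile definition 5
breaks <- quantile(x, type = 5)
# breaks: 1, 2.75, 5, 7.25, 9

# Index of the interval [breaks[i], breaks[i + 1]) containing each value;
# rightmost.closed = TRUE puts max(x) into the last interval
bins <- findInterval(x, breaks, rightmost.closed = TRUE)
# bins: 1 1 2 2 3 3 3 4 4
```

Without rightmost.closed=TRUE the maximum would get bin 5, because it sits exactly on the last breakpoint.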

# It is fast
library(data.table)
library(microbenchmark)

set.seed(1)
DT <- data.table(x = rnorm(1e6))

microbenchmark(
  order        = DT[order(x), bin := ceiling(.I / .N * 5)],
  findInterval = DT[, b2 := findInterval(x, quantile(x, type = 5), rightmost.closed = TRUE)],
  times = 10
)
## Unit: milliseconds
##         expr       min        lq    median       uq      max neval
##        order 551.31154 568.20324 573.36605 640.3255 655.5024    10
## findInterval  70.16782  79.11459  80.36363 140.2807 147.3080    10
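Since the question asks for bins per ID, one way to sketch that is to compute the breaks inside each group with by. The id and x column names and the toy values below are assumptions for illustration; the question does not show its data.

```r
library(data.table)

# Hypothetical data: two IDs with eight values each
DT <- data.table(id = rep(c("a", "b"), each = 8),
                 x  = c(1:8, seq(10, 80, by = 10)))

# Quartile breaks as in the question, computed separately within each id,
# then each value is mapped to the index of its bin
DT[, bin := findInterval(
       x,
       quantile(x, probs = seq(0, 1, length.out = 5), type = 5),
       rightmost.closed = TRUE),
   by = id]
```

Because the breaks are recomputed per group, a value of 10 in id "b" lands in bin 1 even though it would be above every value in id "a".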

Answer 1: (score: 2)

For data without ties, a simple solution is to split by hand:

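library(data.table)
set.seed(1)
DT <- data.table(x = rnorm(20))
# Sort by x, then map each row's position in that order onto one of 5 equal-sized bins
DT[order(x), bin := ceiling(.I / .N * 5)]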

which gives

              x bin
 1: -0.62645381   1
 2:  0.18364332   3
 3: -0.83562861   1
 4:  1.59528080   5
 5:  0.32950777   3
 6: -0.82046838   1
 7:  0.48742905   3
 8:  0.73832471   4
 9:  0.57578135   4
10: -0.30538839   2
11:  1.51178117   5
12:  0.38984324   3
13: -0.62124058   2
14: -2.21469989   1
15:  1.12493092   5
16: -0.04493361   2
17: -0.01619026   2
18:  0.94383621   5
19:  0.82122120   4
20:  0.59390132   4
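As a rough illustration of why ties matter here (toy data assumed, not from the question): the order-based split assigns bins by position, so equal values can land in different bins, while a break-based split with findInterval always keeps equal values together, at the cost of bins that may be uneven in size.

```r
library(data.table)

# Small hypothetical vector with a run of tied values (already sorted)
DT <- data.table(x = c(1, 2, 2, 2, 2, 3, 4, 5, 6, 7))

# Order-based split: the position in sorted order decides the bin
DT[order(x), bin_order := ceiling(.I / .N * 5)]

# Break-based split: every value is compared against the same quintile breaks
DT[, bin_breaks := findInterval(x,
                                quantile(x, probs = seq(0, 1, 0.2), type = 5),
                                rightmost.closed = TRUE)]

# Number of distinct bins the tied value 2 ends up in under each approach
ties_order  <- DT[x == 2, uniqueN(bin_order)]   # more than one
ties_breaks <- DT[x == 2, uniqueN(bin_breaks)]  # exactly one
```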