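Quantile cut by group in data.table

Time: 2017-03-22 10:02:30

Tags: r data.table quantile

I want to do a quantile cut within each group (cutting the values into n bins that each hold an equal fraction of the data).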

qcut = function(x, n) {
  quantiles = seq(0, 1, length.out = n+1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  cut(x, cutpoints, include.lowest = TRUE)
}

library(data.table)
dt = data.table(A = 1:10, B = c(1,1,1,1,1,2,2,2,2,2))
dt[, bin := qcut(A, 3)]
dt[, bin2 := qcut(A, 3), by = B]

dt
     A B    bin        bin2
 1:  1 1  [1,4]    [6,7.33]
 2:  2 1  [1,4]    [6,7.33]
 3:  3 1  [1,4] (7.33,8.67]
 4:  4 1  [1,4]   (8.67,10]
 5:  5 1  (4,7]   (8.67,10]
 6:  6 2  (4,7]    [6,7.33]
 7:  7 2  (4,7]    [6,7.33]
 8:  8 2 (7,10] (7.33,8.67]
 9:  9 2 (7,10]   (8.67,10]
10: 10 2 (7,10]   (8.67,10]

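The cut without grouping is correct here: each value lands in the right bin. The grouped result, however, is wrong: the rows with B = 1 (A from 1 to 5) are labelled with intervals such as [6,7.33], which belong to the B = 2 group.

How can I fix this?

1 Answer:

Answer 0: (score: 8)

This is a bug in the handling of factors. Please check whether it is already known (or fixed in the development version); otherwise report it to the data.table bug tracker.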
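
Until the factor handling is fixed, one possible workaround (an assumption on my part, not something confirmed by the data.table maintainers) is to return character labels from each group instead of a factor, so that the per-group level sets never have to be combined. A minimal sketch, reusing qcut and dt from the question:

```r
library(data.table)

# qcut exactly as defined in the question
qcut = function(x, n) {
  quantiles = seq(0, 1, length.out = n + 1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  cut(x, cutpoints, include.lowest = TRUE)
}

dt = data.table(A = 1:10, B = c(1,1,1,1,1,2,2,2,2,2))

# Assumed workaround: convert the factor to character inside each group,
# so every row keeps the interval label computed for its own group
dt[, bin2 := as.character(qcut(A, 3)), by = B]
```

If a factor column is needed afterwards, it can be built from the character column after the grouped assignment, for example with dt[, bin2 := factor(bin2)]. Note that the levels then sort alphabetically rather than by interval.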
