data.table - select the first n rows within each group

Time: 2016-01-12 20:20:44

Tags: r data.table

Simple as it may be, I don't know of a data.table solution for selecting the first n rows of each group in a data table. Can you help me out?

2 answers:

Answer 0: (score: 47)

As an alternative:

dt[, .SD[1:3], cyl]

Looking at speed on the example dataset, the head method is on par with the .I method of @eddi. Comparing them with the microbenchmark package:

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD = dt[, .SD[1:3], cyl],
               I = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a 
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a 

However, data.table is specifically designed for large datasets. So, running this comparison again:

# creating a 30 million dataset
largeDT <- dt[,.SD[sample(.N, 1e7, replace = TRUE)], cyl]
# running the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD = largeDT[, .SD[1:3], cyl],
               I = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a 

Now the .I method is clearly the fastest one.
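
For readers unfamiliar with the .I idiom benchmarked here, the following is a minimal sketch that splits it into its two steps. It assumes dt is data.table(mtcars), as in the other answer; the names idx, first3_I and first3_head are only illustrative.

```r
library(data.table)

dt <- data.table(mtcars)

# Step 1: for each cyl group, collect the row numbers (positions in dt)
# of its first three rows. The unnamed result column is called V1.
idx <- dt[, .I[1:3], by = cyl]$V1

# Step 2: subset dt once by those row numbers.
first3_I <- dt[idx]

# The head() approach on the same data, for comparison.
first3_head <- dt[, head(.SD, 3), by = cyl]

# Both hold the same nine rows; only the column order differs,
# because by = moves the grouping column cyl to the front.
nrow(first3_I)
nrow(first3_head)
```

A plausible reason for the speed difference on large data is that the grouped step only has to produce integer row numbers, and the full rows are then materialised in a single subset; that is an interpretation, not something the benchmarks above establish.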

Update 2016-02-12:

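With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

However, with a somewhat smaller dataset (but still quite a large one), the odds change:

largeDT2 <- dt[,.SD[sample(.N, 1e6, replace = TRUE)], cyl]

With this dataset, the benchmark is slightly in favor of the head method over the .SD method.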

Answer 1: (score: 10)

We can use head together with .SD:

library(data.table)

dt <- data.table(mtcars)

> dt[, head(.SD, 3), by = "cyl"]

   cyl  mpg  disp  hp drat    wt  qsec vs am gear carb
1:   6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4
2:   6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4
3:   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
4:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1
5:   4 24.4 146.7  62 3.69 3.190 20.00  1  0    4    2
6:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2
7:   8 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2
8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4
9:   8 16.4 275.8 180 3.07 4.070 17.40  0  0    3    3
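
One behavioural difference between the approaches in this thread shows up when a group has fewer than n rows: head(.SD, n) simply returns what is there, while indexing with 1:n pads the short group with NA rows. A small sketch, using a made-up table (small, with columns g and x) that is not part of the original question:

```r
library(data.table)

# Hypothetical toy data: group "a" has 4 rows, group "b" only 2.
small <- data.table(g = c("a", "a", "a", "a", "b", "b"), x = 1:6)

# head() stops at the end of a short group: 3 + 2 = 5 rows.
res_head <- small[, head(.SD, 3), by = g]

# 1:3 always asks for three rows, so group "b" gets a padding row with x = NA.
res_sd <- small[, .SD[1:3], by = g]

# Capping the index at the group size .N avoids the padding
# while keeping the .I approach.
res_safe <- small[small[, .I[seq_len(min(.N, 3L))], by = g]$V1]

nrow(res_head)
nrow(res_sd)
nrow(res_safe)
```

With mtcars this does not matter, since every cyl group has at least three rows.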