How to select the first n rows of each group (after a join) with specified columns?

Time: 2017-06-20 04:33:26

Tags: r data.table

I want to combine the functionality of .SD and by = within a non-equi join. Related questions:

data.table - select first n rows within group

.EACHI in data.table

Example data:

library(data.table)
tmp_dt1 <- data.table(grp = c(1,2), time = c(0.2, 0.6, 0.4, 0.8, 0.25, 0.65))
tmp_dt2 <- data.table(grp = c(1,2), time_from = c(0.1, 0.5))
tmp_dt2 <- tmp_dt2[, time_to := time_from + 0.2]

> tmp_dt1
   grp time
1:   1 0.20
2:   2 0.60
3:   1 0.40
4:   2 0.80
5:   1 0.25
6:   2 0.65
> tmp_dt2
   grp time_from time_to
1:   1       0.1     0.3
2:   2       0.5     0.7

Now, the output I want is the first time in each group that falls within the range defined in tmp_dt2. I can get all of the times in range with:

> tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to), on = .(grp, time >= time_from, time <= time_to)]
   grp time time_from time_to
1:   1 0.20       0.1     0.3
2:   1 0.25       0.1     0.3
3:   2 0.60       0.5     0.7
4:   2 0.65       0.5     0.7
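A small sketch of why the query above asks for x.time instead of time: in a non-equi join, the result column named after x's join column holds the values from i (here time_from), so the x. prefix is needed to get the original values from tmp_dt1. The names raw and fixed are only illustrative.

```r
library(data.table)

# Same data as in the question, with grp written out instead of recycled.
tmp_dt1 <- data.table(grp = c(1, 2, 1, 2, 1, 2),
                      time = c(0.2, 0.6, 0.4, 0.8, 0.25, 0.65))
tmp_dt2 <- data.table(grp = c(1, 2), time_from = c(0.1, 0.5))
tmp_dt2[, time_to := time_from + 0.2]

# Without j: the join columns carry the bounds from tmp_dt2,
# not the times from tmp_dt1.
raw <- tmp_dt1[tmp_dt2, on = .(grp, time >= time_from, time <= time_to)]
print(raw)

# With j: x.time refers to the time column of tmp_dt1 (the "x" table).
fixed <- tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to),
                 on = .(grp, time >= time_from, time <= time_to)]
print(fixed)
```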

However, I am having trouble extracting the first n rows by grp from this without chaining. For example, with n = 1 the desired output is:

tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to), 
        on = .(grp, time >= time_from, time <= time_to)][, .SD[1], by = grp]

   grp time time_from time_to
1:   1  0.2       0.1     0.3
2:   2  0.6       0.5     0.7

However, something like:

> tmp_dt1[tmp_dt2, .(time = x.time[1], time_from[1], time_to[1]), on = .(grp, time >= time_from, time <= time_to), by = grp]
Error in `[.data.table`(tmp_dt1, tmp_dt2, .(time = x.time[1], time_from[1],  : 
  object 'time_from' not found

does not work.

Using .SD gets close, but gives a confusing result in terms of the columns selected:

tmp_dt1[tmp_dt2, .SD[1], on = .(grp, time >= time_from, time <= time_to), by = grp]
   grp time
1:   1  0.2
2:   2  0.6

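The reason I do not want to do this in a chain is memory issues. Note that I am only interested in solving this particular problem with the data.table package.

3 answers:

Answer 0: (score: 2)

One option is to specify mult = "first" (this covers the n = 1 case only):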

tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to), mult = "first", 
             on = .(grp, time >= time_from, time <= time_to)]
#    grp time time_from time_to
#1:   1  0.2       0.1     0.3
#2:   2  0.6       0.5     0.7

Answer 1: (score: 2)

Have you tried this?

tmp_dt1[tmp_dt2, on=.(grp, time>=time_from, time<=time_to), 
    x.time, by=.EACHI] # or head(x.time, 2L) to get first 2 rows etc.

You will need to rename the duplicated columns yourself until this is handled internally, as described here.
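A minimal sketch of that approach with the renaming made explicit. It assumes the by = .EACHI result has four columns in the order grp, lower bound, upper bound, selected time; the names n, x_time and res are only illustrative.

```r
library(data.table)

tmp_dt1 <- data.table(grp = c(1, 2, 1, 2, 1, 2),
                      time = c(0.2, 0.6, 0.4, 0.8, 0.25, 0.65))
tmp_dt2 <- data.table(grp = c(1, 2), time_from = c(0.1, 0.5))
tmp_dt2[, time_to := time_from + 0.2]

n <- 1L  # number of rows to keep per row of tmp_dt2

# j is evaluated once per row of tmp_dt2, so head() keeps the first n
# matching times without materialising the full join result first.
res <- tmp_dt1[tmp_dt2, on = .(grp, time >= time_from, time <= time_to),
               .(x_time = head(x.time, n)), by = .EACHI]

# Both bound columns come back named "time"; rename all columns by position.
setnames(res, c("grp", "time_from", "time_to", "time"))
print(res)
```

Note that the grouping here is per row of tmp_dt2, which coincides with per grp only because tmp_dt2 has a single interval for each group.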

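Answer 2: (score: 1)

If you want to minimise memory usage, another solution may be more efficient than the original chaining approach. It may look odd to store a temporary result in a variable, but that variable has only two columns and only the first n rows per group. It still uses chaining, but on a smaller subset of the original data: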

n <- 1       # parameter: first "n" rows per group
selected.rows <- tmp_dt1[tmp_dt2, .(rownum = .I[1:n]), on = .(grp, time >= time_from, time <= time_to), by = grp]
tmp_dt1[selected.rows$rownum][tmp_dt2, .(grp, time = x.time, time_from, time_to), on = .(grp, time >= time_from, time <= time_to)]

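Not very elegant, and perhaps slower: it repeats the join logic and needs to join twice, even if the second join runs on a reduced subset.

The temporary result set contains the row number of each "match" in the original data table (using data.table's .I symbol).

It would be great to compare this solution against chaining on a really large data table. (If I had more time, I would benchmark this.)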
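A rough sketch of how such a comparison could be set up. Everything here is an assumption made for illustration: the synthetic data, the sizes n_groups and rows_per_group, and the choice to compare chaining against the by = .EACHI approach from answer 1. It measures elapsed time only, not memory, which is the actual concern in the question.

```r
library(data.table)

set.seed(1)
n_groups <- 100L         # assumed number of groups
rows_per_group <- 1000L  # assumed rows per group
n <- 2L                  # first n rows per group

big_dt1 <- data.table(
  grp  = rep(seq_len(n_groups), times = rows_per_group),
  time = runif(n_groups * rows_per_group)
)
big_dt2 <- data.table(grp = seq_len(n_groups), time_from = 0.1)
big_dt2[, time_to := time_from + 0.2]

# Chaining: build the full join result, then take the first n rows per group.
t_chain <- system.time(
  chained <- big_dt1[big_dt2, .(grp, time = x.time, time_from, time_to),
                     on = .(grp, time >= time_from, time <= time_to)][
                       , head(.SD, n), by = grp]
)

# by = .EACHI: take the first n matches while the join is evaluated.
t_eachi <- system.time(
  eachi <- big_dt1[big_dt2, on = .(grp, time >= time_from, time <= time_to),
                   .(x_time = head(x.time, n)), by = .EACHI]
)

print(t_chain["elapsed"])
print(t_eachi["elapsed"])

# Both approaches should select the same times.
stopifnot(isTRUE(all.equal(chained$time, eachi$x_time)))
```

Timings on data this small are noisy, so the sizes would need to be scaled up toward the real workload before drawing any conclusion.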