Question

I am using data.table to do some repeated lookups on a large dataset (45M rows, 4 int columns).

Here is what I want to do.

library(data.table)
# generate some data, u's can show up in multiple s's
d1 <- data.table(u=rep(1:500,2), s=round(runif(1000,1,100),0))
setkey(d1, u, s)

# for each u, I want to lookup all their s's
us <- d1[J(u=1), "s", with=F]
# for each of the s's in the above data.table, 
#   I want to lookup other u's from the parent data.table d1

# DOESN'T WORK:
otherus <- d1[J(s = us), "u", with=F]   

# THIS WORKS but takes a really long time on my large dataset:
otherus <- merge(d1, us, by='s')

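merge works for my purpose, but since my 'd1' >>> 'us', it takes a long time. At first I thought I might be using base merge, but based on the documentation it looks like data.table's merge is dispatched if class(first_arg to merge) is data.table.

I am still getting used to the data.table J() syntax. Is there a faster way to accomplish this?

Thanks in advance.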

Answer 1

You can change the key for this purpose.

setkey(d1,s,u)

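After this command, the rows are sorted by s and then by u, so all u values that share the same s value are grouped together.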

        u   s
   1:  20   1
   2:  35   1
   3:  36   1
   4:  87   1
   5: 123   1
  ---        
 996: 208 100
 997: 262 100
 998: 352 100
 999: 430 100
1000: 455 100

Operations on groups defined by the key columns are usually very fast, e.g.

d1[,mean(u),keyby='s']

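If you need fast aggregation for groups of both u and s, you can store two instances of the data.table. For one you use setkey(d1,u,s) and for the other setkey(d1,s,u). If you want to quickly perform operations on groups defined by the values of u, use the former data.table, otherwise use the latter.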
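The two-instance idea above can be sketched as follows. Because setkey() re-sorts its argument by reference, this sketch assumes the second instance has to be made with copy(); otherwise both names would point at the same re-keyed table. The names by_u and by_s, the fixed seed, and the unique() call are illustrative assumptions, not part of the original answer.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
d1 <- data.table(u = rep(1:500, 2), s = round(runif(1000, 1, 100), 0))

# setkey() modifies its argument in place, so each instance is built
# from its own copy(); by_u and by_s are hypothetical names.
by_u <- setkey(copy(d1), u, s)
by_s <- setkey(copy(d1), s, u)

# Step 1: all s values for u == 1, using the instance keyed on u.
s_of_u1 <- unique(by_u[J(1), s])

# Step 2: all rows sharing one of those s values, using the instance keyed on s.
others <- by_s[J(s_of_u1)]
```

Each lookup then uses the instance whose leading key column matches, at the cost of holding the data in memory twice.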

Answer 2

Does the following work?

d1 <- data.table(u=rep(1:500,2), s=round(runif(1000,1,100),0))
setkey(d1, u, s)
us <- d1[J(u=1), "s", with=F]
otherus <- merge(d1, us, by='s') 

setkey(d1,s)
otherus2 <- d1[us]
identical(otherus2, otherus)

setkey(d1, u, s)
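As a possible alternative to switching the key back and forth, here is a minimal sketch that assumes data.table 1.9.6 or later, where the on argument allows an ad hoc join without re-keying. The fixed seed and the unique() call (to avoid duplicated rows when u == 1 has the same s twice) are assumptions added for illustration.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
d1 <- data.table(u = rep(1:500, 2), s = round(runif(1000, 1, 100), 0))
setkey(d1, u, s)

# The distinct s values seen for u == 1, kept as a one-column data.table.
us <- unique(d1[J(1), .(s)], by = "s")

# Ad hoc join on s; the (u, s) key on d1 is left untouched.
otherus <- d1[us, on = "s"]
```

Whether this is faster than keeping a second keyed instance on a 45M-row table would need to be measured on the real data.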

How do I do a multi-key lookup using data.table?
