Question

Does anyone know how to quickly and efficiently check whether a point P(x, y) exists in a two-column data.table? Example code:

dt <- data.table(x=c(1,2,3,4,5), y = c(2,3,4,5,6))
P <- c(2,3)

The output I want is TRUE (because the second row of dt contains my point P). I tried

 P %in% dt

but it only works for the first row. I also tried loops, without much hope. I am looking for an efficient 'data.table'-style solution.

Answer 1

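Expanding on the comments.

As Frank mentioned in the comments, Arun's post in What is the purpose of setting a key in data.table? says:

"Even otherwise, unless you are performing joins repeatedly, there should be no noticeable performance difference between a keyed join and an ad-hoc join."

and

"Therefore, one has to figure out whether the time spent reordering the entire data.table is worth the time saved by a cache-efficient join/aggregation. Usually, unless repeated grouping/join operations are performed on the same keyed data.table, there should not be a noticeable difference."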

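So what counts as the OP's "quick and efficient 'data.table'-style solution" really depends on the dimensions of the problem, i.e. the size of the dataset and the number of searches to be performed.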
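For the small table in the question and a single lookup, any of the usual approaches should do. The following is a minimal sketch, not a benchmarked recommendation; the names found_scan, found_join, found_keyed and dt_keyed are introduced here purely for illustration.

```r
library(data.table)

dt <- data.table(x = c(1, 2, 3, 4, 5), y = c(2, 3, 4, 5, 6))
P <- c(2, 3)

# Vector scan: compare both columns, then test whether any row survived.
found_scan <- dt[x == P[1] & y == P[2], .N > 0]

# Ad-hoc join: build a one-row lookup and join on both columns.
# nomatch = 0L drops unmatched rows, so .N is 0 when the point is absent.
found_join <- dt[.(x = P[1], y = P[2]), on = c("x", "y"), nomatch = 0L, .N > 0]

# Keyed join: pay for the sort once (on a copy, so dt keeps its row order),
# then look the point up through the key.
dt_keyed <- copy(dt)
setkey(dt_keyed, x, y)
found_keyed <- dt_keyed[.(P[1], P[2]), nomatch = 0L, .N > 0]
```

With the question's data, all three checks should come out TRUE, since the second row of dt is (2, 3). Which one is preferable only starts to matter once the table or the number of lookups grows.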

If both are large, here are some timings:

Data:

library(data.table)
set.seed(0L)
M <- 1e7
dtKeyed <- data.table(x=1:M, y=2:(M+1)) #R-3.4.4 data.table_1.10.4-3 win-x64
dtNoKey <- copy(dtKeyed)
system.time(setkey(dtKeyed, x, y)) #not free
dtKeyed

nsearches <- 1e3
points <- apply(matrix(sample(M, nsearches*2, replace=TRUE), ncol=2), 1, as.list)

Variants:

findPtNoKey <- function() {
    lapply(points, function(p) dtNoKey[p, on=names(dtNoKey), .N > 0, nomatch=0])
}

findPtOnKey <- function() {
    lapply(points, function(p) dtKeyed[p, on=names(dtKeyed), .N > 0, nomatch=0])
}

findPtKeyed <- function() {
    lapply(points, function(p) dtKeyed[p, .N > 0, nomatch=0])
}

library(microbenchmark)
microbenchmark(findPtKeyed(), findPtOnKey(), findPtNoKey(), times=3L)

Timings:

# remember to add the setkey time back onto the findPtKeyed timing

Unit: milliseconds
          expr         min          lq        mean      median          uq         max neval
 findPtKeyed()    924.6846    928.3025    946.0892    931.9205    956.7914    981.6624     3
 findPtOnKey()   1119.9686   1129.5641   1143.4505   1139.1597   1155.1915   1171.2233     3
 findPtNoKey() 146186.2216 154934.5463 161016.1277 163682.8709 168431.0807 173179.2905     3

Accuracy check:

ref <- findPtNoKey()

identical(findPtKeyed(), ref)
#[1] TRUE

identical(findPtOnKey(), ref)
#[1] TRUE
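The variants above look up one point per call. If all the points are known up front, a single join over a table of points may avoid the per-call overhead. This was not part of the timings above, so treat it as an untested-at-scale sketch; pts and idx are hypothetical names.

```r
library(data.table)

dt <- data.table(x = 1:5, y = 2:6)

# Hypothetical lookup table: one row per point to test.
pts <- data.table(x = c(2L, 4L, 3L), y = c(3L, 5L, 3L))

# which = TRUE returns, for each row of pts, the matching row number in dt
# (NA when there is no match); mult = "first" keeps one result per point.
idx <- dt[pts, on = c("x", "y"), which = TRUE, mult = "first"]

# Flag each point as present or absent.
pts[, found := !is.na(idx)]
```

Here the first two points exist in dt and the third, (3, 3), does not, so found marks only the first two as present.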
