Question

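Suppose I have a vector x that:

is very large (> 200 000 elements)
contains integers
is sorted
has all unique values

I want to check whether an integer value y is in this vector and, if it is, get its index. I want to exploit the fact that the vector is sorted so that this can be done quickly.

How would I accomplish this?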

Answer 1

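Given a sorted vector x and a vector of query values y, here are a couple of methods

f0 = function(y, vec) y %in% vec
f1 = function(y, vec) vec[findInterval(y, vec)] == y

%in% does a full scan; findInterval() does a binary search (I think). They produce the same result

> identical(f0(y, x), f1(y, x))
[1] TRUE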

> library(microbenchmark)
> microbenchmark(f0(y, x), f1(y, x), times=10)
Unit: milliseconds
     expr      min        lq      mean    median        uq       max neval
 f0(y, x) 99.35425 100.87319 102.32160 102.20107 103.67718 105.70854    10
 f1(y, x) 94.83219  95.05068  95.93625  95.77822  96.72601  97.50961    10

The two have approximately similar amortized performance, as the timings above show, but I think findInterval() is faster for small queries

> microbenchmark(f0(y[1:10], x), f1(y[1:10], x), times=10)
Unit: milliseconds
           expr       min        lq      mean   median       uq       max neval
 f0(y[1:10], x) 83.441578 85.116818 86.264751 86.07515 87.13516 89.430801    10
 f1(y[1:10], x)  7.731606  7.734207  7.757201  7.75199  7.77210  7.810957    10
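Note that f0 and f1 only report whether y is present, while the question also asks for the position. Below is a minimal sketch of how the index might be recovered, assuming vec is sorted in increasing order with unique values. The helper names index_by_match and index_by_findInterval and the toy data are made up for illustration; they are not part of the answer above.

```r
# Hypothetical helpers: return the position of each y in vec, or NA if absent.

# match() is the position-returning counterpart of %in%.
index_by_match <- function(y, vec) match(y, vec)

# findInterval() variant: assumes vec is sorted in increasing order.
index_by_findInterval <- function(y, vec) {
  i <- findInterval(y, vec)                  # 0 when y is below every element
  found <- i > 0L
  found[found] <- vec[i[found]] == y[found]  # keep exact hits only
  i[!found] <- NA_integer_
  i
}

vec <- c(2L, 5L, 9L, 14L)                    # toy sorted vector
y <- c(1L, 5L, 10L, 14L, 20L)                # toy queries

res_match <- index_by_match(y, vec)
res_fi <- index_by_findInterval(y, vec)
print(res_fi)
```

The i > 0L guard is there because findInterval() returns 0 for values below vec[1], and subsetting with a 0 index silently drops that element, so the lengths would no longer line up.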

David suggests (I think)

f2 = function(x, vec) vec[which.max(x == vec)] == x

which.max() works only for a scalar y, which is rarely (I say this for the benefit of the OP) a good use of R. It also seems to be less performant than findInterval()

> microbenchmark(f1(x[1000], x), f2(x[1000], x), times=10)
Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
 f1(x[1000], x) 7.707420 7.709047 7.714576 7.711979 7.718953 7.729688    10
 f2(x[1000], x) 9.353225 9.358874 9.381781 9.378680 9.400808 9.426102    10
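To make the "scalar only" point concrete: with more than one query, x == vec compares element by element (with recycling) instead of looking each query up, so f2 has to be called once per query. A small sketch on toy data (vec and queries are made up here), meant as an illustration and not as a recommendation:

```r
f2 <- function(x, vec) vec[which.max(x == vec)] == x

vec <- c(2L, 5L, 9L, 14L)      # toy sorted vector
queries <- c(5L, 10L, 14L)     # toy queries

# One call, and one full pass over vec, per query
per_query <- vapply(queries, f2, logical(1), vec = vec)

# Vectorised lookup for comparison
vectorised <- queries %in% vec
print(per_query)
```

Both give the same logical vector here, but the vapply() version scans vec once for every query.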

Contrary to @Laterow, I don't think there is any particular performance difference between which.max() and which() (in current R-devel or R-3-2-branch; the results are not the same either, so this is an apples-to-oranges comparison).

> set.seed(123) ; x <- sample(2e5, replace = TRUE)
> microbenchmark(which.max(x == 1e7), which(x == 1e7)[1])
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 which.max(x == 1e+07) 4.240606 4.266470 5.975966 5.015947 5.217903 43.78467   100
  which(x == 1e+07)[1] 4.060040 4.132667 5.550078 4.986287 5.059128 43.88074   100

I have a vague recollection of a conversation on R-devel in the last 6 months... The relative performance of which and which.max may have changed with this commit; previously which.max() would coerce the logical vector to a numeric vector before scanning, triggering a copy.
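On the remark that the results are not the same: in the benchmark x is drawn from 1..2e5, so 1e7 never occurs, and the two expressions return different things when nothing matches. A tiny illustration on a made-up vector v (not the benchmark data):

```r
v <- c(2L, 5L, 9L, 14L)                    # toy vector

# No element equals 7, so the comparison is all FALSE
no_match_which_max <- which.max(v == 7L)   # position of the first maximum
no_match_which <- which(v == 7L)[1]        # first TRUE position, if any

# With a match, both agree
hit_which_max <- which.max(v == 9L)
hit_which <- which(v == 9L)[1]

print(c(no_match_which_max, no_match_which, hit_which_max, hit_which))
```

So if which.max() is used to look up an index, the result should be checked against the value searched for, as f2 does, before it is trusted.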

R: fast way to check whether an integer value is in a sorted integer vector and return its index
