Making a function that checks whether a vector exists in a matrix faster

Time: 2016-08-04 02:28:54

Tags: r matrix vector

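I have the following function (funtest) that tests whether a particular vector exists as a row of a matrix. The vector will always have length 2, and the matrix will always have two columns. The function works fine; I just want to make it faster (ideally much faster), because my matrices can have hundreds to thousands of rows.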

x = c(1,2)

set.seed(100)
m <- matrix(sample(c(1,-2,3,4), 500*2, replace=TRUE), ncol=2)

funtest(m,x)
[1] TRUE 

This is its current speed:

library(microbenchmark)
microbenchmark(funtest(m, x), times=100)
Unit: milliseconds
          expr      min       lq     mean   median       uq      max
 funtest(m, x) 1.501247 1.536157 1.674668 1.567826 1.708293 2.900046
 neval
   100

This is the function:

funtest = function(m, x) {
    out = any(apply(m,1,function(n,x) all(n==x),x=x))
    return(out)
}

3 Answers:

Answer 0 (score: 3)

How about

paste(x[1], x[2], sep='&') %in% paste(m[,1], m[,2], sep='&')

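This should be very efficient, since it is based on matching: once the first match is found, the search stops.

But I'm sure this is not the fastest. The best solution would be to write this operation in C code with a single while loop. However, the potential speed-up factor should not exceed 2.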
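To make the one-liner above easier to try, here is a minimal sketch that wraps it in a function. The name pasteFn is borrowed from the benchmark in the next answer, and m_small is a made-up toy matrix, not data from the question.

```r
# pasteFn: wrapper around the paste/%in% one-liner (the name is an assumption)
pasteFn <- function(m, x) {
  # Build one string key per row, then look the target key up with %in%
  paste(x[1], x[2], sep = "&") %in% paste(m[, 1], m[, 2], sep = "&")
}

# Toy matrix (hypothetical data) with rows (1,2), (3,4), (-2,1)
m_small <- matrix(c(1, 3, -2,
                    2, 4,  1), ncol = 2)

hit  <- pasteFn(m_small, c(1, 2))   # (1,2) is the first row
miss <- pasteFn(m_small, c(2, 1))   # same values, wrong order
print(c(hit = hit, miss = miss))
```

Note that paste() still builds a string for every row of m before any lookup happens, so the cost grows with the number of rows even when the match is in the first row.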

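Answer 1 (score: 3)

Here is an Rcpp (specifically RcppArmadillo) approach. The code comes first and the benchmark follows it. (Edit: the benchmark now also includes @zheyuan-li's very simple solution, called pasteFn.)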

# Import the relevant packages (All for compiling the C++ code inline)
library(Rcpp)
library(RcppArmadillo)
library(inline)

# We need to include these namespaces in the C++ code 
includes <- '
using namespace Rcpp;
using namespace arma;
'

# This is the main C++ function 
# We cast 'm' as an Armadillo matrix 'm1' and compute the number of rows 'numRows'
# We cast 'x' as a row vector 'x1'
# We then loop through the rows of the matrix 
# As soon as we find a matching row (anyEqual = TRUE), we stop and return TRUE
# If no matching row is found, then anyEqual = FALSE and we return FALSE
# Note: Within the for loop, we do an elementwise comparison of a row of m1 to x1
# If the row is equal to x1, then the sum of the elementwise comparison should equal the number of elements of x1
src <- '
mat m1 = as<mat>(m); 
int numRows = m1.n_rows;
rowvec x1 = as<rowvec>(x);
bool anyEqual = FALSE;
for (int i = 0; i < numRows && !anyEqual; i++){
    anyEqual = (sum(m1.row(i) == x1) == x1.size());
}
return(wrap(anyEqual));
'

# Here, we compile the function above
# Do this once (in a given R session) and use it as many times as desired
rcppFn <- cxxfunction(signature(m="numeric", x="numeric"), src, plugin='RcppArmadillo', includes)

Benchmark:

# Your function is called funtest
# Rcpp function is rcppFn
# Zheyuan's solution is pasteFn
microbenchmark(funtest(m, x), rcppFn(m, x), pasteFn(m, x), times=100, unit = "ms")
Unit: milliseconds
          expr      min        lq       mean    median        uq      max neval
 funtest(m, x) 1.127903 1.1984755 1.30559130 1.2514455 1.3431040 2.641258   100
  rcppFn(m, x) 0.005420 0.0061355 0.00879676 0.0073660 0.0084130 0.030305   100
 pasteFn(m, x) 0.741269 0.7610905 0.79174042 0.7752145 0.8228895 0.894389   100

Edit: if you want to pass a matrix 'x' instead, the same approach works. That version checks, for each row of x, whether it exists in m. It is very similar to the original code, with one extra for loop. It returns 1 or 0 depending on whether there is a match (I don't have enough experience with RcppArmadillo to create a bool vector).
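The C++ source for the matrix version is not shown here, so as a rough illustration of the same idea, here is a base R sketch (not the answer's RcppArmadillo code). The function name rowsInMatrix and the toy data are assumptions made for this example.

```r
# rowsInMatrix: hypothetical helper; returns 1 or 0 for each row of x
rowsInMatrix <- function(m, x) {
  out <- integer(nrow(x))
  for (i in seq_len(nrow(x))) {
    # Compare row i of x against every row of m in one vectorised step
    out[i] <- as.integer(any(m[, 1] == x[i, 1] & m[, 2] == x[i, 2]))
  }
  out
}

# Toy matrix (hypothetical data) with rows (1,2), (3,4), (-2,1)
m_small <- matrix(c(1, 3, -2,
                    2, 4,  1), ncol = 2)

# Two candidate rows: (1,2) is in m_small, (4,3) is not
x_rows <- matrix(c(1, 2,
                   4, 3), ncol = 2, byrow = TRUE)

res <- rowsInMatrix(m_small, x_rows)
print(res)
```

The outer loop runs once per row of x, which mirrors the "one extra for loop" described above; the inner comparison is left to R's vectorised operators.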

Answer 2 (score: 3)

base::bitwXor() returns 0 when two integers match.

Note: bitwXor() only works on integers.
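A small sketch of why XOR can serve as an equality test, and of how to apply it to a two-column matrix. The toy matrix is made up, and converting it to integer storage assumes it holds only whole numbers.

```r
# XOR of two equal integers is 0; any differing bit gives a non-zero result
same   <- bitwXor(5L, 5L)   # 0
differ <- bitwXor(5L, 3L)   # 6 (binary 101 XOR 011 = 110)

# Toy matrix (hypothetical data) with rows (1,2), (3,4), (-2,1)
m_small <- matrix(c(1, 3, -2,
                    2, 4,  1), ncol = 2)

# Assumption: all values are whole numbers, so integer storage loses nothing
storage.mode(m_small) <- "integer"
x_int <- c(1L, 2L)

# A row matches when both column-wise XOR results are 0
found <- any(bitwXor(m_small[, 1], x_int[1]) == 0L &
             bitwXor(m_small[, 2], x_int[2]) == 0L)
print(found)
```

If the data can contain non-integer values, this approach does not apply and a direct comparison is the safer choice.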

Edit: added the comparison of the bitwXor() result with 0, and added a data.table solution.

The data.table solution (fun4) and the benchmark:

library(microbenchmark)
set.seed(100)
m <- matrix(sample(c(1,-2,3,4), 500*2, replace=TRUE), ncol=2)

fun1 <- function(m,x) {any(apply(m,1,function(n,x) all(n==x),x=x))}
fun2 <- function(m,x) {paste(x[1], x[2], sep='&') %in% paste(m[,1], m[,2], sep='&')}
fun3 <- function(m,x) {any((bitwXor(m[,1], x[1]) == 0) & (bitwXor(m[,2], x[2]) == 0))}
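# setDT() rejects a matrix, so convert with as.data.table(); its columns are named V1 and V2
fun4 <- function(m,x) {data.table::as.data.table(m)[V1 == x[1] & V2 == x[2], .N > 0]}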

x <-  c(1,2)

microbenchmark(fun1(m,x),     # @user3067923
               fun2(m,x),     # @Zheyuan Li
               rcppFn(m, x),  # @jav
               fun3(m,x),
               times = 1000)

# Unit: microseconds
#         expr      min       lq       mean   median       uq      max neval
#   fun1(m, x) 1802.483 1920.007 2156.93459 1995.865 2094.820 9915.013  1000
#   fun2(m, x) 1540.716 1602.534 1674.39556 1641.256 1702.848 2832.344  1000
# rcppFn(m, x)   14.040   16.305   23.43586   21.739   29.439   95.107  1000
#   fun3(m, x)   70.650   76.992   86.36290   82.879   88.766  314.303  1000