Question

考虑一个矩阵，每行指定一个二维区域，另一个矩阵指定平面中的点：

xmin <- c(3, 14, 25, 61)
xmax <- c(5, 18, 27, 65)
ymin <- c(33, 12, 83, 2)
ymax <- c(35, 16, 90, 6)
regions <- cbind(xmin, xmax, ymin, ymax)

x <- c(7, 26, 4, 16)
y <- c(4, 85, 30, 13)
points <- cbind(x, y)

获取regions中包含points中每个点的索引的最快方法是什么？

我想要实现的一个例子是：

apply(points, 1, function(x){
    which(regions[,'xmin'] < x[1] & regions[,'xmax'] > x[1] & regions[,'ymin'] < x[2] & regions[,'ymax'] > x[2])
})

但是，当regions和points中的行数接近1E5时，这变得相当慢，我正在寻找一种正确的矢量化方法......

提前致谢...

最好

托马斯

编辑：

对于任何有兴趣的人，我使用Rcpp在C ++中创建了一个函数，它提供了大约50倍的性能提升。我不熟练使用C ++，所以可能会做得更好......

cppFunction('
    IntegerVector findInRegion(NumericVector x, NumericVector y, NumericVector xmin, NumericVector xmax, NumericVector ymin, NumericVector ymax){
        int pointSize = x.size();
        int regionSize = xmin.size();
        IntegerVector ans(pointSize);
        for(int i = 0; i < pointSize; i++){
            ans[i] = NA_INTEGER;
        }

        for(int i = 0; i < pointSize; i++){
            for(int j = 0; j < regionSize; j++){
                if(x[i] > xmin[j]){
                    if(x[i] < xmax[j]){
                        if(y[i] > ymin[j]){
                            if(y[i] < ymax[j]){
                                ans[i] = j+1;
                            };
                        };
                    };
                };
            };
        };
        return ans;
    }
')

findRegion <- function(points, regions){
    if(!all(c('x', 'y') %in% colnames(points))){
        stop('points must contain columns named \'x\' and \'y\'')
    }
    if(!all(c('xmin', 'xmax', 'ymin', 'ymax') %in% colnames(regions))){
        stop('regions must contain columns named \'xmin\', \'xmax\', \'ymin\' and \'ymax\'')
    }
    findInRegion(points[, 'x'], points[,'y'], regions[, 'xmin'], regions[, 'xmax'], regions[, 'ymin'], regions[, 'ymax'])
}

这个功能的一个缺点是假设一个点只能属于一个区域......

Answer 1

这是一个非常有趣的问题。我做了一些初步测试，这个似乎就好像它可能更快，但我真的不知道它的扩展程度。如果您可以测试您的实际数据并报告一些时间，我会感兴趣：

#  Are X coords greater than xmin
lx <- outer( points[,1] , regions[,1] , ">" )

#  Are X coords less than xmax
hx <- outer( points[,1] , regions[,2] , "<" )

#  Ditto for Y coords
ly <- outer( points[,2] , regions[,3] , ">" )
hy <- outer( points[,2] , regions[,4] , "<" )

#  These matrices for X and Y points have 1 if coords is in range, 0 otherwise
inx <- lx * hx
iny <- ly * hy

#  The final result matrix has 1 if both X and Y coords are in range and 0 if not
#  Rows are points, columns are regions
res <- inx * iny

在100000点和100000个区域的数据上，除非您拥有的RAM，否则此方法将无效。但是我认为，如果将区域数量分成块，每个区域大约1000个，那么它是非常有用的。在我的桌面上，100,000点和1,000个区域耗时5秒：

Unit: seconds
        expr      min      lq  median       uq      max neval
 eval(simon) 4.528942 4.55258 4.59848 4.607572 4.671511     5

作为我在apply方法与此方法之间看到的时间差异程度的粗略指南，包括10,000个点和1,000个区域（基于5次运行）：

Unit: milliseconds
        expr       min        lq    median        uq       max neval
 eval(simon)  394.7165  402.0919  403.0491  404.6943  428.7077     5
    eval(OP) 1359.5889 1364.6308 1372.4980 1383.1327 1491.4628     5

拥有100,000点和1,000个地区（基于一次运行）：

Unit: seconds
        expr       min        lq    median        uq       max neval
 eval(simon)  4.352857  4.352857  4.352857  4.352857  4.352857     1
    eval(OP) 14.027390 14.027390 14.027390 14.027390 14.027390     1

这是我用来生成样本数据和运行基准测试的代码：

set.seed(4862)
xmin <- sample(25,1000,repl=T)
xmax <- xmin + sample(15,100,repl=T)
ymin <- sample(25,1000,repl=T)
ymax <- ymin + sample(15,1000,repl=T)
regions <- cbind(xmin, xmax, ymin, ymax)

x <- sample(25,100000,repl=T)
y <- sample(25,100000,repl=T)
points <- cbind(x, y)


OP <- quote({ res <- apply(points, 1, function(x){
    which(regions[,'xmin'] < x[1] & regions[,'xmax'] > x[1] & regions[,'ymin'] < x[2] & regions[,'ymax'] > x[2])
}) })


simon <- quote({
lx <- outer( points[,1] , regions[,1] , ">" )
hx <- outer( points[,1] , regions[,2] , "<" )
ly <- outer( points[,2] , regions[,3] , ">" )
hy <- outer( points[,2] , regions[,4] , "<" )
inx <- lx * hx
iny <- ly * hy
res <- inx * iny })

require(microbenchmark)
microbenchmark( eval(simon) , eval(OP) , times = 1L )

我建议你这样做。 HTH。

Answer 2

这是另一种解决方案，使用R-tree索引（一种用于存储边界框的数据库索引）与SQLite。事实证明它比Simon的稍微慢一点（7秒），可能是因为数据被复制到磁盘上。

# Sample data: data.frames, rather than matrices
regions <- data.frame(id=1:length(xmin), xmin, xmax, ymin, ymax)
points  <- data.frame(x, y)

library(RSQLite)
con <- dbConnect("SQLite", dbname = "/tmp/a.sqlite") 
dbGetQuery( con, "CREATE VIRTUAL TABLE regions USING rtree (id, xmin, xmax, ymin, ymax)" )
dbWriteTable( con, "regions", regions, row.names = FALSE, append = TRUE )
dbWriteTable( con, "points",  points, row.names = TRUE )
res <- dbGetQuery( con, "
  SELECT points.row_names, regions.id
  FROM   points, regions
  WHERE  xmin <= x AND x <= xmax
  AND    ymin <= y AND y <= ymax
" )

在二维窗口中有效搜索

2 个答案: