Question

I'm not sure how best to phrase this question, so feel free to edit the title if there is more standard vocabulary for it.

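I have two 2-column data tables in R. The first (u) is a list of unique 2-variable value pairs and is shorter than the second (d), which is the raw list of similar values. I need a function that, for each pair of values in u, finds all the pairs in d for which both variables are within a given threshold.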

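Here is a minimal example. The actual data is much larger (see below, since that is the problem) and is (obviously) not randomly generated as in the example. In the actual data, u would have roughly 600,000 to 1,000,000 values (rows) and d would have upwards of 10,000,000 rows.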

# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
                PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))

# Now, create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=TRUE)*sample(90:100,100,replace=TRUE)/100,
                PC2=sample(u$PC2,100,replace=TRUE)*sample(90:100,100,replace=TRUE)/100)

# Set the threshold that defines a 'close-enough' match between u and d values
b <- 0.1

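So, my first attempt at this was a for loop over all the values of u. This works, but it is computationally intensive and takes a long time on the real data.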

# Make a list to output the list of within-threshold rows
m <- list()
# Loop to find all values of d within a threshold b of each value of u
# The output list will have as many items as values of u
# For each list item, there may be up to several thousand matching rows in d
# Note that there's a timing command (system.time) in here to keep track of performance
system.time({
  for(i in 1:nrow(u)){
      m <- c(m, list(which(abs(d$PC1-u$PC1[i])<b & abs(d$PC2-u$PC2[i])<b)))
  } 
})
m

That works. But I thought it would be more efficient to use an apply() function. Here it is...

# Make the user-defined function for the threshold matching
match <- function(x,...){
  which(abs(d$PC1-x[1])<b & abs(d$PC2-x[2])<b)
}
# Run the function with the apply() command.
system.time({
  m <- apply(u,1,match)
})

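Again, this apply function works and is slightly faster than the for loop, but only slightly. This may simply be a big-data problem for which I need more computing power (or more time!). But I thought others might know of a sneaky command or function syntax that would speed this up dramatically. Outside-the-box approaches to finding these matching rows are also welcome.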

Answer 1

Kind of sneaky:

library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)

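This should be fast once the number of rows is large enough. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps may improve performance.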
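If IRanges is not available, a related idea can be sketched in base R: narrow the candidate rows with a sorted lookup on one variable, then apply the exact test from the question to that window only. This is not the IRanges method above but a swapped-in technique built on findInterval(). The helper name match_window and the set.seed() call are assumptions added for illustration, and the sketch has not been benchmarked on data of the size described in the question.

```r
# Toy data as in the question (set.seed added here only for reproducibility)
u <- data.frame(PC1 = c(-1.10, -1.01, -1.13, -1.18, -1.12, -0.82),
                PC2 = c(-1.63, -1.63, -1.81, -1.86, -1.86, -1.77))
set.seed(1)
d <- data.frame(PC1 = sample(u$PC1, 100, replace = TRUE) * sample(90:100, 100, replace = TRUE) / 100,
                PC2 = sample(u$PC2, 100, replace = TRUE) * sample(90:100, 100, replace = TRUE) / 100)
b <- 0.1

# match_window() is a hypothetical helper, not part of any package
match_window <- function(u, d, b) {
  ord <- order(d$PC1)   # row indices of d, sorted by PC1
  p1  <- d$PC1[ord]
  p2  <- d$PC2[ord]
  # Window slightly wider than b so rounding at the boundary cannot drop a match;
  # the exact test below decides what is actually kept
  w  <- b * 1.001
  lo <- findInterval(u$PC1 - w, p1) + 1L   # first sorted position above u - w
  hi <- findInterval(u$PC1 + w, p1)        # last sorted position at or below u + w
  lapply(seq_len(nrow(u)), function(i) {
    if (lo[i] > hi[i]) return(integer(0))  # empty window, no candidates
    idx  <- lo[i]:hi[i]
    keep <- abs(p1[idx] - u$PC1[i]) < b & abs(p2[idx] - u$PC2[i]) < b
    sort(ord[idx[keep]])                   # map back to original row numbers of d
  })
}

m_win <- match_window(u, d, b)
```

The result has one list element per row of u, holding the matching row numbers of d, in the same shape as the for loop's output. Whether it is faster depends on how narrow the PC1 window is relative to the spread of the data.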

Answer 2

Alas, this seems to be only slightly faster than the for loop:

unlist(Map(function(x,y) {
    which(abs(d$PC1-x)<b & abs(d$PC2-y)<b)
}, u$PC1, u$PC2))

But at least it's something.
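One detail worth noting: unlist() flattens all matches into a single vector, so the link between each match and the row of u that produced it is lost. If the per-row grouping from the question's list output is needed, a minimal sketch is to drop unlist() and keep the list that Map() returns. The data setup repeats the question's toy example, with set.seed() added here as an assumption for reproducibility.

```r
# Toy data as in the question (set.seed added here only for reproducibility)
u <- data.frame(PC1 = c(-1.10, -1.01, -1.13, -1.18, -1.12, -0.82),
                PC2 = c(-1.63, -1.63, -1.81, -1.86, -1.86, -1.77))
set.seed(1)
d <- data.frame(PC1 = sample(u$PC1, 100, replace = TRUE) * sample(90:100, 100, replace = TRUE) / 100,
                PC2 = sample(u$PC2, 100, replace = TRUE) * sample(90:100, 100, replace = TRUE) / 100)
b <- 0.1

# One list element per row of u, each holding the matching row numbers of d
m_map <- unname(Map(function(x, y) {
  which(abs(d$PC1 - x) < b & abs(d$PC2 - y) < b)
}, u$PC1, u$PC2))

# Number of matches found for each row of u
n_matches <- lengths(m_map)
```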

Answer 3

I have a cunning plan :-). How about doing the calculation like this:

> set.seed(10)
> bar<-matrix(runif(10),nc=2)
> bar
           [,1]      [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo<-c(.3,.7)
> thresh<-foo-bar
> sign(thresh)
     [,1] [,2]
[1,]   -1    1
[2,]    1    1
[3,]   -1    1
[4,]    1   -1
[5,]    1    1

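Now all you have to do is select the rows of that last matrix which have c(-1,1), and then you can easily extract the desired rows from your bar matrix. Repeat for each row in foo.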
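One thing to watch with this approach: foo - bar recycles foo down the columns of bar, so successive elements within each column are compared against foo[1] and foo[2] alternately, rather than column 1 against foo[1] and column 2 against foo[2]. A minimal sketch of a column-wise variant uses sweep() and tests the absolute differences against a threshold directly. The threshold name tol and its value are assumptions chosen for illustration.

```r
set.seed(10)
bar <- matrix(runif(10), ncol = 2)
foo <- c(0.3, 0.7)
tol <- 0.45   # hypothetical threshold, picked only for this toy example

# Subtract foo[j] from column j of bar (sweep's default FUN is "-")
diffs <- sweep(bar, 2, foo)

# Rows of bar where every column is within tol of the matching element of foo
hits <- which(rowSums(abs(diffs) < tol) == ncol(bar))
bar[hits, , drop = FALSE]
```

As in the answer, this would be repeated for each row of the table being matched against.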

Need a more efficient threshold-matching function in R
