Question

I have two vectors of positive integers specifying the start and end "positions" of ranges:

starts <- sample(10^6,replace = T)
ends <- starts+sample(100:1000,length(starts),replace=T)

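So these specify 1,000,000 ranges, each 100 to 1000 units long. Now I want to know how many times each position (a positive integer) is "covered" by a range. To do this I did: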

coverage <- integer(max(ends))
for(i in seq(length(starts))) {
      coverage[starts[i]:ends[i]] <- coverage[starts[i]:ends[i]] + 1 
}

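But because of the for loop, it is relatively slow. For billions of ranges it could take a very long time. I can't find a way to vectorize this code. I could split the work and use multiple CPUs, but the speed gain would be marginal. apply, lapply and other meta-functions don't improve the speed (as expected). For example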

coverage <- tabulate(unlist(Map(':', starts,ends)))

is also slow because of the "Map" part. I fear it also needs more memory.

Any ideas?

Answer 1

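You can keep a count of the ranges that start and end at each particular index, and then apply a cumulative sum to the difference of those counts.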

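1. Aggregate the number of ranges that start at each index
2. Aggregate the number of ranges that end at one position before each index (if ends are inclusive)
3. Calculate the net change: count of starts - count of ends
4. Loop over the indices and cumulatively sum the net changes. This gives the number of ranges that started at or before this index and have not yet ended at this index.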

The "coverage" number is equal to the cumulative sum at each index.
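
The steps above can be sketched on a toy input. The three ranges and the names toy.starts, toy.ends, n and net are made up for illustration, and the counting uses base R's tabulate rather than the sparse vectors tried below:

```r
# Toy input: three inclusive ranges 2..4, 3..6 and 3..3 (illustrative values)
toy.starts <- c(2L, 3L, 3L)
toy.ends   <- c(4L, 6L, 3L)
n <- max(toy.ends) + 1L              # one extra slot for "end + 1"

# 1. number of ranges starting at each index
n.starts <- tabulate(toy.starts, nbins = n)
# 2. number of ranges ending one position before each index
n.ends <- tabulate(toy.ends + 1L, nbins = n)
# 3. net change at each index
net <- n.starts - n.ends
# 4. running total of the net change, dropping the extra slot
toy.cov <- cumsum(net)[seq_len(n - 1L)]

# Reference result from the straightforward loop
toy.loop <- integer(max(toy.ends))
for (i in seq_along(toy.starts)) {
  idx <- toy.starts[i]:toy.ends[i]
  toy.loop[idx] <- toy.loop[idx] + 1L
}
stopifnot(identical(toy.cov, toy.loop))
print(toy.cov)
# [1] 0 1 3 2 1 1
```

Position 3 is covered three times because all three toy ranges contain it.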

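I tried using sparse vectors to cut down on memory usage. Using ordinary (dense) vectors might be faster, but I'm not sure. With sparseVector, it was 5.7 times faster than the loop approach for the given example.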

library(Matrix)

set.seed(123)

starts <- sample(10^6,replace = T)
ends <- starts+sample(100:1000,length(starts),replace=T)

v.cov <- NULL
fun1 <- function() {
  coverage <- integer(max(ends))
  for(i in seq(length(starts))) {
    coverage[starts[i]:ends[i]] <- coverage[starts[i]:ends[i]] + 1 
  }
  v.cov <<- coverage
}
# Testing "for loop" approach
system.time(fun1())
# user  system elapsed 
# 21.84    0.00   21.83 

v.sum <- NULL
fun2 <- function() {      
  # 1. Aggregate the number of ranges that start at each index
  t.starts <- table(starts)
  i.starts <- strtoi(names(t.starts))
  x.starts <- as.vector(t.starts)
  sv.starts <- sparseVector(x=x.starts, i=i.starts, length=max(ends)+1)  # to match length of sv.ends below
  # 2. Aggregate the number of ranges that end at one position before each index
  t.ends <- table(ends)
  i.ends <- strtoi(names(t.ends))+1  # because "ends" are inclusive 
  x.ends <- as.vector(t.ends)
  sv.ends <- sparseVector(x=x.ends, i=i.ends, length=max(ends)+1)

  sv.diff <- sv.starts - sv.ends
  v.sum <<- cumsum(sv.diff)[1:max(ends)]  # drop last element
}
# Testing "cumulative sum" approach
system.time(fun2())
# user  system elapsed 
# 3.828   0.000   3.823

identical(v.cov, v.sum)
# TRUE

Also, there is probably a better way to extract x and i for the sparseVector constructor than using table and strtoi(names(x)), which might boost the speed further.
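
One possible way to do that, sketched in base R only, is to count with tabulate and keep the non-zero bins, which avoids the round trip through character names. The helper name nonzero_counts is hypothetical. Its i and x components are meant to stand in for i.starts and x.starts in the sparseVector call:

```r
# Hypothetical helper: positions (i) and counts (x) of the non-zero bins
nonzero_counts <- function(pos) {
  cnt <- tabulate(pos)          # cnt[k] = how many times k occurs in pos
  i <- which(cnt > 0L)          # indices that occur at least once
  list(i = i, x = cnt[i])
}

pos <- c(5L, 2L, 5L, 9L, 2L, 5L)   # illustrative input
nz <- nonzero_counts(pos)

# Same result as the table/strtoi route
t.pos <- table(pos)
stopifnot(identical(nz$i, strtoi(names(t.pos))),
          identical(nz$x, as.vector(t.pos)))
```

Note that this builds a dense count vector of length max(pos) along the way, so it trades some memory for speed.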

EDIT

To avoid strtoi, use a 1-column sparseMatrix instead:

v.sum.mat <- NULL
fun3 <- function() {
  v.ones <- rep(1, length(starts))
  m.starts <- sparseMatrix(i=starts, j=v.ones, x=v.ones, dims=c(max(ends)+1,1))
  m.ends <- sparseMatrix(i=ends+1, j=v.ones, x=v.ones, dims=c(max(ends)+1,1))
  m.diff <- m.starts - m.ends
  v.sum.mat <<- cumsum(m.diff[,1])[1:max(ends)]
}
# Testing "cumulative sum" approach using matrix
system.time(fun3())
# user  system elapsed 
# 0.456   0.028   0.486 

identical(v.cov, v.sum.mat)
# TRUE

EDIT 2 - super fast, super short

Based on the comment from @alexis_laz, thank you!

fun4 <- function() {
  cumsum(tabulate(starts, max(ends) + 1L) - tabulate(ends + 1L, max(ends) + 1L))[1:max(ends)]
}
system.time(v.sum.tab <- fun4())
# user  system elapsed 
# 0.040   0.000   0.041 

identical(as.integer(v.cov), v.sum.tab)
# TRUE
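
For reuse it may be convenient to pass the vectors as arguments instead of relying on globals. The following is a sketch along the same lines as fun4. The name coverage_counts and the input checks are additions for illustration, and inclusive ends are assumed as in the question:

```r
# Hypothetical wrapper around the tabulate/cumsum idea (inclusive ends assumed)
coverage_counts <- function(starts, ends) {
  stopifnot(length(starts) == length(ends),
            all(starts >= 1), all(ends >= starts))
  n <- max(ends)
  # +1 where a range begins, -1 just past where it ends
  net <- tabulate(starts, nbins = n + 1L) - tabulate(ends + 1L, nbins = n + 1L)
  cumsum(net)[seq_len(n)]
}

# Small check against the explicit loop, on random illustrative data
set.seed(1)
s <- sample(50, 20, replace = TRUE)
e <- s + sample(0:10, 20, replace = TRUE)
loop.cov <- integer(max(e))
for (i in seq_along(s)) loop.cov[s[i]:e[i]] <- loop.cov[s[i]:e[i]] + 1L
stopifnot(identical(coverage_counts(s, e), loop.cov))
```

Both tabulate calls use the same nbins so the two count vectors line up element by element before the subtraction.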

Speed up simple R code (vectorization?)
