Question

Let v be a sequence of TRUE and FALSE values in R:

v = c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)

I would like to get the positions of the first and the last TRUE. One way to achieve this is

range(which(v)) # 7 25

but this solution is relatively slow: which must check every element of the vector to get the position of each TRUE, and range then loops over all of those positions, evaluating two if statements at each one (I think) to find the minimum and the maximum. It would be much more strategic to search for the first TRUE twice, once starting from the beginning and once from the end, and just return those two positions.

Is there a faster alternative to range(which(..))?

Answer 1

The simplest approach I can think of that doesn't involve searching the entire vector would be an Rcpp solution:

library(Rcpp)
cppFunction(
"NumericVector rangeWhich(LogicalVector x) {
  NumericVector ret(2, NumericVector::get_na());
  int n = x.size();
  for (int idx=0; idx < n; ++idx) {
    if (x[idx]) {
      ret[0] = idx+1;  // 1-indexed for R
      break;
    }
  }
  if (R_IsNA(ret[0]))  return ret;  // No true values
  for (int idx=n-1; idx >= 0; --idx) {
    if (x[idx]) {
      ret[1] = idx + 1;  // 1-indexed for R
      break;
    }
  }
  return ret;
}")
rangeWhich(v)
# [1]  7 25

We can benchmark on a fairly long vector (length 1 million) with random entries. We would expect to get pretty large efficiency gains from not searching through the whole thing with which:

set.seed(144)
bigv <- sample(c(F, T), 1000000, replace=T)
library(microbenchmark)
# range_find from @PierreLafortune
range_find <- function(v) {
  i <- 1
  while(!v[i]) {
    i <- i +1
  }
  j <- length(v)
  while(!v[j]) {
    j <- j-1
  }
  c(i,j)
}
# shortCircuit from @JoshuaUlrich
shortCircuit <- compiler::cmpfun({
  function(x) {
    first <- 1
    while(TRUE) if(x[first]) break else first <- first+1
    last <- length(x)
    while(TRUE) if(x[last]) break else last <- last-1
    c(first, last)
  }
})
microbenchmark(rangeWhich(bigv), range_find(bigv), shortCircuit(bigv), range(which(bigv)))
# Unit: microseconds
#                expr      min        lq        mean     median         uq       max neval
#    rangeWhich(bigv)    1.476    2.4655     9.45051     9.0640    13.7585    46.286   100
#    range_find(bigv)    1.445    2.2930     8.06993     7.2055    11.8980    26.893   100
#  shortCircuit(bigv)    1.114    1.6920     7.30925     7.0440    10.2210    30.758   100
#  range(which(bigv)) 6821.180 9389.1465 13991.84613 10007.9045 16698.2230 58112.490   100

The Rcpp solution is a good deal faster (more than 500x faster) than range(which(bigv)) because it doesn't need to iterate through the whole vector with which. For this example its runtime is near-identical to (in fact, slightly slower than) that of range_find from @PierreLafortune and shortCircuit from @JoshuaUlrich.

Using Joshua's excellent example of some worst-case behavior where the true value is in the very middle of the vector (I'm repeating his experiment with all proposed functions so we can see the whole picture), we see a very different situation:

bigv2 <- rep(FALSE, 1e6)
bigv2[5e5-1] <- TRUE
bigv2[5e5+1] <- TRUE
microbenchmark(rangeWhich(bigv2), range_find(bigv2), shortCircuit(bigv2), range(which(bigv2)))
# Unit: microseconds
#                 expr        min          lq        mean      median         uq        max neval
#    rangeWhich(bigv2)    546.206    555.3820    593.1385    575.3790    599.055    979.924   100
#    range_find(bigv2) 400057.083 406449.0075 434515.1142 411881.4145 427487.041 697529.163   100
#  shortCircuit(bigv2)  74942.612  75663.7835  79095.3795  76761.5325  79703.265 125054.360   100
#  range(which(bigv2))    632.086    679.0955    761.9610    700.1365    746.509   3924.941   100

For this vector the looping base R solutions are much slower than the original solution (100-600x slower) and the Rcpp solution is barely faster than range(which(bigv2)) (which makes sense, because they're both looping through the whole vector once).
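
To see why the loops suffer on this input, it can help to count how many elements the two short-circuit scans have to touch. This is only a minimal sketch: elements_visited is a hypothetical helper, not part of any answer here, and it assumes the vector contains at least one TRUE.

```r
# Hypothetical helper: number of elements visited by a forward scan to the
# first TRUE plus a backward scan to the last TRUE.
# Assumption: x contains at least one TRUE.
elements_visited <- function(x) {
  r <- range(which(x))
  r[1] + (length(x) - r[2] + 1)
}

# The example vector from the question
v <- c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)

# The worst-case vector: two TRUE values around the midpoint
bigv2 <- rep(FALSE, 1e6)
bigv2[5e5 - 1] <- TRUE
bigv2[5e5 + 1] <- TRUE

elements_visited(v)      # 13
elements_visited(bigv2)  # 999999
```

On bigv2 the two scans together visit essentially the whole vector, one interpreted loop iteration per element, which is consistent with the slowdown of the base R loops in the timings above.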

As usual, this needs to come with a disclaimer -- you need to compile your Rcpp function, which also takes time, so this will only be a benefit if you have very large vectors or are repeating this operation many times. From the comments on your question it sounds like you indeed have a large number of large vectors, so this could be a good option for you.

Answer 2

match is quick as it stops when it finds the value searched for:

c(match(T,v),length(v)-match(T,rev(v))+1)
[1]  7 25

But you would have to test the speeds.

Update:

range_find <- function(v) {
  i <- 1
  j <- length(v)
  # scan forward until the first TRUE
  while(!v[i]) {
    i <- i+1
  }
  # scan backward until the last TRUE
  while(!v[j]) {
    j <- j-1
  }
  c(i,j)
}

This approach matches your proposed strategy:

"It would be much more strategic to search for the first TRUE starting one from the beginning and one from the end and just return those positions."

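One caveat with the loop-based search: if the vector contains no TRUE at all, the index runs past the end, v[i] becomes NA, and while() stops with an error. A minimal sketch of a guarded variant follows. The name range_find_safe is hypothetical, and returning a pair of NA values for the "no TRUE" case is an assumption about the desired behaviour, not something the answer specifies.

```r
# Hypothetical guarded variant of the two-ended scan.
# Assumption: NA entries are treated as "not TRUE", and a vector without any
# TRUE yields c(NA, NA) instead of an error.
range_find_safe <- function(v) {
  n <- length(v)
  i <- 1L
  # scan forward, stopping at the first TRUE or at the end of the vector
  while (i <= n && !isTRUE(v[i])) i <- i + 1L
  if (i > n) return(c(NA_integer_, NA_integer_))
  # v[i] is TRUE, so the backward scan must stop at or after position i
  j <- n
  while (!isTRUE(v[j])) j <- j - 1L
  c(i, j)
}

v <- c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)
range_find_safe(v)                # 7 25
range_find_safe(c(FALSE, FALSE))  # NA NA
```

The extra bounds check and the isTRUE() call add some overhead per iteration, so this trades a little speed for robustness.
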
Answer 3

Just for fun. The simplest approach I can think of that doesn't involve searching the entire vector or Rcpp :P

shortCircuit <- compiler::cmpfun({
  function(x) {
    first <- 1
    while(TRUE) if(x[first]) break else first <- first+1
    last <- length(x)
    while(TRUE) if(x[last]) break else last <- last-1
    c(first, last)
  }
})
set.seed(144)
bigv <- sample(c(F, T), 1000000, replace=T)
library(microbenchmark)
microbenchmark(rangeWhich(bigv), shortCircuit(bigv))
# Unit: microseconds
#                expr   min     lq median     uq   max neval
#    rangeWhich(bigv) 1.722 1.8875 1.9995 2.1400 6.850   100
#  shortCircuit(bigv) 1.053 1.1905 1.3245 1.4545 9.207   100

Woohoo, I win! Oh, wait... let's compare the two on the worst possible case.

Oh no... I lost, badly. Oh well, at least I had fun. :)
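
A minimal sketch of how such a worst-case comparison could be reproduced with base R only. It assumes the bigv2 construction used in Answer 1 (two TRUE values around the midpoint) and uses system.time instead of microbenchmark, so the figures are coarse and will vary by machine. No particular timings are claimed here.

```r
# shortCircuit as defined in this answer
shortCircuit <- compiler::cmpfun({
  function(x) {
    first <- 1
    while(TRUE) if(x[first]) break else first <- first+1
    last <- length(x)
    while(TRUE) if(x[last]) break else last <- last-1
    c(first, last)
  }
})

# Assumed worst case: the only TRUE values sit in the middle of the vector
bigv2 <- rep(FALSE, 1e6)
bigv2[5e5 - 1] <- TRUE
bigv2[5e5 + 1] <- TRUE

# Coarse timings; each call is evaluated once
t_loop <- system.time(res_loop <- shortCircuit(bigv2))
t_vec  <- system.time(res_vec  <- range(which(bigv2)))

# Both approaches should agree on the positions (compare by value, since
# shortCircuit returns doubles and range(which(..)) returns integers)
all(res_loop == res_vec)
print(t_loop)
print(t_vec)
```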

