Question

Using R, I need to select a valid range for a given raster (from the raster package) as fast as possible. I tried this:

library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)

r <- raster(ncol=100, nrow=100)
r[] <- runif(ncell(r))

#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)

#Benchmark!
compare <- microbenchmark(
    calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}), 
    reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
    g(r, 0.2, 0.8),
    times=100)
autoplot(compare) #Reclassify is much faster, precompiling doesn't help much.

#Check they are the same...
identical(
          calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
          reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA))
) #TRUE
identical(
          reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
          g(r, 0.2, 0.8)
) #TRUE

The reclassify approach is much faster, but I am sure it can be sped up further. How can I do that?

Answer 1

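While the accepted answer to this question is correct for the example raster, it is important to note that the fastest safe function depends heavily on raster size: the functions h and i presented by @rengis are only faster for relatively small rasters (and relatively simple reclassifications). Simply increasing the size of the raster r in the OP's example by a factor of 10 makes reclassify faster: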

# Code from OP @AF7
library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)

#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)

# Functions from @rengis
h <- function(r, min, max) {
  rr <- r[]
  rr[rr < min | rr > max] <- NA
  r[] <- rr
  r
}

i <- cmpfun(h)

# Benchmark with larger raster (100k cells, vs 10k originally)
r <- raster(ncol = 1000, nrow = 100)
r[] <- runif(ncell(r))

compare <- microbenchmark(
  calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}), 
  reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
  g(r, 0.2, 0.8),
  h(r, 0.2, 0.8),
  i(r, 0.2, 0.8),
  times=100)
autoplot(compare)

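The exact point at which reclassify becomes faster depends on the number of cells in the raster and on the complexity of the reclassification, but in this case the crossover is at roughly 50,000 cells (see below).

As rasters get larger (or the computation gets more complex), another way to speed up reclassification is multithreading, for example with the snow package: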

# Reclassify, using clusterR to split into two threads
library(snow)
tryCatch({
      beginCluster(n = 2)
      clusterR(r, reclassify, args = list(rcl = c(-Inf, 0.2, NA, 0.8, Inf, NA)))
    }, finally = endCluster())

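Multithreading involves more setup overhead, so it only makes sense for very large rasters and/or more complex computations. In fact, I was surprised to find that it was not the best option under any of the conditions I tested below (perhaps it would be with a more complex reclassification?).

To illustrate, I plotted microbenchmark results using the OP's setup at intervals up to 10 million cells (10 runs each):

As a final note, compiling made no difference at any of the sizes tested.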
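The plot is not reproduced here. As a rough sketch of how such a size sweep could be run, the following times reclassify against h at a few raster sizes. The sizes, the `times = 10` setting, and the helper name `time_one_size` are illustrative assumptions, not the settings behind the original figure.

```r
library(raster)
library(microbenchmark)

# h as defined by @rengis
h <- function(r, min, max) {
  rr <- r[]
  rr[rr < min | rr > max] <- NA
  r[] <- rr
  r
}

# Hypothetical helper: benchmark both approaches for a raster of n cells
time_one_size <- function(n) {
  r <- raster(ncol = n / 100, nrow = 100)
  r[] <- runif(ncell(r))
  mb <- microbenchmark(
    reclassify = reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
    h          = h(r, 0.2, 0.8),
    times = 10)
  s <- summary(mb, unit = "ms")
  data.frame(cells = n, expr = s$expr, median_ms = s$median)
}

sizes   <- c(1e4, 5e4, 1e5)   # assumed sizes; extend as needed
timings <- do.call(rbind, lapply(sizes, time_one_size))
print(timings)
```

Comparing the median_ms column per size should show where the two approaches cross over on a given machine; the exact crossover will vary with hardware and package versions.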

Answer 2

Here is another way:

h <- function(r, min, max) {
  rr <- r[]
  rr[rr < min | rr > max] <- NA
  r[] <- rr
  r
}

i <- cmpfun(h)

identical(
  i(r, 0.2, 0.8),
  g(r, 0.2, 0.8)
)



#Benchmark!
compare <- microbenchmark(
  calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}), 
  reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
  g(r, 0.2, 0.8),
  h(r, 0.2, 0.8),
  i(r, 0.2, 0.8),
  times=100)
autoplot(compare)

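Compiling does not help much in this case.

You can speed things up even further by accessing the slots of the raster object directly with @ (although this is generally discouraged).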

j <- function(r, min, max) {
  v <- r@data@values
  v[v < min | v > max] <- NA
  r@data@values <- v
  r
}

k <- cmpfun(j)

identical(
  j(r, 0.2, 0.8)[],
  g(r, 0.2, 0.8)[]
)
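One reason direct slot access is discouraged: as far as I can tell, assigning to r@data@values bypasses the bookkeeping that `r[] <-` does, so the cached min/max are left stale (which is presumably why the check above compares the values via `[]` rather than the objects), and it only works when the values are held in memory. A more defensive sketch, where the name `j_safe` and the fallback to reclassify are my own assumptions rather than part of the answer:

```r
library(raster)

# Hypothetical guarded variant of j
j_safe <- function(r, min, max) {
  # Slot access only makes sense for in-memory rasters;
  # fall back to reclassify for file-backed ones
  if (!inMemory(r)) {
    return(reclassify(r, c(-Inf, min, NA, max, Inf, NA)))
  }
  v <- r@data@values
  v[v < min | v > max] <- NA
  r@data@values <- v
  # Recompute the cached min/max from the modified values
  setMinMax(r)
}

r <- raster(ncol = 100, nrow = 100)
r[] <- runif(ncell(r))
out <- j_safe(r, 0.2, 0.8)
```

The extra checks cost some of the speed gained by using the slots, so this is a trade-off rather than a free improvement.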

Answer 3

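The raster package has a function for this: clamp. It is faster than g, but slower than h and i, because it has some built-in overhead (safety checks).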

compare <- microbenchmark(
  h(r, 0.2, 0.8),
  i(r, 0.2, 0.8),
  clamp(r, 0.2, 0.8),
  g(r, 0.2, 0.8),
  times=100)
autoplot(compare)
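One caveat worth checking before swapping clamp in: according to the raster documentation, clamp defaults to `useValues = TRUE`, which sets out-of-range cells to the bounds rather than to NA. A small sketch of the difference (the raster here is a made-up 10 x 10 example):

```r
library(raster)

r_small <- raster(ncol = 10, nrow = 10)
r_small[] <- seq(0, 1, length.out = ncell(r_small))

# Default: values below 0.2 become 0.2, values above 0.8 become 0.8
clamped <- clamp(r_small, 0.2, 0.8)

# useValues = FALSE: out-of-range values become NA, like reclassify above
masked <- clamp(r_small, 0.2, 0.8, useValues = FALSE)

anyNA(clamped[])  # no NA expected with the default
anyNA(masked[])   # NA expected where values fell outside the range
```

If NA is what you need, `clamp(r, 0.2, 0.8, useValues = FALSE)` is the call that matches the other functions in this thread.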

Fastest way to select a valid range for raster data
