Question

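Imagine you have a fairly large dataset of 2,000,000 points randomly distributed over some polygonal region. The density function has to be estimated each time from a randomly selected sample of 4,000 points, and this process has to be repeated 200 times. My code does not handle this well. Any suggestions on how to improve it?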

# coord is SpatialPoints Object 
library(sp)
library(maptools)
library(map)

You can get the polygon object from the following link: https://www.dropbox.com/sh/65c3rke0gi4d8pb/LAKJWhwm-l

germG <- readShapePoly("vg250_gem.shp")
coord <- spsample(germG, 2e06, "random") # this command needs some minutes to be done. 

# R is the number of simulations
R <- 200
M <- matrix(NA,R, 256)
ptm=proc.time()
for(r in 1:R) {
  ix <- sample(1:2e06,size=4000)
  Dg <- spDists(coord[ix])
  Dg <- as.vector(Dg[Dg!=0])
  kg <- density(Dg,bw="nrd0",n=256)
  M[r,] <- kg$y
}
runningtime = proc.time()-ptm   
cat("total run time (sec) =",round(runningtime[3],1),"\n")

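The code above gives a total run time (sec) = 964.8 on a Core i3, 2.27 GHz, with 4 processors and 4 GB RAM. How can I speed up this kind of looped simulation? I would greatly appreciate any comments, criticism, and suggestions.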

Answer 1

Not an answer, just some observations:

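If R = number of iterations and S = sample size per iteration (e.g., R = 200 and S = 4000), then your run time will be ~O(R × S²). So doubling the number of runs and halving the sample size will cut the run time by a factor of 2.
The default distance metric in spDists(...) is Euclidean. If that is what you want, you are better off using dist(...), which is much more efficient (see the code below). If you want geographic distance (e.g., great circle), you must use spDists(..., longlat=TRUE).
spDists(...) computes the full distance matrix, not just the lower triangle. This means every nonzero distance appears twice, which affects your densities. This is why M1 and M2 in the code below are not equal.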
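For large S, profiling the code (using Rprof) shows that about 46% of the time is spent in density(...) and another 38% in spDists(...). Since nearly all the time goes into those two calls rather than into the loop itself, replacing the for loop with calls to apply, lapply, etc. is unlikely to help.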
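There are other functions for computing geographic distance matrices, assuming that is what you want, but none is faster than what you are already using. I tried earth.dist(...) in package fossil, distm(...) in package geosphere, and rdist.earth(...) in package fields, but none of these improved the run time.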
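
The point about duplicated distances can be checked on a small synthetic example. This is only a sketch: the points are drawn with runif in a unit square as a stand-in for the sampled coordinates (an assumption, not the original shapefile data), and the full matrix is built from dist(...) rather than spDists(...) so that the example needs no extra packages.

```r
# Synthetic stand-in for the sampled coordinates (assumption: unit square).
set.seed(42)
n  <- 50
xy <- cbind(runif(n), runif(n))

# dist() returns only the lower triangle: n*(n-1)/2 distances.
d_lower <- as.vector(dist(xy))

# A full distance matrix holds every off-diagonal distance twice.
d_full <- as.matrix(dist(xy))
d_full <- as.vector(d_full[d_full != 0])

length(d_lower)   # n*(n-1)/2 values
length(d_full)    # n*(n-1) values

# The default bandwidth rule depends on the number of observations,
# so duplicating every value changes the bandwidth and hence the density.
bw_lower <- bw.nrd0(d_lower)
bw_full  <- bw.nrd0(d_full)
c(lower = bw_lower, full = bw_full)
```

Because bw.nrd0 scales with the number of observations to the power -0.2, the duplicated vector gets a narrower bandwidth, which is one reason two otherwise identical runs can produce different density values.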

Code:

library(sp)
library(maptools)
germG <- readShapePoly("vg250_gem.shp")
coord <- spsample(germG, 1e4, "random") # Just 10,000 points...
R <- 200

# dist(...) and spDists(..., longlat=F) give same result
zz <- coord[sample(1e4,size=200)]
d1 <- spDists(zz)
d2 <- dist(zz@coords)
max(abs(as.matrix(d1)-as.matrix(d2)))
# [1] 0
# but dist(...) is much faster
M1 <- matrix(NA,R, 256)
set.seed(1)
system.time({
  for(r in 1:R) {
    ix <- sample(1e4,size=200)    # S = 200; test case
    Dg <- spDists(coord[ix])      # using spDists(...)
    Dg <- as.vector(Dg[Dg!=0])
    kg <- density(Dg,bw="nrd0",n=256)
    M1[r,] <- kg$y
  }
})
#    user  system elapsed 
#   11.08    0.17   11.28 

M2 <- matrix(NA,R, 256)
set.seed(1)
system.time({
  for(r in 1:R) {
    ix <- sample(1e4,size=200)    # S = 200; test case
    Dg <- dist(coord[ix]@coords)  # using dist(...)
    Dg <- as.vector(Dg[Dg!=0])
    kg <- density(Dg,bw="nrd0",n=256)
    M2[r,] <- kg$y
  }
})
# user  system elapsed 
# 2.67    0.03    2.73

Edited in response to OP's request. Below is the profiling code for sample size = 200.

R=200
M <- matrix(NA,R, 256)
Rprof("profile")
set.seed(1)
system.time({
  for(r in 1:R) {
    ix <- sample(1e4,size=200)    # S = 200; test case
    Dg <- spDists(coord[ix])      # using spDists(...)
    Dg <- as.vector(Dg[Dg!=0])
    kg <- density(Dg,bw="nrd0",n=256)
    M[r,] <- kg$y
  }
})
Rprof(NULL)
head(summaryRprof("profile")$by.total,10)
#                   total.time total.pct self.time self.pct
# "system.time"          11.52    100.00      0.02     0.17
# "spDists"               7.08     61.46      0.02     0.17
# "matrix"                6.76     58.68      0.24     2.08
# "apply"                 6.58     57.12      0.26     2.26
# "FUN"                   5.88     51.04      0.22     1.91
# "spDistsN1"             5.66     49.13      3.36    29.17
# "density"               3.18     27.60      0.02     0.17
# "density.default"       3.16     27.43      0.06     0.52
# "bw.nrd0"               1.98     17.19      0.00     0.00
# "quantile"              1.76     15.28      0.02     0.17

As S gets larger, computing the density starts to dominate, because the distances have to be sorted. You can see this by running the code with ix <- sample(1e4,size=4000).
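
One way to watch this shift is to time the distance step and the density step separately for a few values of S. The sketch below assumes synthetic points in a unit square in place of coord, uses dist(...) for the distances, and introduces a hypothetical helper time_steps for illustration. Actual timings depend on the machine, so none are shown.

```r
# Synthetic points in a unit square stand in for `coord` (assumption).
set.seed(1)
pts <- cbind(runif(1e4), runif(1e4))

# Hypothetical helper: time the two steps for one sample of size S.
time_steps <- function(S) {
  ix     <- sample(nrow(pts), size = S)
  t_dist <- system.time(Dg <- as.vector(dist(pts[ix, ])))[["elapsed"]]
  t_dens <- system.time(kg <- density(Dg, bw = "nrd0", n = 256))[["elapsed"]]
  c(S = S, n_dist = length(Dg), dist_sec = t_dist, density_sec = t_dens)
}

# One row per sample size; compare the last two columns as S grows.
timings <- t(sapply(c(200, 500, 1000), time_steps))
timings
```

The n_dist column grows as S(S-1)/2, which is the number of values density(...) has to process on each iteration.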

Answer 2

You may find it slightly faster to define the empty matrix Dg in advance.

Beyond that, you might want to consider a multicore approach, provided you have enough RAM.
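
A minimal sketch of what a multicore version could look like, using the parallel package that ships with R. The points are synthetic (unit square) rather than the original coord object, R and S are scaled down, the helper one_run is hypothetical, and dist(...) is used for Euclidean distances. All of these are assumptions made for illustration.

```r
library(parallel)

# Synthetic points stand in for `coord` (assumption: unit square, 1e4 points).
set.seed(1)
pts <- cbind(runif(1e4), runif(1e4))

R <- 8      # number of simulations (200 in the question)
S <- 200    # sample size per simulation (4000 in the question)

# One simulation: sample S points, take pairwise distances, estimate the density.
one_run <- function(r, pts, S) {
  ix <- sample(nrow(pts), size = S)
  Dg <- as.vector(dist(pts[ix, ]))
  density(Dg, bw = "nrd0", n = 256)$y
}

cl <- makeCluster(2)                   # two workers; adjust to your machine
clusterSetRNGStream(cl, iseed = 1)     # reproducible random streams per worker
res <- parLapply(cl, seq_len(R), one_run, pts = pts, S = S)
stopCluster(cl)

M <- do.call(rbind, res)               # R x 256 matrix, one row per simulation
dim(M)
```

Each worker receives its own copy of the points and builds its own distance vector, so memory use grows with the number of workers. With S = 4000 a single distance vector already holds about 8 million values, which is why the RAM caveat matters.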
