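This code works well, but it is a bit slow. I noticed that it runs on only one core of my processor. It would probably be faster if it used multiple cores.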
### proximity filter
options("scipen" = 100)
library(geosphere)   # distHaversine()
library(data.table)  # setDT()
library(dplyr)       # %>%, slice(), rowwise(), mutate()

# split up data into regions
splitdt <- split(geocities, geocities$airport_code)

## reduce cities
dat <- geocities[FALSE, ][]
currentregion <- 1
while (currentregion <= NROW(splitdt)) {
  workingregion <- as.data.frame(splitdt[[currentregion]]) ## set region
  workingregion$remove <- FALSE
  setDT(workingregion)
  #plot(workingregion$longitude, workingregion$latitude)
  currentorigin <- 1
  while (currentorigin <= NROW(workingregion)) {
    # choose which row to use
    # as the first part of the distance formula
    workingorigin <- workingregion[, c("longitude", "latitude")] %>% slice(currentorigin) ## set LeadingRow city
    setDT(workingorigin)
    # calculate the distance from the specific row chosen
    # and flag cities closer than the 17000 m (17 km) threshold used below
    workingregion <- workingregion %>% rowwise() %>% mutate(remove =
      ifelse(distHaversine(c(longitude, latitude), workingorigin) != 0 & # keep workingorigin city
             distHaversine(c(longitude, latitude), workingorigin) < 17000, TRUE, remove))
    # remove matched cities
    workingregion <- workingregion[workingregion$remove != TRUE, ]
    currentorigin <- currentorigin + 1
  }
  currentregion <- currentregion + 1
  # save results
  workingregion <- workingregion[workingregion$remove != TRUE, ]
  dat <- rbind(dat, workingregion) #, fill=TRUE
}
Answer 0 (score: 1)
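The first line I noticed is: dat <- rbind(dat, workingregion)
This line grows an object (here a data frame) dynamically inside the loop. That is not recommended and will be slow, because each call to rbind copies everything accumulated so far.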
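As a minimal sketch of the alternative to growing an object with rbind inside a loop, the per-region results can be stored in a preallocated list and bound once at the end. The function make_region and the toy sizes are hypothetical stand-ins, not part of the original code.

```r
# Hypothetical stand-in for one region's filtered result.
make_region <- function(i) {
  data.frame(region = i, value = seq_len(3))
}

n_regions <- 5

# Slow pattern: the accumulated data frame is copied on every iteration.
grown <- data.frame()
for (i in seq_len(n_regions)) {
  grown <- rbind(grown, make_region(i))
}

# Usually faster: preallocate a list, fill it, and bind once at the end.
pieces <- vector("list", n_regions)
for (i in seq_len(n_regions)) {
  pieces[[i]] <- make_region(i)
}
collected <- do.call(rbind, pieces)

# Both approaches produce the same number of rows and columns.
stopifnot(identical(dim(grown), dim(collected)))
```

With only a handful of regions the difference is negligible; it tends to matter as the number of iterations grows. If data.table is already loaded, rbindlist(pieces) is an alternative to do.call(rbind, pieces).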
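I know this does not answer your question about parallelizing the loop. However, I recently did a similar exercise, collecting the results of 100,000 SQL queries, and paying attention to how memory is used made my code 60 times faster.
I also parallelized the code with foreach and %dopar%. This works well on Windows, and it is easy to set up a cluster (one instance of R on each core).
Here is an example that may help you: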
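library(parallel)
library(doParallel)  # also attaches foreach

# Uses all but one core
cl <- makeCluster(detectCores() - 1)
# Register the cluster; without this, %dopar% runs sequentially with a warning
registerDoParallel(cl)

# Necessary to give your instances of R on each core the necessary tools to do what
# happens in loop
clusterExport(cl, '<variable names>')
clusterEvalQ(cl, library(packages))  # replace 'packages' with each package the loop needs

# parallel loop for going through each region (in your case)
# iterates over splitdt, sending one element to a core at a time;
# by default the result is a list with one element per region
results <- foreach(currentregion = splitdt) %dopar% {
  <body of loop>
}

# Shut down cluster
stopCluster(cl)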
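To make the template above concrete, here is a small self-contained sketch. It assumes the doParallel package is installed; the toy data frame, the two-worker cluster, and the even-value filter are hypothetical placeholders for geocities and the real distance filter.

```r
library(parallel)
library(doParallel)  # also attaches foreach

# Hypothetical toy data standing in for geocities.
toy <- data.frame(
  airport_code = rep(c("AAA", "BBB", "CCC"), each = 4),
  value = 1:12
)
splitdt <- split(toy, toy$airport_code)

# Two workers keep the sketch light; detectCores() - 1 is the suggestion above.
cl <- makeCluster(2)
registerDoParallel(cl)

# Each iteration receives one region's data frame.
# .combine = rbind stacks the per-region results into a single data frame.
dat <- foreach(workingregion = splitdt, .combine = rbind) %dopar% {
  # Placeholder for the real filter: keep rows with an even value.
  workingregion[workingregion$value %% 2 == 0, ]
}

stopCluster(cl)
```

Because each region is filtered independently of the others, the outer loop is the natural place to parallelize. For the real filter, the .packages argument of foreach (for example .packages = c("geosphere", "dplyr")) is one way to load the required packages on each worker.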
Here are some resources on speeding up R code:
http://adv-r.had.co.nz/Performance.html (by the man himself, Hadley Wickham)
https://csgillespie.github.io/efficientR/performance.html
Hope this helps, and good luck!