How can I speed up or parallelize this R code?

Asked: 2018-12-21 19:57:46

Tags: r parallel-processing

This code works fine, but it is somewhat slow. I noticed that it only runs on one core of my processor. It might be faster if it used multiple cores.

### proximity filter
options("scipen" = 100)
library(geosphere)
library(dplyr)       # needed for %>%, slice(), rowwise(), mutate()
library(data.table)  # needed for setDT()

# split up data into regions
splitdt<-split(geocities, geocities$airport_code)

## reduce cities
dat <- geocities[FALSE, ]  # empty data frame with the same columns as geocities
currentregion=1

while (currentregion <= NROW(splitdt)){
    workingregion <- as.data.frame(splitdt[[currentregion]]) ## set region
    workingregion$remove = FALSE
    setDT(workingregion)
    #plot(workingregion$longitude,workingregion$latitude)
    currentorigin=1

    while (currentorigin <= NROW(workingregion)) {
        # choose which row to use
        # as the first part of the distance formula
        workingorigin <- workingregion[,c("longitude","latitude")] %>% slice(currentorigin) ## set LeadingRow city
        setDT(workingorigin)

        # calculate the distance from the chosen row and flag
        # cities that are within 17 km of it for removal
        workingregion <- workingregion %>%
            rowwise() %>%
            mutate(remove = ifelse(
                distHaversine(c(longitude, latitude), workingorigin) != 0 &  # keep the workingorigin city itself
                distHaversine(c(longitude, latitude), workingorigin) < 17000,
                TRUE, remove))

        # remove matched cities
        workingregion <- workingregion[workingregion$remove!=TRUE,]

        currentorigin = currentorigin+1
    }
    currentregion = currentregion+1
    # save results
    workingregion <- workingregion[workingregion$remove!=TRUE,]
    dat <- rbind(dat, workingregion) #, fill=TRUE
}
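As a side note on the hot spot above: `distHaversine` is vectorized, so the distances from one origin city to every city in the region can be computed in a single call instead of with `rowwise()`. Below is a minimal sketch with made-up coordinates; the `longitude`/`latitude` column names match the ones used above, and the 17 000 m threshold is taken from the code.

```r
library(geosphere)

# A toy region with three cities (made-up coordinates)
workingregion <- data.frame(
  city      = c("A", "B", "C"),
  longitude = c(13.40, 13.41, 14.50),
  latitude  = c(52.52, 52.53, 53.00)
)

# Distances in metres from the first city to every city, in one call
origin <- as.matrix(workingregion[1, c("longitude", "latitude")])
pts    <- as.matrix(workingregion[, c("longitude", "latitude")])
d      <- distHaversine(pts, origin)

# Flag cities within 17 km of the origin (but keep the origin itself)
remove <- d != 0 & d < 17000
```

Here city B (about 1.3 km from A) gets flagged, while city C (roughly 90 km away) does not.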

1 Answer:

Answer 0 (score: 1)

The first thing I noticed is this line: dat <- rbind(dat, workingregion)

This line grows an object dynamically inside a loop, which is not recommended and will be slow: each rbind copies the entire accumulated data frame, so the total cost grows quadratically with the number of iterations.
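A minimal illustration of the faster pattern: collect the pieces in a pre-sized list and bind them once at the end, instead of calling rbind on every iteration. The loop body here is a trivial stand-in for the per-region work.

```r
# Slow pattern (copies dat on every iteration):
#   dat <- rbind(dat, workingregion)

# Faster pattern: collect pieces in a pre-allocated list, bind once
n <- 100
pieces <- vector("list", n)
for (i in seq_len(n)) {
  # stand-in for the per-region result; in the real code this
  # would be the filtered workingregion for region i
  pieces[[i]] <- data.frame(id = i, value = i^2)
}
dat <- do.call(rbind, pieces)
```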

I know this doesn't answer your question about parallelizing the loop. However, I just went through a similar exercise collecting the results of 100,000 SQL queries, and sped up my code roughly 60x by being careful about memory in exactly this way.

I also parallelized my code with foreach and %dopar%. This works well on Windows, and it is easy to set up a cluster (an instance of R on each core).

Below is an example that may help:

library(parallel)
library(doParallel)
library(foreach)

# Use all but one core
cl = makeCluster(detectCores() - 1)
registerDoParallel(cl)  # register the cluster so %dopar% can find it

# Give each R instance in the cluster the variables and
# packages that the loop body needs
clusterExport(cl, '<variable names>')
clusterEvalQ(cl, library(<package name>))

# parallel loop for going through each region (in your case)
foreach(currentregion = splitdt) %dopar%  # distributes the elements of splitdt across the cores
{
<body of loop>
}

# Shut down cluster
stopCluster(cl)
stopImplicitCluster()
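Note that foreach returns one result per iteration (a list by default); a .combine function can stack those results into the final data frame directly, replacing the rbind-in-a-loop pattern entirely. A self-contained sketch, using mtcars split into groups as a stand-in for splitdt and a trivial filter as a stand-in for the loop body; the sequential %do% is used here so it runs without a cluster, and can be swapped for %dopar% once a backend is registered:

```r
library(foreach)

splitdt <- split(mtcars, mtcars$cyl)  # stand-in for the per-region split

# foreach returns one result per iteration; .combine = rbind stacks
# them into a single data frame. Swap %do% for %dopar% after calling
# registerDoParallel(cl) to run the iterations in parallel.
dat <- foreach(region = splitdt, .combine = rbind) %do% {
  region[region$mpg > 20, ]  # stand-in for the proximity filter
}
```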

Here are some resources on speeding up R code:

http://adv-r.had.co.nz/Performance.html (by the man himself)
https://csgillespie.github.io/efficientR/performance.html

Hope this helps, and good luck!