Speed up the parallel processing of Mann-Kendall tests over a huge dataset in R

Date: 2019-02-04 17:31:25

Tags: r parallel-processing

Let's assume we have a large dataset of global climate data at monthly time steps. The dataset is then shaped as a data.frame of the following type:

lon,lat,data_month_1_yr_1,...,data_month_12_yr_100

Example:

set.seed(123)
data <- data.frame(cbind(runif(10000, -180, 180), runif(10000, -90, 90)),
                   replicate(1200, runif(10000, 0, 150)))

I want to perform a Mann-Kendall test (using trend::mk.test) on the monthly time series of each spatial point and collect the main statistics in a data.frame. To speed up this very long process, I parallelized the code and wrote the following:

coords <- data[, 1:2] # get the coordinates out of the initial dataset
names(coords) <- c("lon", "lat")
data_t <- as.data.frame(t(data[, 3:1202])) # each column is now the time series associated with a point
data_t$month <- rep(seq(1, 12, 1), 100) # month index as last column of the data frame
# start the parallel processing

library(foreach)
library(doParallel)

cores <- detectCores() # count cores
cl <- makeCluster(cores - 1) # use all cores minus 1 so as not to overload the PC
registerDoParallel(cl)

mk_out <- foreach(m = 1:12, .combine = rbind) %:%
  foreach(a = 1:10000, .combine = rbind) %dopar% {

    data_m <- data_t[which(data_t$month == m), ]
    library(trend) # need to load this every time, otherwise I get an error (don't know why)
    test <- mk.test(data_m[, a])
    mk_out_temp <- data.frame("lon" = coords[a, 1],
                              "lat" = coords[a, 2],
                              "p.value" = as.numeric(test$p.value),
                              "z_stat" = as.numeric(test$statistic),
                              "tau" = as.numeric(test$estimates[3]),
                              "month" = as.numeric(m))
    mk_out_temp
  }
stopCluster(cl)

head(mk_out)
         lon       lat    p.value     z_stat         tau month
1  -76.47209 -34.09350 0.57759040 -0.5569078 -0.03797980     1
2  103.78985 -31.58639 0.64436238  0.4616081  0.03151515     1
3  -32.76831  66.64575 0.11793238  1.5635113  0.10626263     1
4  137.88627 -30.83872 0.79096910  0.2650524  0.01818182     1
5  158.56822 -67.37378 0.09595919 -1.6647673 -0.11313131     1
6 -163.59966 -25.88014 0.82325630  0.2233588  0.01535354     1

This runs just fine and gives me exactly what I am after: a matrix reporting the M-K statistics for each combination of coordinates and month. Although the process runs in parallel, the computation still takes a considerable amount of time.

Is there a way to speed this process up? Is there room for using functions of the apply family?

2 answers:

Answer 0 (score: 1):

As you note, you have already partly solved the problem yourself. Further speedups can be obtained through one of the following steps:

1: Copy the necessary objects into the foreach loop with the .packages and .export arguments. This ensures that the instances do not clash while trying to access the same memory (a minimal sketch of this option follows the list).

2: Use a high-performance library, such as data.table or the tidyverse, to perform the subsetting and computations.
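
For the first option, here is a minimal sketch (my own illustration, not code from the question) of the same loop, assuming the data_t and coords objects defined above: .packages loads trend on every worker and .export copies the required objects, so the library() call inside the loop is no longer needed.

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

mk_out <- foreach(m = 1:12, .combine = rbind) %:%
  foreach(a = 1:10000, .combine = rbind,
          .packages = "trend",               # load trend on every worker
          .export = c("data_t", "coords")) %dopar% {
    data_m <- data_t[which(data_t$month == m), ]
    test <- mk.test(data_m[, a])
    data.frame(lon = coords[a, 1], lat = coords[a, 2],
               p.value = test$p.value,
               z_stat = unname(test$statistic),
               tau = unname(test$estimates[3]),
               month = m)
  }
stopCluster(cl)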

The second option is slightly more involved, but it boosted performance enormously on my modest little laptop (the whole dataset took roughly 1.5 minutes for all the computations).

The code I added is below. Note that I replaced foreach with a single parLapply function from the parallel package.

set.seed(123)
data <- data.frame(cbind(runif(10000, -180, 180), runif(10000, -90, 90)),
                   replicate(1200, runif(10000, 0, 150)))

coords <- data[, 1:2] # get the coordinates out of the initial dataset
names(coords) <- c("lon", "lat")
data_t <- as.data.frame(t(data[, 3:1202])) # each column is now the time series associated with a point
data_t$month <- rep(seq(1, 12, 1), 100) # month index as last column of the data frame
# start the parallel processing

library(data.table)
library(parallel)
library(trend)
setDT(data_t)
setDT(coords)
cores <- detectCores() # count cores
cl <- makeCluster(cores - 1) # use all cores minus 1 so as not to overload the PC

#user  system elapsed 
#17.80   35.12   98.72
system.time({
  test <- data_t[,parLapply(cl, 
                            .SD, function(x){
                              (
                                unlist(
                                  trend::mk.test(x)[c("p.value","statistic","estimates")]
                                )
                               )
                              }
                            ), by = month] #Perform the calculations across each month
  #create a column that indicates what each row is measuring
  rows <- rep(c("p.value","statistic.z","estimates.S","estimates.var","estimates.tau"),12)

  final_tests <- dcast( #Cast the melted structure to a nice form
                      melt(cbind(test,rowname = rows), #Melt the data for a better structure
                        id.vars = c("rowname","month"), #Grouping variables
                        measure.vars = paste0("V",seq.int(1,10000))), #variable names
                      month + variable ~ rowname, #LHS groups the data along rows, RHS decides the value columns
                      value.var = "value", #which column contains the values?
                      drop = TRUE) #should we drop unused columns? (doesn't matter here)
  #rename the columns as desired
  names(final_tests) <- c("month","variable","S","tau","var","p.value","z_stat")
  #finally add the coordinates
  final_tests <- cbind(final_tests, coords)
})
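
One detail worth adding to the snippet above: the cluster is never released, so stop it once the computation has finished:

stopCluster(cl)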

Answer 1 (score: 0):

In the end, the problem was easily tackled by replacing the second loop with an sapply call (inspired by this answer). The execution time is now contained within a few seconds. Vectorizing remains the best solution for execution time in R (see this post and this one).

I share the final code below for reference:
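
A minimal sketch of that approach (an illustrative reconstruction, not necessarily the exact code that was posted), assuming the data_t and coords objects defined in the question; the outer loop over months stays parallel while the inner loop over the 10000 points becomes a single sapply call:

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

mk_out <- foreach(m = 1:12, .combine = rbind, .packages = "trend") %dopar% {
  data_m <- data_t[data_t$month == m, 1:10000] # the 100 yearly values of month m, one column per point
  stats <- sapply(data_m, function(x) {
    test <- mk.test(x)
    c(p.value = test$p.value,
      z_stat = unname(test$statistic),
      tau = unname(test$estimates[3]))
  })
  data.frame(lon = coords$lon, lat = coords$lat, t(stats), month = m)
}
stopCluster(cl)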
