foreach R包在并行运行时组合数据的问题

时间:2016-08-25 15:23:44

标签: r parallel-foreach domc

鉴于我正在使用的数据大小,我想并行处理

我已经设置了如下代码,保持一个核心空闲,因此没有使用整个机器

library(DoMC)
library(foreach)
library(itertools)

num_cores <- round(detectCores()*1-1) # num_cores is 7 in this case

registerDoMC(num_cores)

test_prediction <-
    data.frame(
    foreach(d=isplitRows(test, chunks = num_cores),
            .combine=c, 
            .packages=c("stats")) %dopar% {
                predict(cfModel, newdata=d)
            }
    )

问题是返回的test_prediction比测试的行少,我无法弄清楚为什么

几次尝试返回的行告诉我,.combine中的foreach没有从某些核心收集数据,虽然我不确定如何确认这个理论< / p>

总行数603,054

Attempt 1: rows returned > 516,903 - 6/7s of data returned
Attempt 2: rows returned > 344,602 - 4/7s of data returned
Attempt 3: rows returned > 430,753 - 5/7s of data returned

这只有在并行运行时才会发生,如果我使用%do%选项而不是返回正确的行数 - 虽然我不确定如何进一步研究这个理论?

一般情况下,如果有更好的方法并行运行,有什么帮助吗?将不胜感激

会议信息:

> sessionInfo()
R version 3.3.0 beta (2016-03-30 r70404)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] miniCRAN_0.2.7      markdown_0.7.7      slackr_1.4.2        readr_0.2.2         readxl_0.1.1        testthat_1.0.2      R2HTML_2.3.2        itertools_0.1-3     XML_3.98-1.4       
[10] rvest_0.3.2         xml2_1.0.0          devtools_1.12.0     doParallel_1.0.10   rjson_0.2.15        RCurl_1.95-4.8      bitops_1.0-6        bit64_0.9-5         bit_1.1-12         
[19] qcc_2.6             optiRum_0.37.3      scales_0.4.0        doMC_1.3.4          iterators_1.0.8     foreach_1.4.3       pryr_0.1.2          party_1.0-25        strucchange_1.5-1  
[28] sandwich_2.3-4      zoo_1.7-13          modeltools_0.2-21   mvtnorm_1.0-5       e1071_1.6-7         randomForest_4.6-12 caret_6.0-70        lattice_0.20-29     timeDate_3012.100  
[37] Kmisc_0.5.0         reshape2_1.4.1      gridExtra_2.2.1     tidyr_0.5.1         dplyr_0.5.0         plyr_1.8.4          data.table_1.9.6    sendmailR_1.2-1     RPostgreSQL_0.4-1  
[46] ggplot2_2.1.0       lubridate_1.5.6     stringr_1.0.0       sqldf_0.4-10        RSQLite_1.0.0       DBI_0.4-1           gsubfn_0.6-6        proto_0.3-10       

loaded via a namespace (and not attached):
 [1] nlme_3.1-128       pbkrtest_0.4-6     httr_1.2.1         tools_3.3.0        R6_2.1.2           lazyeval_0.2.0     mgcv_1.8-3         colorspace_1.2-6   nnet_7.3-8         withr_1.0.2       
[11] compiler_3.3.0     chron_2.3-47       quantreg_5.26      SparseM_1.7        AUC_0.3.0          digest_0.6.9       minqa_1.2.4        base64enc_0.1-3    lme4_1.1-12        jsonlite_1.0      
[21] car_2.1-2          magrittr_1.5       Matrix_1.2-6       Rcpp_0.12.6        munsell_0.4.3      stringi_1.1.1      multcomp_1.4-6     MASS_7.3-35        crayon_1.3.2       splines_3.3.0     
[31] knitr_1.13         tcltk_3.3.0        codetools_0.2-9    nloptr_1.0.4       MatrixModels_0.4-1 gtable_0.2.0       assertthat_0.1     coin_1.1-2         class_7.3-11       survival_2.39-5   
[41] tibble_1.1         memoise_1.0.0      TH.data_1.0-7   

1 个答案:

答案 0 :(得分:0)

这对我来说很好用:

library(doMC)
library(foreach)
library(itertools)

num_cores <- 2
registerDoMC(num_cores)

predict.fake <- function(object, newdata) {
  rowSums(newdata)
}

cfModel <- structure(NULL, class = "fake")
test <- matrix(rnorm(2000), ncol = 2)
pred <- predict(cfModel, test)

test_prediction <-
  foreach(d=isplitRows(test, chunks = num_cores),
          .combine=c, 
          .packages=c("stats")) %dopar% {
            predict(cfModel, newdata=d)
          }

all.equal(test_prediction, pred)