glmnet with doMC fails silently

Date: 2018-03-01 09:38:38

Tags: r parallel-processing glmnet domc

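I'm using glmnet to fit some models and am cross-validating over lambda. By default I use cv.glmnet (since it handles the cross-validation over lambda internally), but below I focus on the first step of that function, which is the one causing the problem.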

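First, the data setup. I haven't made a reproducible example and can't share the original data, but dim(smat) is roughly 4.7M rows by 50 columns, about half of which are dense. I tried a naive approach to reproducing the issue with completely random columns, to no avail.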

# data setup (censored)
library(data.table)
DT = fread(...)
n_cv = 10L

# assign cross-validation group to an ID (instead of to a row)
IDs = DT[ , .(rand_id = runif(1L)), keyby = ID]
IDs[order(rand_id), cv_grp := .I %% n_cv + 1L]
DT[IDs, cv_grp := i.cv_grp, on = 'ID']

# key by cv_grp to facilitate subsetting different training sets
setkey(DT, cv_grp)
# assign row number as column to facilitate subsetting model matrix
DT[ , rowN := .I]

library(glmnet)
library(Matrix)

# y is 0/1 (actually TRUE/FALSE)
model = y ~ ...
smat = sparse.model.matrix(model, data = DT)
# this is what's done internally to 0-1 data to create
#   an n x 2 matrix with FALSE in the 1st and TRUE in the 2nd column
ymat = diag(2L)[factor(DT$y), ]
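
To make the ymat line above concrete, here is a tiny illustration on a made-up y vector (the values are not from the real data). Indexing the 2 x 2 identity matrix by the factor's integer codes picks row 1 for FALSE and row 2 for TRUE, so each output row has a single 1 in the column matching its class.

```r
# made-up response vector, for illustration only
y = c(TRUE, FALSE, TRUE, TRUE)

# factor(y) has levels FALSE, TRUE with integer codes 1, 2;
# using those codes as row indices into diag(2L) one-hot encodes y
ymat = diag(2L)[factor(y), ]
colnames(ymat) = levels(factor(y))
print(ymat)
```

This relies on both classes being present in y, so that factor(y) has exactly two levels.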

The following is a custom version of what cv.glmnet does before passing on to cv.lognet:

train_models = lapply(seq_len(n_cv), function(i) {
  train_idx = DT[!.(i), rowN]
  glmnet(smat[train_idx, , drop = FALSE], ymat[train_idx, ],
         alpha = 1, family = 'binomial')
})

This appears to work fine, but it is slow. If we replace it with the equivalent of the parallel = TRUE version:

library(doMC)
registerDoMC(detectCores())
train_models_par = foreach(i = seq_len(n_cv), .packages = c("glmnet", "data.table")) %dopar% {
  train_idx = DT[!.(i), rowN]
  glmnet(smat[train_idx, , drop = FALSE], ymat[train_idx, ],
         alpha = 1, family = 'binomial')
}

the glmnet call fails silently on some nodes (compare any(sapply(train_models, is.null)), which is FALSE):

sapply(train_models_par, is.null)
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

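Which tasks fail is inconsistent (so the problem is not, e.g., cv_grp = 2 per se). I have tried capturing the output of glmnet and checking is.null on it, to no avail. I also added the .verbose = TRUE flag to foreach, and nothing suspicious shows up. Note that the data.table syntax is auxiliary: the default behavior of cv.glmnet (which leads to similar failures) relies on which = foldid == i to split the training and test sets.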
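
For comparison, the foldid-style split mentioned above can be sketched in base R without data.table. This is only an illustration on made-up IDs. The names id, fold_by_id, held_out and splits are hypothetical, and the fold assignment by ID mirrors the setup earlier in the question rather than anything cv.glmnet does by default.

```r
set.seed(1L)
n_cv = 3L

# made-up data: 6 IDs with 2 rows each
id = rep(1:6, each = 2L)
ids = unique(id)

# assign one fold per ID, so that all rows of an ID land in the same fold
fold_by_id = sample(rep_len(seq_len(n_cv), length(ids)))
foldid = fold_by_id[match(id, ids)]

# for fold i, the held-out rows are those with foldid == i;
# the model would be trained on the remaining rows
splits = lapply(seq_len(n_cv), function(i) {
  held_out = foldid == i
  list(train = which(!held_out), test = which(held_out))
})
```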

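How can I debug this? Why might tasks fail when run in parallel but not serially, and how can I detect that a task has failed (so that I can, e.g., retry it)?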
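
One way to at least detect and retry failed tasks is sketched below. It makes several assumptions. fit_fold is a hypothetical stand-in for the glmnet() call on fold i, and fit_with_retry is a hypothetical helper, not part of foreach or glmnet. As far as I know, doMC runs its tasks through parallel::mclapply, which leaves a NULL in the result for a job whose forked child died without delivering a value. Whether that is what happens here (for example, a child being killed under memory pressure) is a guess, not something the output above confirms.

```r
library(foreach)
library(doMC)

# using fewer workers than detectCores() also lowers peak memory use
registerDoMC(2L)

# hypothetical stand-in for the glmnet() call on fold i
fit_fold = function(i) list(fold = i)

# hypothetical helper: rerun only the folds that came back NULL or as an error
fit_with_retry = function(folds, fit_fun, max_tries = 3L) {
  results = vector("list", length(folds))
  todo = seq_along(folds)
  for (attempt in seq_len(max_tries)) {
    out = foreach(i = folds[todo]) %dopar% {
      # return the condition object so that R-level errors stay visible
      tryCatch(fit_fun(i), error = function(e) e)
    }
    results[todo] = out
    failed = vapply(results,
                    function(r) is.null(r) || inherits(r, "error"),
                    logical(1L))
    todo = which(failed)
    if (!length(todo)) break
    message("attempt ", attempt, ": folds still failing: ",
            paste(folds[todo], collapse = ", "))
  }
  if (length(todo)) warning("some folds still failing after ", max_tries, " tries")
  results
}

train_models_par = fit_with_retry(seq_len(4L), fit_fold)
```

Entries that inherit from "error" carry the R-level message via conditionMessage(). Entries that are NULL point to a worker that never returned, which tryCatch inside the task cannot see.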

Current information about the environment:

sessionInfo()
# R version 3.4.3 (2017-11-30)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 16.04.3 LTS
# 
# Matrix products: default
# BLAS: /usr/lib/libblas/libblas.so.3.6.0
# LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8      
#  [2] LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8       
#  [4] LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8   
#  [6] LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8      
#  [8] LC_NAME=C                 
#  [9] LC_ADDRESS=C              
# [10] LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8
# [12] LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] parallel  stats     graphics  grDevices utils    
# [6] datasets  methods   base     
# 
# other attached packages:
# [1] ggplot2_2.2.1     doMC_1.3.5       
# [3] iterators_1.0.8   glmnet_2.0-13    
# [5] foreach_1.4.3     Matrix_1.2-12    
# [7] data.table_1.10.5
# 
# loaded via a namespace (and not attached):
#  [1] Rcpp_0.12.14     lattice_0.20-35 
#  [3] codetools_0.2-15 plyr_1.8.3      
#  [5] grid_3.4.3       gtable_0.1.2    
#  [7] scales_0.5.0     rlang_0.1.4     
#  [9] lazyeval_0.2.1   tools_3.4.3     
# [11] munsell_0.4.2    yaml_2.1.13     
# [13] compiler_3.4.3   colorspace_1.2-4
# [15] tibble_1.3.4   

system('free -m')
#               total        used        free      shared  buff/cache   available
# Mem:          30147        1786       25087           1        3273       28059
# Swap:             0           0           0

detectCores()
# [1] 16

system('lscpu | grep "Model name"')
# Model name:            Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz

0 answers:

No answers yet