Question

我正在尝试使用mclapply并行交叉验证，以便为非常大的设计矩阵X（~10GB）和响应向量y建模拟合过程。我们要说X具有维度n-by-p：n=1000, p=1,000,000。由于X非常庞大，因此它作为big.matrix对象备份，存储在磁盘中，并使用R包bigmemory中的方法进行访问。

4倍交叉验证的工作流程如下。

设置长度为cv.ind的索引向量n，其存储1到4的数字序列，指示X中的哪个观察属于CV的哪个折叠。
设置4个核心。在第i个核心中，复制第i个CV CV的相应训练和测试子矩阵。
为每个核心的每个折叠拟合一个模型。
收集结果，返回。

交叉验证功能如下所示。

cv.ncvreg <- function(X, y, ncore, nfolds=5, seed, cv.dir = getwd(), 
                      cv.ind) {
  ## some more setup ...
  ## ...
  ## ...
  ## pass the descriptor info to each core ##
  xdesc <- describe(X)
  ## use mclapply instead of parLapply
  fold.results <- parallel::mclapply(X = 1:nfolds, FUN = cvf, XX=xdesc, y=y, 
                                     cv.dir = cv.dir, cv.ind=cv.ind, 
                                     cv.args=cv.args,
                                     mc.set.seed = seed, mc.silent = F, 
                                     mc.cores = ncore, mc.preschedule = F)

  ## return results
}

R函数cvf在每个核心中运行。它将第i个折叠的训练/测试矩阵复制为两个big.matrix个对象，适合模型，计算一些统计数据并返回结果。

cvf <- function(i, XX, y, cv.dir, cv.ind, cv.args) {
  ## argument 'XX' is the descriptor for big.matrix
  # reference to the big.matrix by descriptor info
  XX <- attach.big.matrix(XX)

  cat("CV fold #", i, "\t--Copy training-- Start time: ", format(Sys.time()), "\n\n")
  ## physically copy sub big.matrix for training
  idx.train <- which(cv.ind != i) ## get row idx for i-th fold training
  deepcopy(XX, rows = idx.train, type = 'double',
           backingfile = paste0('x.cv.train_', i, '.bin'),
           descriptorfile = paste0('x.cv.train_', i, '.desc'),
           backingpath = cv.dir)
  cv.args$X <- attach.big.matrix(paste0(cv.dir, 'x.cv.train_', i, '.desc'))
  cat("CV fold #", i, "\t--Copy training-- End time: ", format(Sys.time()), "\n\n")

  cat("CV fold #", i, "\t--Copy test-- Start time: ", format(Sys.time()), "\n\n")
  ## physically copy remaining part of big.matrix for testing
  idx.test <- which(cv.ind == i) ## get row idx for i-th fold testing
  deepcopy(XX, rows = idx.test, type = 'double',
           backingfile = paste0('x.cv.test_', i, '.bin'),
           descriptorfile = paste0('x.cv.test_', i, '.desc'),
           backingpath = cv.dir)
  X2 <- attach.big.matrix(paste0(cv.dir, 'x.cv.test_', i, '.desc'))
  cat("CV fold #", i, "\t--Copy test-- End time: ", format(Sys.time()), "\n\n")

  # cv.args$X <- XX[cv.ind!=i, , drop=FALSE]
  cv.args$y <- y[cv.ind!=i]
  cv.args$warn <- FALSE

  cat("CV fold #", i, "\t--Fit ncvreg-- Start time: ", format(Sys.time()), "\n\n")
  ## call 'ncvreg' function, fit penalized regression model
  fit.i <- ncvreg(X=cv.args$X, y=cv.args$y, family=cv.args$family, 
                  penalty = cv.args$penalty,lambda = cv.args$lambda, convex = cv.args$convex)

  # fit.i <- do.call("ncvreg", cv.args)
  cat("CV fold #", i, "\t--Fit ncvreg-- End time: ", format(Sys.time()), "\n\n")

  y2 <- y[cv.ind==i]
  yhat <- matrix(predict(fit.i, X2, type="response"), length(y2))

  loss <- loss.ncvreg(y2, yhat, fit.i$family)
  pe <- if (fit.i$family=="binomial") {(yhat < 0.5) == y2} else NULL
  list(loss=loss, pe=pe, nl=length(fit.i$lambda), yhat=yhat)
}

到目前为止，当设计矩阵X不是太大时代码工作得很好，比如n=1000, p=100,000，大小约为1GB。 但是，如果p=1,000,000因此X的大小变为~10GB，则每个核心中的模型拟合过程就像永远一样！（以下部分）：

#...
cat("CV fold #", i, "\t--Fit ncvreg-- Start time: ", format(Sys.time()), "\n\n")
## call 'ncvreg' function, fit penalized regression model
fit.i <- ncvreg(X=cv.args$X, y=cv.args$y, family=cv.args$family, 
                penalty = cv.args$penalty,lambda = cv.args$lambda, convex = cv.args$convex)
# fit.i <- do.call("ncvreg", cv.args)
cat("CV fold #", i, "\t--Fit ncvreg-- End time: ", format(Sys.time()), "\n\n")
#...

备注：

如果我跑'ncvreg（）＆＃39;在原始矩阵（10GB）上，需要约2.5分钟。
如果我使用for循环但不是mclapply顺序运行交叉验证，则代码运行良好，每个折叠的模型拟合＆ncvreg（）＆＃39;也可以正常工作（约2分钟），但整个过程大约需要25分钟。
我尝试了parLapply＆＃39;起初有同样的问题。我切换到了麦克莱普＆＃39;由于原因here。
每个核心中的数据复制步骤（即deepcopy部分）运行良好，需要约2分钟才能将培训和测试数据集复制并在磁盘上进行文件备份
我试图监控CPU使用情况，下面是一个截图。我们可以看到，在左图中，4个rsession中的每个占用了大约25％的CPU，而有一个进程， kernel_task 占用了大约100％的CPU。随着时间的推移， kernel_task 甚至可以占用150％的CPU。此外，CPU历史记录面板（右下角）显示大部分CPU使用率来自系统，而非用户，因为红色区域占据绿色区域。

我的问题：

为什么模型拟合过程在并行时会永远存在？可能的原因是什么？
我是否正确地将big.matrix的CV程序并行化？还有其他方法吗？

我感谢任何有助于解决我的问题的见解。在此先感谢!!

用于在R

0 个答案: