I am running a nested 3-level foreach loop, but I cannot keep the code from occupying 100% of a remote server (Linux, CentOS, 14 physical cores, 56 logical cores). The framework I am using is:
library(doParallel)
doParallel::registerDoParallel(20)
outRes <- foreach(i = seq1, ...) %:%          # foreach 1
  foreach(j = seq2, ...) %dopar% {            # foreach 2
    innerRes <- foreach(k = seq3, ...)        # foreach 3
  }
I have three questions.
PS: A reproducible code example is attached below.
library(mlbench)
data("Sonar")
str(Sonar)
table(Sonar$Class)
seed <- 1234
# for cross validation
number_outCV <- 10
repeats_outCV <- 10
number_innerCV <- 10
repeats_innerCV <- 10
# list of numbers of features to model
featureSeq <- c(10, 30, 50)
# for LASSO training
lambda <- exp(seq(-7, 0, 1))
alpha <- 1
dataList <- list(data1 = Sonar, data2 = Sonar, data3 = Sonar, data4 = Sonar, data5 = Sonar, data6 = Sonar)
# library(doMC)
# doMC::registerDoMC(cores = 20)
library(caret)       # needed for trainControl(), train() and twoClassSummary() below
library(doParallel)
doParallel::registerDoParallel(20)
nestedCV <- foreach::foreach(clust = 1:length(dataList), .combine = "c", .verbose = TRUE) %:%
foreach::foreach(outCV = 1:(number_outCV*repeats_outCV), .combine = "c", .verbose = TRUE) %dopar% {
# prepare data
dataset <- dataList[[clust]]
table(dataset$Class)
# split data into model developing and testing data in the outCV: repeated 10-fold CV
set.seed(seed)
ResampIndex <- caret::createMultiFolds(y = dataset$Class, k = number_outCV, times = repeats_outCV)
developIndex <- ResampIndex[[outCV]]
developX <- dataset[developIndex, !colnames(dataset) %in% c("Class")]
developY <- dataset$Class[developIndex]
testX <- dataset[-developIndex, !colnames(dataset) %in% c("Class")]
testY <- dataset$Class[-developIndex]
# get a pool of all the features
features_all <- colnames(developX)
# training model with inner repeated 10-fold CV
# foreach for nfeature search
nfeatureRes <- foreach::foreach(featNumIndex = seq(along = featureSeq), .combine = "c", .verbose = TRUE) %dopar% {
nfeature <- featureSeq[featNumIndex]
selectedFeatures <- features_all[1:nfeature]
# train LASSO
lassoCtrl <- trainControl(method = "repeatedcv",
                          number = number_innerCV,
                          repeats = repeats_innerCV,
                          verboseIter = TRUE, returnResamp = "all", savePredictions = "all",
                          classProbs = TRUE, summaryFunction = twoClassSummary)
lassofit.cv <- train(x = developX[, selectedFeatures],
                     y = developY,
                     method = "glmnet",
                     metric = "ROC",
                     trControl = lassoCtrl,
                     tuneGrid = expand.grid(lambda = lambda, alpha = alpha),
                     preProcess = c("center", "scale"))
AUC.test <- pROC::auc(response = testY, predictor = predict(lassofit.cv, newdata = testX[, selectedFeatures], type = "prob")[[2]])
performance <- data.frame(Class = clust, outCV = outCV, nfeature = nfeature, AUC.cv = max(lassofit.cv$results$ROC), AUC.test = as.numeric(AUC.test))
}
# end of nfeature search foreach loop
nfeatureRes
}
# end of outCV foreach loop as well as the dataList foreach loop
foreach::registerDoSEQ()
Answer 0 (score: 1)
If you want to make sure the code uses only a certain number of cores, you can pin the process to specific cores. This is called "CPU affinity", and in R you can set it with parallel::mcaffinity, for example:
parallel::mcaffinity(1:20)
allows your R process to use only the first 20 cores. This works regardless of which other libraries the process uses, because it relies on operating-system-level resource control (a few rare libraries spawn or communicate with other processes, but your code does not appear to use anything like that).
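A minimal sketch of how that call could be combined with the setup from the question (the range 1:20 is only an example, and mcaffinity is only available where the OS supports CPU affinity, e.g. Linux):

parallel::mcaffinity(1:20)          # pin this R process (and workers forked later) to cores 1-20
library(doParallel)
doParallel::registerDoParallel(20)  # register no more workers than the cores allowed above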
%:% is the correct way to nest foreach loops - the foreach package takes both the inner and the outer loop into account in its scheduling, and executes only as many inner bodies at a time as were registered with registerDoParallel, whether or not they come from the same outer-loop iteration. The wrong way would be, for example, foreach(...) %dopar% { foreach(...) %dopar% { ... } } - this spawns the square of the registerDoParallel count in computations at once (so 400 in your case). foreach(...) %do% { foreach(...) %dopar% { ... } } (or the other way around) would be better, but still suboptimal. For details, see foreach's nesting vignette.
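If the foreach package is installed, that vignette can be opened directly from R (assuming it is still shipped under the name "nested"):

vignette("nested", package = "foreach")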
In your case, it would be best to keep the two outer loops as they are now (%:% and %dopar%) and change the inner loop to %do%. The two outer loops combined still have more than enough iterations to keep 20 cores busy, and the general rule is that, when possible, it is better to parallelize the outer loop than the inner one.
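Applied to the reproducible example above, that suggestion would look roughly like the skeleton below (only the loop structure is shown; the bodies stay exactly as in the question):

library(doParallel)
doParallel::registerDoParallel(20)

nestedCV <- foreach::foreach(clust = 1:length(dataList), .combine = "c") %:%
  foreach::foreach(outCV = 1:(number_outCV * repeats_outCV), .combine = "c") %dopar% {
    # ... prepare data and the outer CV split as in the question ...
    nfeatureRes <- foreach::foreach(featNumIndex = seq(along = featureSeq), .combine = "c") %do% {
      # ... LASSO training as in the question ...
    }
    nfeatureRes
  }

foreach::registerDoSEQ()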
Answer 1 (score: 0)
I don't know whether this will work, but perhaps you could try lowering the job's priority on the server by running it with the "nice" command (that way, even if it uses 100% of the CPU, it only does so when the CPUs would otherwise be idle)?
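For completeness, a minimal sketch of the same idea from inside R rather than from the shell, assuming a Linux host: tools::psnice() changes the niceness of the current R process, and workers forked afterwards inherit it.

library(tools)
psnice(value = 19)  # 19 = lowest priority on Linux; call psnice() with no value to read the current setting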
Answer 2 (score: 0)
Through many experiments, my guess is that this is how foreach() forks workers:
If nested foreach is used (e.g. foreach() %:% foreach() %dopar% {}), the number of forked workers (logical CPU cores with shared memory) will be the number of cores registered before foreach() multiplied by the number of chained foreach() calls. E.g.:
registerDoMC(cores = 10)
foreach() %:% foreach() %:% foreach() %dopar% {} # 10 x 3 = 30 workers will finally be forked in this example.
If a foreach() is nested inside another foreach() without %:%, the forked workers (logical CPU cores) will be the cores registered for the %:% part multiplied by the independently nested part. E.g.:
registerDoMC(cores = 10)
foreach() %:% foreach() %dopar% { foreach() } # (10 + 10) x 10 = 200 workers will finally be forked.
Any corrections are welcome if this is wrong.
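One way to test such guesses empirically is to record which process runs each inner body. This sketch (a doParallel backend and arbitrary iteration counts are assumed) counts the distinct worker PIDs that were actually used:

library(doParallel)
doParallel::registerDoParallel(10)

pids <- foreach(i = 1:5, .combine = "c") %:%
  foreach(j = 1:20, .combine = "c") %dopar% {
    Sys.getpid()  # PID of the worker process that executed this inner body
  }

length(unique(pids))        # distinct worker processes actually used
foreach::getDoParWorkers()  # number of workers the registered backend reports
registerDoSEQ()             # switch back to sequential execution afterwards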