Question

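I am trying to process a large amount of data in R on Windows, using the parallel package on a computer with 8 cores. I have a large data.frame that I need to process row by row. For each row I can estimate how long the processing will take, and this varies wildly, from 10 seconds to 4 hours per row.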

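I don't want to run the whole program in a single clusterApplyLB call (I know this is probably the optimal method), because if it hits an error my entire result set might be lost. My first attempt was to break the data into blocks, run each block in parallel on its own, save the output of that parallel run, and then move on to the next block.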

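The problem is that, as it works through the rows, it does not run at 7x "real" time (I have 8 cores, but I want to keep one spare); it only seems to run at about 2x. My guess is that this is because the rows are allocated to the cores inefficiently.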

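For example, take ten rows of data on 2 cores, where two of the rows take 4 hours each and another two take 10 seconds each. In theory this could finish in 4 hours and 10 seconds, but with an inefficient allocation it could take 8 hours. (Obviously this is an exaggeration, but something similar can happen when the estimates are off and there are more cores and more rows.)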
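The arithmetic in this example can be checked with a few lines of R. This is only an illustrative sketch: it uses just the four rows whose run times are spelled out, and the names `job_seconds`, `makespan`, `good_split` and `bad_split` are made up for the illustration.

```r
# Hypothetical job durations in seconds: two 4-hour rows and two 10-second rows
job_seconds <- c(4 * 3600, 4 * 3600, 10, 10)

# Makespan = total time of the busiest core for a given job-to-core assignment
makespan <- function(durations, core) {
  max(tapply(durations, core, sum))
}

good_split <- c(1, 2, 1, 2)  # one long row and one short row per core
bad_split  <- c(1, 1, 2, 2)  # both long rows on core 1

makespan(job_seconds, good_split)  # 14410 seconds = 4 h 10 s
makespan(job_seconds, bad_split)   # 28800 seconds = 8 h
```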

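If I estimate these times and submit the rows to clusterApplyLB in what I believe is the correct order (so that the estimated times are spread across the cores and the total time is minimized), the rows might still not be sent to the cores I want, because they might not finish in the time I estimated. For example, if I estimate two processes at 10 minutes and 12 minutes and they actually take 11.6 minutes and 11.4 seconds, then the order in which the following rows are handed out by clusterApplyLB will not be what I expected. This kind of error may look small, but if I have optimized the placement of several long-running rows, such a mix-up could send two 4-hour rows to the same node instead of to different nodes (which could almost double my total time).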

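TL;DR. My question: is there a way to tell an R parallel processing function (e.g. clusterApplyLB, clusterApply, parApply, or any sapply, lapply or foreach variant) which rows should be sent to which core/node? Even apart from the situation I find myself in, I think this would be a very useful and interesting thing to have information on.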

Answer 1

I would say there are two different approaches to solving your problem.

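The first is a static optimization of the job-to-node mapping based on the expected computation time of each job. Before starting the calculation, you assign a node to each job (i.e., each row of your data frame). Code for a possible implementation is given below.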

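The second approach is dynamic: you would have to write your own load balancer based on the code of clusterApplyLB. You would start out as in the first approach, but as soon as a job finishes you would have to recompute the optimal job-to-node mapping. Depending on your problem, this constant re-optimization may add significant overhead. I think that as long as your expected computation times are not biased, there is no need to go this way.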
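To get a feel for whether a dynamic scheme is worth the effort, the simplest form of dynamic dispatch (each queued job goes to whichever worker becomes idle first, with no re-optimization) can be simulated without starting a cluster. The sketch below is only a simulation under made-up timings; `simulate_dynamic` is a hypothetical helper, not the code inside clusterApplyLB.

```r
# Simulate dynamic dispatch: each job, taken in queue order, goes to the
# worker that becomes idle first. Returns the time at which the last worker finishes.
simulate_dynamic <- function(real_length, n_workers) {
  free_at <- numeric(n_workers)        # time at which each worker is next idle
  for (len in real_length) {
    w <- which.min(free_at)            # first worker to become idle
    free_at[w] <- free_at[w] + len
  }
  max(free_at)
}

set.seed(42)
expected <- runif(100)                                # hypothetical estimates
real <- pmax(expected + runif(100, -0.05, 0.05), 0)   # hypothetical actual times

# Queue ordered by estimate, longest first, versus the original row order
longest_first <- order(expected, decreasing = TRUE)
simulate_dynamic(real[longest_first], n_workers = 4)
simulate_dynamic(real, n_workers = 4)
sum(real) / 4                          # lower bound: a perfectly even split
```

Comparing the simulated finish times with the lower bound shows how much a more elaborate balancer could still gain for a given noise level in the estimates.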

Here is the code for the first approach:

library(parallel)
#set seed for reproducible example
set.seed(1234)
#let's say you have 100 calculations (i.e., rows)
#each of them takes between 0 and 1 second computation time
expected_job_length=runif(100)
#this is your data
#real_job_length is unknown but we use it in the mock-up function below
df=data.frame(job_id=seq_along(expected_job_length),
              expected_job_length=expected_job_length,
              #real_job_length=expected_job_length + some noise
              real_job_length=expected_job_length+
                runif(length(expected_job_length),-0.05,0.05))
#we might have a negative real_job_length; fix that
df=within(df,real_job_length[real_job_length<0]<-
            real_job_length[real_job_length<0]+0.05)
#detectCores() gives in my case 4
cluster_size=4

Prepare the optimization of the job-to-node mapping:

#x will give the node_id (between 1 and cluster_size) for each job
total_time=function(x,expected_job_length) {
  #in the calculation below, x will be a vector of reals
  #we have to translate it into integers in order to use it as index vector
  x=as.integer(round(x))
  #return max of sum of node-binned expected job lengths
  max(sapply(split(expected_job_length,x),sum))
}

#now optimize the distribution of jobs amongst the nodes
#Genetic algorithm might be better for the optimization
#but Differential Evolution is good for now
library(DEoptim)
#pick large differential weighting factor (F) ...
#... to get out of local minima due to rounding
res=DEoptim(fn=total_time,
            lower=rep(1,nrow(df)),
            upper=rep(cluster_size,nrow(df)),
            expected_job_length=expected_job_length,
            control=DEoptim.control(CR=0.85,F=1.5,trace=FALSE))
#wait for a minute or two ...
#inspect optimal solution
time_per_node=sapply(split(expected_job_length,
                           unname(round(res$optim$bestmem))),sum)
time_per_node
#       1        2        3        4 
#10.91765 10.94893 10.94069 10.94246
plot(time_per_node,ylim=c(0,15))
abline(h=max(time_per_node),lty=2)

#add node-mapping to df
df$node_id=unname(round(res$optim$bestmem))
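If waiting a minute or two for the optimizer is inconvenient, a greedy "longest processing time first" heuristic is a common cheaper alternative for this kind of static mapping. The sketch below is a different technique from the DEoptim approach above, not part of the original answer; `lpt_assign` is a hypothetical helper name.

```r
# Greedy "longest processing time first" assignment (a heuristic, not DEoptim):
# take jobs from longest to shortest and give each to the least-loaded node.
lpt_assign <- function(expected_job_length, cluster_size) {
  node_load <- numeric(cluster_size)               # running total per node
  node_id <- integer(length(expected_job_length))
  for (job in order(expected_job_length, decreasing = TRUE)) {
    node <- which.min(node_load)                   # least-loaded node so far
    node_id[job] <- node
    node_load[node] <- node_load[node] + expected_job_length[job]
  }
  node_id
}

set.seed(1234)
expected_job_length <- runif(100)
node_id <- lpt_assign(expected_job_length, cluster_size = 4)
sapply(split(expected_job_length, node_id), sum)   # expected time per node
```

The resulting vector can be used in place of `df$node_id`. The heuristic does not guarantee the optimal mapping, but the gap between the busiest and the idlest node is never larger than the longest single job.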

Now it is time to run the computation on the cluster:

#start cluster
workers=parallel::makeCluster(cluster_size)

start_time=Sys.time()
#distribute jobs according to optimal node-mapping
clusterApply(workers,split(df,df$node_id),function(x) {
  for (i in seq_along(x$job_id)) {
    #use tryCatch to do the error handling for jobs that fail
    tryCatch({Sys.sleep(x[i,"real_job_length"])},
             error=function(err) {print("Do your error handling")})
  }
})
end_time=Sys.time()

#how long did it take
end_time-start_time
#Time difference of 11.12532 secs

#add to plot
abline(h=as.numeric(end_time-start_time),col="red",lty=2)

stopCluster(workers)

Answer 2

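From your description, it seems you are already saving the output of each task within that task. Assuming each parallel task saves its output to a file, you probably need an initial function that predicts the time for a particular row. To do that: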

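Generate a structure holding the estimated time and the row number.
Sort by estimated time, reorder the rows accordingly, and run the parallel process on each reordered row.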

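This balances the workload automatically. We had a similar problem in which the processing had to be done column by column, with each column taking 10-200 seconds. So we wrote a function to estimate the time, reordered the columns based on it, and ran the parallel process for each column.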
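A minimal sketch of this approach, under several assumptions: the estimates in `estimate_seconds` are made up, `process_row` is a hypothetical stand-in for the real per-row work, and rows are sorted longest first (the answer does not say which direction to sort). Each task writes its own result file, so an error in one row does not lose the others.

```r
library(parallel)

# Hypothetical per-row estimates in seconds
estimate_seconds <- c(0.05, 0.40, 0.10, 0.30, 0.20, 0.15)
out_dir <- tempfile("row_results_")
dir.create(out_dir)

# Hypothetical worker function: does the work for one row and saves the result
process_row <- function(row_id, estimate_seconds, out_dir) {
  result <- tryCatch({
    Sys.sleep(estimate_seconds[row_id])   # placeholder for the real computation
    list(row_id = row_id, status = "ok")
  }, error = function(err) {
    list(row_id = row_id, status = conditionMessage(err))
  })
  # Save as soon as the row is done, so a later failure loses nothing
  saveRDS(result, file.path(out_dir, sprintf("row_%03d.rds", row_id)))
  row_id
}

# Longest estimated rows first (the sort direction is an assumption)
run_order <- order(estimate_seconds, decreasing = TRUE)

workers <- makeCluster(2)
done <- clusterApplyLB(workers, run_order, process_row,
                       estimate_seconds = estimate_seconds, out_dir = out_dir)
stopCluster(workers)

length(list.files(out_dir))   # one result file per row
```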

R parallel processing - node choice
