Question

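I'm fairly new to parallel R. Quick question: I have a computationally intensive algorithm. Fortunately, it can easily be broken into pieces so that it can use multicore or snow. What I would like to know is whether it is considered acceptable in practice to use multicore in conjunction with snow.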

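What I want to do is split my load so that it runs on multiple machines in a cluster, and on each machine I want to use all of the cores. For this type of processing, is it reasonable to mix snow with multicore?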

Answer 1

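I have used the approach suggested above by lockoff, that is, using the parallel package to distribute an embarrassingly parallel workload over multiple machines that each have multiple cores. First the workload is distributed over all machines, and then each machine's share is distributed over all of its cores. The disadvantage of this approach is that there is no load balancing between machines (at least I don't know how to get it).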
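
On the load-balancing point, one option that might help is `clusterApplyLB` from the parallel package, which hands the next task to whichever worker becomes free first. The following is only a minimal sketch under assumptions: `task_cost` and `slow_task` are hypothetical stand-ins for the real work, and two local workers stand in for two machines. It only balances anything if there are more tasks than workers, so the workload would need to be cut into smaller batches than one per machine.

```r
library(parallel)

# Hypothetical task list: one expensive task followed by several cheap ones
# (values are sleep times in seconds).
task_cost <- c(0.3, 0.05, 0.05, 0.05, 0.05, 0.05)

# Hypothetical stand-in for the real per-task work;
# returns the process id of the worker that ran it.
slow_task <- function(cost) {
    Sys.sleep(cost)
    Sys.getpid()
}

# Two local workers stand in for two machines.
cluster <- makeCluster(2)

# clusterApplyLB sends the next task to whichever worker is free first,
# instead of fixing the assignment of tasks to workers up front.
worker_ids <- unlist(clusterApplyLB(cluster, task_cost, slow_task))

stopCluster(cluster)

# Number of tasks each worker process ended up handling.
tasks_per_worker <- table(worker_ids)
```

Printing `tasks_per_worker` shows how the tasks were shared out; the exact split depends on timing, so it can differ between runs.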

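All R code that gets loaded should be identical and stored at the same location on all machines (svn). Because initializing a cluster takes quite some time, the code below (distributedFooOnCores and distributedFooOnMachines) could be improved by reusing clusters once they have been created.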
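
As a small aside on the reuse idea, here is a minimal sketch, not the code from this answer: the cluster is created once, passed to a helper for every pass, and stopped only at the very end. `run_pass` and the squaring task are hypothetical, and two local workers are assumed in place of real hosts.

```r
library(parallel)

# Hypothetical helper: runs one pass of work on an already-created cluster.
run_pass <- function(cluster, values) {
    unlist(parLapply(cluster, values, function(x) x^2))
}

# Create the cluster once (two local workers as a stand-in for real hosts).
cluster <- makeCluster(2)

# Reuse the same workers for several passes instead of restarting them.
first_pass  <- run_pass(cluster, 1:4)
second_pass <- run_pass(cluster, 5:8)

# Shut the workers down only when all work is finished.
stopCluster(cluster)
```

With a long-lived cluster, shared code could also be loaded once per worker with `clusterEvalQ` rather than being sourced on every call.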

foo <- function(workload, otherArgumentsForFoo) {
    source("/home/user/workspace/mycode.R")
    ...
}

distributedFooOnCores <- function(workload, otherArgumentsForFoo) {
    # Placeholder: assign a batch number to every record
    # (NA here; replace with a real assignment before use)
    workload$ParBatchNumber = NA
    # Split the assigned workload into batches according to ParBatchNumber
    batches = by(workload, workload$ParBatchNumber, function(x) x)

    # Create a cluster with one worker per core on this machine
    library("parallel")
    cluster = makeCluster(detectCores(), outfile="distributedFooOnCores.log")
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)

    # Merge the resulting batches (assumes foo returns data frames
    # that all have the same columns)
    results = do.call(rbind, batches)

    # Clean up
    workload$ParBatchNumber = NULL
    return(invisible(results))
}

distributedFooOnMachines <- function(workload, otherArgumentsForFoo) {
    # Placeholder: assign a batch number to every record
    # (NA here; replace with a real assignment before use)
    workload$DistrBatchNumber = NA
    # Split the assigned workload into batches according to DistrBatchNumber
    batches = by(workload, workload$DistrBatchNumber, function(x) x)

    # Create a cluster with workers on all machines
    library("parallel")
    # If makeCluster hangs, please make sure passwordless ssh is configured on all machines
    cluster = makeCluster(c("machine1", "etc"), master="ub2", user="", outfile="distributedFooOnMachines.log")
    # NOTE: foo is applied to each machine-level batch as written. To also spread
    # a machine's batch over its cores, the function applied here would need to be
    # distributedFooOnCores, which must then be defined on every worker.
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)

    # Merge the resulting batches (assumes foo returns data frames
    # that all have the same columns)
    results = do.call(rbind, batches)

    # Clean up
    workload$DistrBatchNumber = NULL
    return(invisible(results))
}
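
The "somehow assign a batch number" step is left open above. One simple possibility, shown here only as a sketch under assumptions, is a round-robin assignment followed by `split()` and a single `rbind` at the end. The `workload` data frame, the `BatchNumber` column, and the doubling step are all made up for illustration, and a plain `lapply` stands in for the parallel call.

```r
# Hypothetical workload: ten records with one numeric column.
workload <- data.frame(id = 1:10, value = seq(10, 100, by = 10))

# Assign batch numbers 1..n_batches round-robin so batches have similar sizes.
n_batches <- 3
workload$BatchNumber <- rep_len(seq_len(n_batches), nrow(workload))

# split() returns a plain list with one data frame per batch.
batches <- split(workload, workload$BatchNumber)

# Stand-in for the parallel step: process every batch independently.
processed <- lapply(batches, function(batch) {
    batch$value <- batch$value * 2
    batch
})

# Merge all batches in one call instead of filling a data frame row by row,
# then restore the original record order.
results <- do.call(rbind, processed)
results <- results[order(results$id), ]
rownames(results) <- NULL
```

Round-robin assignment only evens out batch sizes; if records differ a lot in cost, a different assignment would be needed.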

I would be interested to hear how the approach above could be improved.
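
For reference, the two-level idea (machines first, then cores) can be sketched in a compact form as follows. This is an illustration under assumptions, not the code above: `square` and `process_on_cores` are hypothetical names, two local workers stand in for two hosts, and the inner level uses `mclapply`, which forks and therefore only uses more than one core on Unix-alikes.

```r
library(parallel)

# Hypothetical per-record work.
square <- function(x) x^2

# Runs on each "machine": spreads its share of the work over local cores.
# mclapply forks, so more than one core is only used on Unix-alikes.
process_on_cores <- function(chunk, task, cores) {
    unlist(parallel::mclapply(chunk, task, mc.cores = cores))
}

# Fall back to one core per worker on Windows, where forking is unavailable.
cores_per_machine <- if (.Platform$OS.type == "windows") 1L else 2L

# One chunk per machine; two local workers stand in for two hosts.
chunks <- list(1:5, 6:10)
cluster <- makeCluster(2)

# Outer level: one chunk per worker. Inner level: each worker forks over its cores.
per_machine <- parLapply(cluster, chunks, process_on_cores,
                         task = square, cores = cores_per_machine)
stopCluster(cluster)

results <- unlist(per_machine)
```

On a real cluster, the `makeCluster(2)` call would be replaced by a call listing the host names, and `cores_per_machine` could come from `detectCores()` evaluated on each worker.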
