Question

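I am running a function similar to finding the standard deviation, but one that takes far longer to run.

I intend to use the function to calculate cumulative values, i.e. the standard-deviation-type function over days 1 to n.

However, because the computation takes a long time, I would like to run it on a cluster.

So I would like to split the data so that each node of the cluster finishes at roughly the same time. For example, if my function were as follows, the single-machine approach would work like this: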

library(xts)
vec <- xts(rnorm(1000), Sys.Date() - (1:1000))
lapply(1:length(vec), function(x) {
    Sys.sleep(30)
    sd(as.numeric(vec[1:x]))
})

(N.B. Sys.sleep is added there to represent the extra time taken to process my custom function.)

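However, suppose I wanted to split this across two machines rather than one. How would I split the vector 1:length(vec) so that I could give c(1:y) to machine 1 and c((y+1):length(vec)) to machine 2, with both machines finishing at about the same time? That is, what value of y makes the two processes complete at roughly the same time? And what if we did it over 10 machines: how would one find the breakpoints in the original vector for that to work?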

Answer 1

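The parallel package is now part of base R and can help run R on moderately sized clusters, including on Amazon EC2. The function parLapplyLB distributes work from an input vector over the worker nodes of a cluster.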
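Applied to the cumulative standard deviation from the question, a minimal sketch might look like the following. It assumes a small plain numeric vector in place of the xts series and two local workers, both chosen only for illustration; the names vec and cum_sd are just placeholders.

```r
library(parallel)

set.seed(1)
vec <- rnorm(50)            # small stand-in for the question's series (assumption)

cl <- makePSOCKcluster(2)   # two local workers, for illustration only

# Send the data to every worker once
clusterExport(cl, "vec", envir = environment())

# Task x computes the standard deviation of the first x values
res <- parLapplyLB(cl, seq_along(vec), function(x) sd(vec[1:x]))

stopCluster(cl)

cum_sd <- unlist(res)       # cum_sd[1] is NA: sd of a single value is undefined
```

With this approach there is no need to choose the breakpoint y by hand, because the input vector is distributed across the workers for you.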

One thing to be aware of is that makePSOCKcluster is (currently, as of R 2.15.2) limited to 128 workers by the NCONNECTIONS constant in connections.c.

Here is a quick example of a session using the parallel package that you can try on your own machine:

library(parallel)
help(package=parallel)

## create the cluster passing an IP address for
## the head node
## hostname -i works on Linux, but not on BSD
## descendants (like OS X)
# cl <- makePSOCKcluster(hosts, master=system("hostname -i", intern=TRUE))

## for testing, start a cluster on your local machine
cl <- makePSOCKcluster(rep("localhost", 3))

## do something once on each worker
ans <- clusterEvalQ(cl, { mean(rnorm(1000)) })

## push data to the workers
myBigData <- rnorm(10000)
moreData <- c("foo", "bar", "blabber")
clusterExport(cl, c('myBigData', 'moreData'))

## test a time consuming job
## (~30 seconds on a 4 core machine)
system.time(ans <- parLapplyLB(cl, 1:100, function(i) {
  ## summarize a bunch of random sample means
  summary(
    sapply(1:runif(1, 100, 2000),
           function(j) { mean(rnorm(10000)) }))
}))

## shut down worker processes
stopCluster(cl)

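The Bioconductor group has set up a really easy way to get started: Using a parallel cluster in the cloud

For more about using the parallel package on EC2, see: R in the Cloud. For R on clusters in general, see: CRAN Task View: High-Performance and Parallel Computing with R.

Finally, another well-established option external to R is Starcluster.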

Answer 2

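Look at the snow package, specifically the clusterApplyLB function, which handles a load-balanced apply.

That will actually handle the distribution of work to nodes/cores more intelligently than just an even partition.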
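For contrast, a static split has to guess the cost of each task in advance. The sketch below is only an illustration of that idea: balanced_breaks is a hypothetical helper, not part of snow or parallel, and the per-task cost vectors are assumptions. If every task costs the same (as with the fixed Sys.sleep in the question), the even split is already balanced; if the cost of task x grows in proportion to x, the breakpoint moves well past the midpoint.

```r
# Hypothetical helper: split tasks 1..n into k contiguous blocks whose
# total estimated cost is roughly equal. Returns the last index of each
# of the first k - 1 blocks.
balanced_breaks <- function(cost, k) {
  cum <- cumsum(cost)
  targets <- cum[length(cum)] * seq_len(k - 1) / k
  # first task index at which the running cost reaches each target
  vapply(targets, function(t) which(cum >= t)[1], integer(1))
}

n <- 1000

# Constant cost per task: the even split is balanced
balanced_breaks(rep(1, n), 2)      # 500

# Cost proportional to x (an assumption about the custom function)
balanced_breaks(seq_len(n), 2)     # 707, roughly n / sqrt(2)

# The same assumed cost model over 10 machines gives 9 breakpoints
balanced_breaks(seq_len(n), 10)
```

A load-balanced apply avoids needing such a cost model at all, since each worker simply takes the next task when it becomes free.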

Answer 3

Consider using Hadoop (aka MapReduce) via RHIPE.

Load balancing for parallel processing
