Question

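I wrote a program that uses mvtnorm to generate a large amount of random multivariate normally distributed data (25 x 30 x 10,000,000) and then performs some simple calculations and matrix manipulation.

I use the foreach and doParallel packages to run the operations in parallel and cut the run time. A completely arbitrary example, just to demonstrate the packages: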

foreach (x = matr) %dopar% {
  x[time_horizon + 1] <- x[time_horizon]
  x <- cbind(100, x)
  for (m in 2:(time_horizon + 1)) {
    # running sum: add the previous column to the current one
    x[, m] <- x[, m - 1] + x[, m]
  }
  return(x)
}

I create an implicit cluster of cores to run these foreach loops:

registerDoParallel(4)

Problem

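When I run on multiple cores, memory use seems to grow or be duplicated, judging by the performance figures in Task Manager (2 cores use more memory than 1 core, and 4 cores use more than 2).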
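One way to check this observation is sketched below. It assumes socket-based workers, which is what an implicit cluster amounts to on Windows; the 2-worker cluster and the vector `x` are illustrative and not part of the program above. Each worker is a separate R process, so any object sent to it is stored again in that process.

```r
library(parallel)

# Start two socket workers; each one is its own R process.
cl <- makeCluster(2)

# Process IDs of the workers, which differ from the master's PID.
worker_pids <- unlist(clusterCall(cl, Sys.getpid))

# About 0.8 MB of doubles in the master process (illustrative data).
x <- runif(1e5)

# Exporting serialises x and sends a full copy to every worker.
clusterExport(cl, "x", envir = environment())

# Each worker reports the size of its own copy, in bytes.
worker_sizes <- unlist(clusterEvalQ(cl, as.numeric(object.size(x))))

stopCluster(cl)

cat("master PID: ", Sys.getpid(), "\n")
cat("worker PIDs:", worker_pids, "\n")
cat("bytes held per worker:", worker_sizes, "\n")
```

If the worker sizes match the size of `x` in the master, total memory for that object scales with the number of workers plus one.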

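When I run the program at (25 x 30 x 1,000,000), running in parallel helps execution speed (4 cores are faster than 1 core). However, when I run it at (25 x 30 x 2,500,000) or more, it uses too much memory, which seems to slow it down.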

A friend said it might be page faults: once I run out of RAM, the system has to go to the hard disk.
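For a sense of scale, here is a rough size estimate using the dimensions quoted above. It assumes 8 bytes per value (R's double type) and ignores temporary copies made during the computation, so the real footprint is likely higher.

```r
# Sizes taken from the question; 8 bytes per value assumes R doubles.
assets <- 25
time_horizon <- 30
paths <- 2500000
bytes_per_double <- 8

# Every draw held in memory at once.
total_bytes <- assets * time_horizon * paths * bytes_per_double

# One day's draw: a paths x assets matrix.
per_day_bytes <- paths * assets * bytes_per_double

# Lower bound when several workers each hold one day's draw at the same time.
workers <- 4  # assumed, matching registerDoParallel(4)
concurrent_bytes <- workers * per_day_bytes

to_gib <- function(bytes) bytes / 2^30
cat(sprintf("all draws:        %.2f GiB\n", to_gib(total_bytes)))
cat(sprintf("one day's draw:   %.2f GiB\n", to_gib(per_day_bytes)))
cat(sprintf("4 workers at once: %.2f GiB\n", to_gib(concurrent_bytes)))
```

Comparing these figures with the installed RAM gives a quick indication of whether paging is plausible.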

Why is memory duplicated across cores? Is this supposed to happen? Can I prevent it? Are there other solutions?

Edit (full code):

library(mvtnorm)
library(foreach)
library(doParallel)
library(ggplot2)
library(reshape2)
library(plyr)

# Calculate the number of cores
no_cores <- detectCores()

# Create an implicit cluster and regular cluster
registerDoParallel(no_cores)

daily_pnl <- function() {
  time_horizon <- 30
  paths <- 2500000
  asset <- 25
  path_split <- 100

  corr_mat <- diag(asset)
  expected_returns <- runif(asset,0.0, 0.25)

  # Create a list of vectors to store pnl information for each asset

  foreach(icount(time_horizon), .packages = "mvtnorm") %dopar% {
    average_matrix <- matrix(, (paths/path_split), asset)
    split_start <- 1
    my_day <- rmvnorm(paths, expected_returns, corr_mat, method="chol")
    for (n in 1:(paths/path_split)) {
      average_matrix[n,] <- colMeans(my_day[split_start:(split_start + path_split - 1),])
      split_start <- split_start + path_split
    }
    return(average_matrix)
  }
}

matrix_splitter <- function(matr) {
  time_horizon <- 30
  paths <- 2500000
  path_split <- 100
  asset <- 25

  alply(array(unlist(matr), c(paths/path_split,time_horizon,asset)),3)
}

cum_returns <- function(matr) {
  time_horizon <- 30
  paths <- 2500000
  asset <- 25

  foreach (x = matr) %dopar% {
    x[time_horizon + 1] <- x[time_horizon]
    x <- cbind(100,x)
    for (m in 2:(time_horizon + 1)) {
      # running sum: add the previous column to the current one
      x[,m] <- x[,m-1] + x[,m]
    }
    return(x)
  }
}

plotting <- function(path_matr) {
  security_paths <- as.data.frame(t(path_matr))
  security_paths$id <- 1:nrow(security_paths)
  plot_paths <- melt(security_paths, id.var="id")

  ggplot(plot_paths, aes(x=id, y=value,group=variable,colour=variable)) +
    geom_line(aes(lty=variable))

}

system.time(daily <- daily_pnl())
system.time(daily_by_security <- matrix_splitter(daily))
rm(daily)
gc()
system.time(security_paths <- cum_returns(daily_by_security))
rm(daily_by_security)
gc()

plot_list <- foreach(x = security_paths, .packages = c("reshape2", "ggplot2")) %dopar% {
  if (nrow(x) > 100) {
    plotting(head(x,100))
  } else {
    plotting(x)
  }
}

#Stop implicit cluster and regular cluster
stopImplicitCluster()

gc()

Answer 1

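This seems to be a very old question, but I had a similar problem. I didn't need to parallelize the computation; what I actually needed was to parallelize the memory (if such a thing exists).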

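What worked for me was parallel processing on Azure. Instead of registering the local system's cores, register cores from the cloud with registerDoAzureParallel(cluster).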

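Your JSON file defines the size (memory) of the machines you rent for the job. Make sure each worker has enough memory to hold a copy of your R environment. This may saturate your network, because you will be sending data from your machine to 30-40 workers (depending on how many you request).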
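A minimal sketch for sizing the workers, using base R only. It assumes the worst case in which every object in the environment is copied to every worker; `env_object_sizes`, `demo_env` and the worker count of 30 are hypothetical names and values chosen for illustration.

```r
# Report the size in bytes of every object in an environment, largest first.
env_object_sizes <- function(env = globalenv()) {
  object_names <- ls(envir = env)
  sizes <- vapply(
    object_names,
    function(name) as.numeric(object.size(get(name, envir = env))),
    numeric(1)
  )
  sort(sizes, decreasing = TRUE)
}

# Small demonstration environment standing in for a real session.
demo_env <- new.env()
assign("big_matrix", matrix(0, nrow = 1000, ncol = 100), envir = demo_env)
assign("small_vector", 1:10, envir = demo_env)

sizes <- env_object_sizes(demo_env)
print(sizes)

# Worst-case data volume to ship if every worker receives everything.
workers <- 30  # assumed worker count
total_mib <- sum(sizes) * workers / 2^20
cat(sprintf("up to %.1f MiB sent to %d workers\n", total_mib, workers))
```

Calling `env_object_sizes()` with no argument inspects the global environment; the largest entries are the ones worth removing or shrinking before dispatching work.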

More documentation is available here: https://github.com/Azure/doAzureParallel

Can we use sparklyr to solve this kind of problem?

Parallel memory duplication/usage in R?
