Options for speeding up R code when parallelization is not feasible

Time: 2017-12-12 19:20:10

Tags: r performance simulation

I have developed a simulation that, depending on the values of its input parameters, can be very slow (up to 6 hours).

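Profiling my code shows that sample() is the bottleneck, but as far as I know there is no better function (sample() is already optimized for speed, since it is written in C).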
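For readers who want to confirm a bottleneck like this themselves, here is a minimal sketch of profiling with base R's `Rprof()` and `summaryRprof()`. The function `toy_sim` is a hypothetical stand-in for the simulation's inner loop, not the actual code from this question, and the iteration count is an arbitrary choice made so that the profiler collects enough samples.

```r
# Hypothetical toy workload standing in for the simulation's inner loop
toy_sim <- function(n_iter = 2e5) {
  haps <- as.character(1:5)
  out <- integer(n_iter)
  for (i in seq_len(n_iter)) {
    draw <- sample(haps, size = 50, replace = TRUE, prob = rep(0.2, 5))
    out[i] <- length(unique(draw))
  }
  out
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)                 # start the sampling profiler
res <- toy_sim()
Rprof(NULL)                      # stop profiling
prof <- summaryRprof(prof_file)  # aggregate the collected samples
head(prof$by.self)               # time spent inside each function itself
unlink(prof_file)
```

The `by.self` table lists functions by the time spent in their own code, which is usually the quickest way to see whether `sample()` (or the `sample.int()` call inside it) dominates.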

Because the iterations of the algorithm depend on one another, parallelization will not work either.

Are there any suggestions for alternative ways to improve the running time?

Any advice is warmly welcomed.

Here is my simulation code:

Haplotype accumulation curve simulator

HAC.sim <- function(K = 1, N, Hstar, probs, m = 0, perms = 10000, p = 0.95){

pop <- array(dim = c(c(perms, ceiling((1 - m) * N / K)), K))
haps <- as.character(1:Hstar)
specs <- 1:ceiling((1 - m) * N / K)

for (j in 1:perms){
    for (i in 1:K){
        pop[j, specs, i] <- sample(haps, size = length(specs), replace = TRUE, prob = probs)
    }
}


HAC.mat <- array(dim = c(c(perms, length(specs), K)))

for (k in specs){
    for (j in 1:perms){
        for (i in 1:K){
            ind.index <- sample(specs, size = k, replace = FALSE) 
            hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(i, size = 1, replace = TRUE)] 
            HAC.mat[j, k, i] <- length(unique(hap.plot))
        }
    }
}

means <- apply(HAC.mat, MARGIN = 2, mean)
lower <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.025))
upper <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.975))

d <- assign("d", data.frame(specs, means), envir = .GlobalEnv)

P <- max(means)
Q <- Hstar - max(means)
R <- assign("R", max(means) / Hstar, envir = .GlobalEnv)
S <- (Hstar - max(means)) / Hstar
Nstar <- assign("Nstar", (N * Hstar) / max(means), envir = .GlobalEnv)
X <- ((N * Hstar) / max(means)) - N

cat("\n Measures of Sampling Closeness \n \n Mean number of haplotypes sampled: " , P, "\n Mean number of haplotypes not sampled: " , Q, "\n Proportion of haplotypes sampled: " , R, "\n Proportion of haplotypes not sampled:  " , S, "\n \n Calculated mean value of N*: ", Nstar, "\n Mean number of individuals not sampled: ", X, "\n \n")

if (R < p){
    cat("Desired level of H* has not yet been reached \n")
} else{
    cat("Desired level of H* has been reached")
}

par(mfrow = c(1, 2))

plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes",  ylim = c(1, Hstar))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(length(specs) * probs, xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = 1:Hstar)

}

A quick reproducible example

N <- 50
Hstar <- 5
probs <- rep(1/Hstar, Hstar)

HAC.sim(N = N, Hstar = Hstar, probs = probs)

1 Answer:

Answer 0: (score: 0)

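Rcpp is certainly an option, but it requires you to rewrite parts of your code in C++. A non-invasive option is the compiler package, which can give an easy speed-up: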

# Old R version of lapply
slow_func <- function(X, FUN, ...) {
  FUN <- match.fun(FUN)
  if (!is.list(X))
    X <- as.list(X)
  rval <- vector("list", length(X))
  for (i in seq_along(X))
    rval[i] <- list(FUN(X[[i]], ...))
  names(rval) <- names(X)          # keep `names' !
  return(rval)
}

# Compiled versions
require(compiler)
slow_func_compiled <- cmpfun(slow_func)
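
To check whether byte-compilation actually helps in a given case, the plain and compiled versions can be timed side by side. The sketch below uses a hypothetical loop-heavy function, `count_unique_draws`, that only loosely mimics the question's inner loop. Note that since R 3.4.0 the JIT compiler is enabled by default, so on recent versions closures are byte-compiled automatically after a couple of calls and the difference may be small or absent.

```r
library(compiler)

# Hypothetical loop-heavy function, used only for illustration
count_unique_draws <- function(n_iter, pool, k) {
  out <- integer(n_iter)
  for (i in seq_len(n_iter)) {
    out[i] <- length(unique(sample(pool, size = k, replace = FALSE)))
  }
  out
}

# Explicitly byte-compiled copy of the same function
count_unique_draws_cmp <- cmpfun(count_unique_draws)

pool <- rep(as.character(1:5), each = 10)

# Same seed, same sequence of RNG calls: results should match exactly
set.seed(1)
res_plain <- count_unique_draws(1000, pool, 10)
set.seed(1)
res_cmp <- count_unique_draws_cmp(1000, pool, 10)
identical(res_plain, res_cmp)

# Compare timings; actual numbers depend on the machine and R version
system.time(count_unique_draws(1e4, pool, 10))
system.time(count_unique_draws_cmp(1e4, pool, 10))
```

Compilation only removes interpreter overhead in the R-level loop. The time spent inside `sample()` itself is unchanged, so the gain is limited when `sample()` dominates the profile.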

You can read more about it here. However, if sample is the only problem, Rcpp appears to have an alternative implementation, although it does not look like it is faster in all cases.
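
Another base-R possibility, offered here only as a sketch, is to reduce the number of `sample()` calls rather than replace the function. The first loop in the question draws independent haplotypes with replacement, so all `perms * length(specs) * K` values can be drawn in one call and reshaped. The parameter values below are assumptions chosen so the example runs quickly, and integer haplotype codes via `sample.int()` are used instead of character labels, which assumes the labels themselves are not needed during the simulation.

```r
# Illustrative parameter values (assumptions, kept small for a quick run)
K <- 1; N <- 50; Hstar <- 5; m <- 0; perms <- 1000
probs <- rep(1 / Hstar, Hstar)
n_specs <- ceiling((1 - m) * N / K)

# Loop version, mirroring the question: one sampling call per (j, i) pair
fill_loop <- function() {
  pop <- array(dim = c(perms, n_specs, K))
  for (j in seq_len(perms)) {
    for (i in seq_len(K)) {
      pop[j, , i] <- sample.int(Hstar, size = n_specs, replace = TRUE, prob = probs)
    }
  }
  pop
}

# Single-call version: draw every value at once, then reshape
fill_vectorized <- function() {
  array(sample.int(Hstar, size = perms * n_specs * K, replace = TRUE, prob = probs),
        dim = c(perms, n_specs, K))
}

pop_loop <- fill_loop()
pop_vec <- fill_vectorized()
dim(pop_vec)

# Compare timings; actual numbers depend on the machine
system.time(fill_loop())
system.time(fill_vectorized())
```

The two versions produce arrays with the same distribution of entries, not the same arrangement of values. This only addresses the first loop. The second loop still makes several `sample()` calls per iteration, although the two single-value draws there could in principle be pre-drawn in bulk in the same way.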