以下是我一直在开发的模拟中的R代码片段。
hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE),
ind.index,
sample(i, size = 1, replace = TRUE)]
上述代码段中的 ind.index
包含对sample(...)
我在RStudio中分析了我的模拟,这条线确实是运行时和内存方面的瓶颈(~30000 ms运行时和~7000 MB)。
是否有更有效的方式表达以下代码段以便更快?
在完全进入Rcpp之前,我想完全耗尽我的基本R /包选项。
一个选项可能是plyr / dplyr包(dplyr本身依赖于Rcpp)。因为pop
是一个数组,所以为了使用dplyr,需要转换为数据帧。然后,我可以使用sample(...)
sample_n(...)
替换dplyr
。
目标是最终编写一个包,因此,CRAN提交时不允许调用.Internal(sample(...))
,但速度可能更快。
以下是完整代码:
## Set up container(s) to hold the identity of each individual from each permutation ##
num.specs <- ceiling(N / K)
pop <- array(dim = c(c(perms, num.specs), K))
## Create an ID for each haplotype ##
haps <- as.character(1:Hstar)
## Assign individuals (N) to each subpopulation (K) ##
specs <- 1:num.specs
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
for (j in 1:perms) {
for (i in 1:K) {
pop[j, specs, i] <- sample(haps, size = num.specs, replace = TRUE, prob = probs)
}
}
## Make a matrix to hold individuals from each permutation ##
HAC.mat <- array(dim = c(c(perms, num.specs), K))
## Perform haplotype accumulation ##
for (k in specs) {
for (j in 1:perms) {
for (i in 1:K) {
ind.index <- sample(specs, size = k, replace = FALSE) # which individuals are sampled
hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(i, size = 1, replace = TRUE)] # extract those individuals from a permutation
HAC.mat[j, k, i] <- length(unique(hap.plot)) # how many haplotypes recovered a given sampling intensity (k) from each permutation (j)
}
}
}
K <- 1 # number of subpopulations
N <- 100 # number of individuals
Hstar <- 10 # number of haplotypes
probs <- rep(1/Hstar, Hstar) # haplotype frequency distribution
perms <- 10000 # number of permutations
这是一个很小的例子,非常快。但是,我的模拟功能来自调查较大的输入参数值,但这导致代码速度相当慢。
非常感谢任何帮助,并热烈欢迎。
答案 0 :(得分:1)
K <- 1 # number of subpopulations
N <- 100 # number of individuals
Hstar <- 10 # number of haplotypes
probs <- 1/Hstar # haplotype frequency distribution
perms <- 10000
num.specs <- ceiling(N / K)
## Create an ID for each haplotype ##
haps <- seq_len(Hstar)
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
sim_fun <- function()
{
return(sample( x = haps,
size = num.specs,
replace = TRUE,
prob = rep(0.1, Hstar)))
}
set.seed(2L)
pop <- array(dim = c(num.specs, perms, K))
for (i in 1:K) {
pop[, , i] <- replicate(perms, sim_fun())
}
嵌套for循环减少一个级别,这将显着提高效率,因为外循环代表子群的数量,与个体数量和排列数量相比,这很可能是一个小数量。您不能在三种情况下避免采样,因为它们具有三种不同的长度。
# n_ind = number of individuals
# n_perm = number of permutations
# n_subpop = number of subpopulations
# prob = sampling probability
# FUN = summary statistics function
# summary statistics
extract_stats <- function(n_ind, n_perm, n_subpop, prob, FUN, ... )
{
ijk <- dim(pop)
sapply( seq_len(n_subpop), function( y ){
pop_dat <- pop[sample( x = seq_len(ijk[1]), size = n_ind, replace = TRUE, prob = rep( prob, ijk[1] ) ),
sample( x = seq_len(ijk[2]), size = n_perm, replace = TRUE, prob = rep( prob, ijk[2] ) ),
sample( x = seq_len(ijk[3]), size = y, replace = TRUE, prob = rep( prob, ijk[3] ) )]
ifelse( test = is.matrix(pop_dat),
yes = apply( pop_dat, MARGIN = 2, FUN = FUN ),
no = do.call(FUN, c( list(pop_dat), ...) ))
})
}
# median of haplotype id
replicate(10, extract_stats( n_ind = 100, n_perm = 2, n_subpop = 2, prob = 0.1, FUN = median))
# minimum of haplotype id
replicate(10, extract_stats( 100, 2, 2, 0.1, min))
# maximum of haplotype id
replicate(10, extract_stats( 100, 2, 2, 0.1, max))
# histogram of haplotype id distribution
replicate(1, extract_stats( n_ind = 100, n_perm = 1, n_subpop = 1, prob = 0.1, FUN = hist, xlab = "haplotype_id", main = "title"))