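I'm hoping someone can take a quick look at this example and help me find a better, more efficient way to approach this problem. I want to run simulations to examine how animals move between sites under a specific set of conditions. I have 5 sites and some initial probabilities,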
N<-5 # number of sites
sites<-LETTERS[seq(from=1,to=N)]
to.r<-rbind(sites)
p.move.r<-seq.int(0.05,0.95,by=0.1) # prob of moving to a new site
p.leave<-0.01*p.move.r # prob of leaving the system w/out returning
p.move.out<-0.01*p.move.r # prob of moving in/out
p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site
For this example I only include 50 simulations, but in reality I would like to run at least 1000,
set.seed(13973)
reps<-50 # number of replicates/simulations
steps<-100 # number of time steps (hours, days, weeks, etc)
random<-runif(10000,0,1) # generating numbers from a random distribution
# Construct empty df to fill with data
rep.movements<-matrix(NA,nrow=reps,ncol=steps)
colnames(rep.movements)<-c(1:steps);rownames(rep.movements)<-c(1:reps)
rep.use<-matrix(NA,nrow=reps,ncol=N)
colnames(rep.use)<-c(sites);rownames(rep.use)<-c(1:reps)
# Outer loop to run each of the initial parameters
for(w in 1:length(p.stay)){
  p.move<-matrix((p.move.r[w]/(N-1)),N,N)
  diag(p.move)<-0
  # Construction of distance matrix
  move<-matrix(c(0),nrow=(N+2),ncol=(N+2),dimnames=list(c(sites,"NA","left"),c(sites,"NA","left")))
  from<-array(0,c((N+2),(N+2)),dimnames=list(c(sites,"NA","left"),c(sites,"NA","left")))
  to<-array(0,c((N+2),(N+2)),dimnames=list(c(sites,"NA","left"),c(sites,"NA","left")))
  # Filling movement-Matrix construction
  for(from in 1:N){
    for(to in 1:N){
      if(from==to){move[from,to]<-p.stay[w]} else {move[from,to]<-p.move[from,to]}
      move[,(N+1)]<-(1-(p.leave[w]+p.move.out[w]))/N
      move[,(N+2)]<-(1-(p.leave[w]+p.move.out[w]))/N
      move[(N+1),]<-p.move.out[w]
      move[(N+2),]<-p.leave[w]
    }
  }
The idea is to use this cumulative probability matrix to determine the fate of an animal based on a random number,
cumsum.move<-cumsum(data.frame(move)) # Cumulative sum of probabilities
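In this cumulative matrix, the letters "A", "B", "C", "D" and "E" represent the different sites, "NA" represents the probability of leaving and returning at a future time step, and "left" represents the probability of leaving the system without coming back. I then compare a list of random numbers against the cumulative probability matrix to determine the "fate" of that particular animal.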
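As a minimal sketch of this lookup for a single time step, assuming a made-up column of cumulative probabilities (the values, and the names cum.p and fate, are illustrative and not taken from the simulation): the fate is the first row whose cumulative probability exceeds the random number.

```r
# Made-up cumulative probabilities for one current site (one column of the
# cumulative matrix). These values are assumptions for illustration only.
cum.p <- c(A = 0.10, B = 0.40, C = 0.80, D = 0.90, E = 0.98,
           "NA" = 0.99, left = 1.00)

# Hypothetical helper: the fate is the first entry whose cumulative
# probability is strictly greater than the random number u.
fate <- function(u, cum.p) names(cum.p)[which(u < cum.p)[1]]

fate(0.05, cum.p)   # falls below the first threshold
fate(0.85, cum.p)   # falls between the thresholds for C and D
fate(0.995, cum.p)  # falls above every threshold except the last

# The same lookup for several random numbers at once.
u <- c(0.05, 0.85, 0.995)
names(cum.p)[findInterval(u, cum.p) + 1]
```

The simulation loop applies this comparison once per time step, using the column that belongs to the animal's current site.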
  for(o in 1:reps){
    result<-matrix(as.character(""),steps) # Vector for storing sites
    x<-sample(random,steps,replace=TRUE) # sample array of random number
    time.step<-data.frame(x) # time steps used in the simulation (i)
    colnames(time.step)<-c("time.step")
    time.step$event<-""
    j<-sample(1:N,1,replace=T) # first column to be selected
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out
    for(i in 1:steps){
      for (t in 1:(N+1)){
        if(time.step$time.step[i]<cumsum.move[t,j]){
          time.step$event[i]<-to.r[t]
          break
        }
      }
      ifelse(time.step$event[i]=="",break,NA)
      result[i]<-time.step$event[i]
      j<-which(to.r==result[i])
      if(length(j)==0){j<-k}
    }
    result<-time.step$event
    # calculate frequency/use for each replicate
    use<-table(result)
    use.tab<-data.frame(use)
    use.tab1<-use.tab[-which(use.tab==""),]
    mergeuse<-merge(use.tab2,use.tab,all.x=TRUE)
    mergeuse[is.na(mergeuse)]<-0
    # insert data into empty matrix
    rep.movements[o,]<-result
    rep.use[o,]<-mergeuse$Freq
  }
}
# for the outer loop I have some matrices to store the results for each parameter,
# but for this example this is not important
rep.movements
rep.use
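Now, the main problem is that it takes a long time to run all the simulations for each of the initial parameters (10 values in this example). I need a better, more efficient way to run 1000 simulations with 20 sites across all initial parameters. I am not very familiar with functions or other ways to speed up this task. Any ideas or suggestions would be greatly appreciated.
Many thanks in advance,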
Answer 0 (score: 1)
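Let's first wrap your code in a function. I also added set.seed commands to make the results reproducible. You need to remove them before running the real simulation.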
sim1 <- function(reps=50, steps=100 ) {
  N<-5 # number of sites
  sites<-LETTERS[seq(from=1,to=N)]
  to.r<-rbind(sites)
  p.move.r<-seq.int(0.05,0.90,by=0.05) # prob of moving to a new site
  p.leave<-0.01*p.move.r # prob of leaving the system w/out returning
  p.move.out<-0.01*p.move.r # prob of moving in/out
  p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site
  set.seed(42)
  random<-runif(10000,0,1) # generating numbers from a random distribution
  cumsum.move <- read.table(text="A B C D E NA. left
A 0.0820000 0.3407822 0.6392209 0.3516242 0.3925942 0.1964 0.1964
B 0.1254937 0.4227822 0.6940040 0.3883348 0.4196630 0.3928 0.3928
C 0.7959865 0.8730183 0.7760040 0.7930623 0.8765180 0.5892 0.5892
D 0.8265574 0.8980259 0.8095507 0.8750623 0.9000000 0.7856 0.7856
E 0.9820000 0.9820000 0.9820000 0.9820000 0.9820000 0.9820 0.9820
NA. 0.9910000 0.9910000 0.9910000 0.9910000 0.9910000 0.9910 0.9910
left 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000 1.0000",header=TRUE)
  cumsum.move <- as.matrix(cumsum.move)
  for(o in 1:reps){
    result<-matrix(as.character(""),steps) # Vector for storing sites
    set.seed(42)
    x<-sample(random,steps,replace=TRUE) # sample array of random number
    time.step<-data.frame(x) # time steps used in the simulation (i)
    colnames(time.step)<-c("time.step")
    time.step$event<-""
    set.seed(41)
    j<-sample(1:N,1,replace=T) # first column to be selected
    set.seed(40)
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out
    for(i in 1:steps){
      for (t in 1:(N+1)){
        if(time.step$time.step[i]<cumsum.move[t,j]){
          time.step$event[i]<-to.r[t]
          break
        }
      }
      ifelse(time.step$event[i]=="",break,NA)
      result[i]<-time.step$event[i]
      j<-which(to.r==result[i])
      if(length(j)==0){j<-k}
    }
    result<-time.step$event
  }
  result
}
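A small sketch of what the set.seed calls do, and why they have to go before a real simulation run: resetting the seed replays the same random draws, so with a fixed seed inside the loop every replicate comes out identical. The variable names below are illustrative only.

```r
# Resetting the seed replays the same sequence of random draws.
set.seed(42)
first <- sample(1:5, 3, replace = TRUE)
set.seed(42)
second <- sample(1:5, 3, replace = TRUE)
identical(first, second)

# With the seed reset inside each iteration, all "replicates" are the same draw.
draws <- sapply(1:4, function(o) {
  set.seed(42)
  runif(1)
})
length(unique(draws))
```

That is convenient for comparing two implementations, but it would make 1000 simulations collapse into one repeated trajectory.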
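Note that result is overwritten during each iteration of o. I don't think you want that, so I fixed it. Also, you use a data.frame inside the loop. As a general rule, you should avoid data.frames inside loops like the plague. Although they are very convenient, they are terrible in terms of efficiency.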
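To see the cost in isolation, here is a minimal sketch that fills a column element by element, once inside a data.frame and once in a plain character vector. The helper names fill_df and fill_vec and the size n are made up for this illustration and are not part of the original code; actual timings depend on the machine and the R version.

```r
n <- 2000  # hypothetical number of steps, chosen only to make the difference visible

# Element-wise assignment into a data.frame column.
fill_df <- function(n) {
  d <- data.frame(event = character(n), stringsAsFactors = FALSE)
  for (i in seq_len(n)) d$event[i] <- "A"
  d$event
}

# Element-wise assignment into a plain character vector.
fill_vec <- function(n) {
  event <- character(n)
  for (i in seq_len(n)) event[i] <- "A"
  event
}

identical(fill_df(n), fill_vec(n))  # same content either way
system.time(fill_df(n))
system.time(fill_vec(n))
```

Both helpers return the same vector; only the container differs. sim2 below applies the same idea to the simulation loop.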
sim2 <- function(reps=50, steps=100) {
  N<-5 # number of sites
  sites<-LETTERS[seq(from=1,to=N)]
  to.r<-rbind(sites)
  p.move.r<-seq.int(0.05,0.90,by=0.05) # prob of moving to a new site
  p.leave<-0.01*p.move.r # prob of leaving the system w/out returning
  p.move.out<-0.01*p.move.r # prob of moving in/out
  p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site
  set.seed(42)
  random<-runif(10000,0,1) # generating numbers from a random distribution
  cumsum.move <- read.table(text="A B C D E NA. left
A 0.0820000 0.3407822 0.6392209 0.3516242 0.3925942 0.1964 0.1964
B 0.1254937 0.4227822 0.6940040 0.3883348 0.4196630 0.3928 0.3928
C 0.7959865 0.8730183 0.7760040 0.7930623 0.8765180 0.5892 0.5892
D 0.8265574 0.8980259 0.8095507 0.8750623 0.9000000 0.7856 0.7856
E 0.9820000 0.9820000 0.9820000 0.9820000 0.9820000 0.9820 0.9820
NA. 0.9910000 0.9910000 0.9910000 0.9910000 0.9910000 0.9910 0.9910
left 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000 1.0000",header=TRUE)
  cumsum.move <- as.matrix(cumsum.move)
  res <- list()
  for(o in 1:reps){
    result<-character(steps) # Vector for storing sites
    set.seed(42)
    time.step<-sample(random,steps,replace=TRUE) # sample array of random number
    #time.step<-data.frame(x) # time steps used in the simulation (i)
    #colnames(time.step)<-c("time.step")
    #time.step$event<-""
    event <- character(steps)
    set.seed(41)
    j<-sample(1:N,1,replace=T) # first column to be selected
    set.seed(40)
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out
    for(i in 1:steps){
      for (t in 1:(N+1)){
        if(time.step[i]<cumsum.move[t,j]){
          event[i]<-to.r[t]
          break
        }
      }
      ifelse(event[i]=="",break,NA)
      result[i]<-event[i]
      j<-which(to.r==result[i])
      if(length(j)==0){j<-k}
    }
    res[[o]]<-event
  }
  do.call("rbind",res)
}
Do both functions give the same result?
res1 <- sim1()
res2 <- sim2()
all.equal(res1,res2[1,])
[1] TRUE
Is the new version faster?
library(microbenchmark)
microbenchmark(sim1(),sim2())
Unit: milliseconds
expr min lq median uq max
1 sim1() 204.46339 206.58508 208.38035 212.93363 269.41693
2 sim2() 77.55247 78.39698 79.30539 81.73413 86.84398
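Well, cutting the run time to roughly a third is already quite good. Because of those breaks, I don't see a possibility to improve the loops further. That leaves only parallelization as an option.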
sim3 <- function(ncore=1,reps=50, steps=100) {
  require(foreach)
  require(doParallel)
  N<-5 # number of sites
  sites<-LETTERS[seq(from=1,to=N)]
  to.r<-rbind(sites)
  p.move.r<-seq.int(0.05,0.90,by=0.05) # prob of moving to a new site
  p.leave<-0.01*p.move.r # prob of leaving the system w/out returning
  p.move.out<-0.01*p.move.r # prob of moving in/out
  p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site
  set.seed(42)
  random<-runif(10000,0,1) # generating numbers from a random distribution
  cumsum.move <- read.table(text="A B C D E NA. left
A 0.0820000 0.3407822 0.6392209 0.3516242 0.3925942 0.1964 0.1964
B 0.1254937 0.4227822 0.6940040 0.3883348 0.4196630 0.3928 0.3928
C 0.7959865 0.8730183 0.7760040 0.7930623 0.8765180 0.5892 0.5892
D 0.8265574 0.8980259 0.8095507 0.8750623 0.9000000 0.7856 0.7856
E 0.9820000 0.9820000 0.9820000 0.9820000 0.9820000 0.9820 0.9820
NA. 0.9910000 0.9910000 0.9910000 0.9910000 0.9910000 0.9910 0.9910
left 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000 1.0000",header=TRUE)
  cumsum.move <- as.matrix(cumsum.move)
  #res <- list()
  #for(o in 1:reps){
  cl <- makeCluster(ncore)
  registerDoParallel(cl)
  res <- foreach(1:reps) %dopar% {
    result<-character(steps) # Vector for storing sites
    set.seed(42)
    time.step<-sample(random,steps,replace=TRUE) # sample array of random number
    #time.step<-data.frame(x) # time steps used in the simulation (i)
    #colnames(time.step)<-c("time.step")
    #time.step$event<-""
    event <- character(steps)
    set.seed(41)
    j<-sample(1:N,1,replace=T) # first column to be selected
    set.seed(40)
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out
    for(i in 1:steps){
      for (t in 1:(N+1)){
        if(time.step[i]<cumsum.move[t,j]){
          event[i]<-to.r[t]
          break
        }
      }
      ifelse(event[i]=="",break,NA)
      result[i]<-event[i]
      j<-which(to.r==result[i])
      if(length(j)==0){j<-k}
    }
    #res[[o]]<-event
    event
  }
  stopCluster(cl)
  do.call("rbind",res)
}
Same result?
res3 <- sim3()
all.equal(res1,c(res3[1,]))
[1] TRUE
Faster? (We use 4 cores on a Mac. You could try to get access to a server with more cores.)
microbenchmark(sim1(),sim2(),sim3(4))
Unit: milliseconds
expr min lq median uq max
1 sim1() 202.28200 207.64932 210.32582 212.69869 255.2732
2 sim2() 75.39295 78.95882 80.01607 81.49027 125.0866
3 sim3(4) 1031.02755 1046.41610 1052.72710 1061.74057 1091.2175
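That looks horrible. However, the test is not fair to the parallel function. The function is called 100 times with only 50 replicates each. That means we pay all the overhead of parallelization but get almost no benefit from it. Let's make it fairer: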
microbenchmark(sim1(rep=10000),sim2(rep=10000),sim3(ncore=4,rep=10000),times=1)
Unit: seconds
expr min lq median uq max
1 sim1(rep = 10000) 42.16821 42.16821 42.16821 42.16821 42.16821
2 sim2(rep = 10000) 16.13822 16.13822 16.13822 16.13822 16.13822
3 sim3(ncore = 4, rep = 10000) 38.18873 38.18873 38.18873 38.18873 38.18873
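Better, but still not impressive. The parallel function would look good if the number of replicates and steps were increased further, but I don't know whether you need that.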
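A rough way to read the two benchmark tables above is to fit a line through the two sim3 medians, treating run time as a fixed cost per call plus a cost per replicate. This is only a back-of-the-envelope sketch: it assumes run time is linear in reps, it rests on two measurements (one of them a single run), and the variable names are made up for the illustration.

```r
# Median timings copied from the benchmark tables above, in milliseconds.
reps   <- c(50, 10000)
sim2.t <- c(80.01607, 16138.22)
sim3.t <- c(1052.72710, 38188.73)

# Two-point linear model for sim3: time = fixed + per.rep * reps
per.rep <- diff(sim3.t) / diff(reps)
fixed   <- sim3.t[1] - per.rep * reps[1]

round(c(fixed.ms = fixed, per.rep.ms = per.rep), 2)

# Per-replicate cost of the serial version, for comparison.
round(sim2.t / reps, 2)
```

Under this crude model the parallel version pays roughly 0.87 s per call before doing any work and about 3.7 ms per replicate, against about 1.6 ms per replicate for sim2. That is consistent with a single replicate of 100 steps being a very small unit of work to hand to a worker, so more work per replicate (for example more steps) would be expected to shift the balance.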