My task is to assess how various environmental variables affect annual population fluctuations. For this, I need to fit a Poisson autoregressive model for time-series counts:

N_{i,j} ~ Poisson(\mu_{i,j})
log(\mu_{i,j}) = log(\mu_{i-1,j}) + \alpha + \beta * x_{i,j} + \gamma_{i}

where N_{i,j} is the observed number of individuals at site j in year i and x_{i,j} is the environmental variable at site j in year i; these are the input data. The remaining quantities are parameters: \mu_{i,j} is the expected number of individuals at site j in year i, and \gamma_{i} is a random effect for each year.
Is it possible to fit such a model in R? I would like to avoid fitting it in a Bayesian framework, because the computation takes far too long (I have to deal with 5000 of these models). I tried to convert the model to a GLM, but once I had to add the random effect (gamma) it was no longer possible.
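For orientation, a rough GLMM analogue of the model is sketched here: the previous year's count goes into an offset and the annual deviates become a random intercept per year. This is only an approximation (it conditions on the observed lagged counts rather than on the latent expected values), and the data frame d and its columns are hypothetical:

library(lme4)
# rough GLMM approximation: log E[N_i] = log N_{i-1} + alpha + beta*env + gamma_year
m <- glmer(count ~ env + offset(log(prev_count)) + (1 | year),
           data = d,   # hypothetical long-format data: count, prev_count, env, year
           family = poisson)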
Answer 0 (score: 10)
First, let's create some simulated data (all the ad hoc functions used here are given at the end of the answer):
set.seed(12345) # updated to T=20 and L=40 for comparative purposes.
T = 20 # number of years
L = 40 # number of sites
N0 = 100 # average initial pop (to simulate data)
sd_env = 0.8 # to simulate the env (assumed mean 0)
env = matrix(rnorm(T*L, mean=0, sd=sd_env), nrow=T, ncol=L)
# 'real' parameters
alpha = 0.1
beta = 0.05
sd = 0.4
gamma = rnorm(T-1, mean=0, sd=sd)
mu_ini = log(rpois(n=L, lambda=N0)) #initial means
par_real = list(alpha=alpha, beta=beta, gamma=gamma,
sd=sd, mu_ini=mu_ini)
mu = dynamics(par=par_real, x=env, T=T, L=L)
# observed abundances
n = matrix(rpois(length(mu), lambda=mu), nrow=T, ncol=L)
Now, for a given set of parameters, we can simulate the expected number of individuals for every site and year. Since we also have the observed numbers of individuals, we can write the likelihood function of the observations (Poisson distributed) and add a penalty for the annual deviates of the growth rate (to make them normally distributed). For this, the function dynamics will simulate the model and the function .getLogLike will calculate the objective function, which we then need to optimize. The parameters to estimate are alpha, beta, the annual deviates (gamma) and the initial expected number of individuals (mu_ini), and maybe sigma.
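Written out, the objective maximized here is the Poisson log-likelihood of the observations (dropping the constant \log N_{i,j}! term) minus a Gaussian penalty on the annual deviates; this is what .getLogLike below computes, and optim minimizes its negative:

$$\ell = \sum_{i,j}\left(N_{i,j}\,\log\mu_{i,j} - \mu_{i,j}\right) - \sum_{i}\frac{\gamma_i^{2}}{2\sigma^{2}}$$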
For a first try, we can provide 0 as the initial guess for every parameter, except for the initial expected numbers, where we can use the initial observed abundances (which are their maximum-likelihood estimates anyway, since the Poisson MLE of the mean given a single observation is that observation).
fit0 = fitModel0(obs=n, env=env, T=T, L=L)
Optimal parameters:
alpha beta gamma1 gamma2 gamma3
0.28018842 0.05464360 -0.12904373 -0.15795001 -0.04502903
gamma4 gamma5 gamma6 gamma7 gamma8
0.05045117 0.08435066 0.28864816 0.24111786 -0.80569709
gamma9 gamma10 gamma11 gamma12 gamma13
0.22786951 0.10326086 -0.50096088 -0.08880594 -0.33392310
gamma14 gamma15 gamma16 gamma17 gamma18
0.22664634 -0.47028311 0.11782381 -0.16328820 0.04208037
gamma19 mu_ini1 mu_ini2 mu_ini3 mu_ini4
0.17648808 4.14267523 4.19187205 4.05573114 3.90406443
mu_ini5 mu_ini6 mu_ini7 mu_ini8 mu_ini9
4.08975038 4.17185883 4.03679049 4.23091760 4.04940333
mu_ini10 mu_ini11 mu_ini12 mu_ini13 mu_ini14
4.19355333 4.05543081 4.15598515 4.18266682 4.09095730
mu_ini15 mu_ini16 mu_ini17 mu_ini18 mu_ini19
4.17922360 3.87211968 4.04509178 4.19385641 3.98403521
mu_ini20 mu_ini21 mu_ini22 mu_ini23 mu_ini24
4.08531659 4.19294203 4.29891769 4.21025211 4.16297457
mu_ini25 mu_ini26 mu_ini27 mu_ini28 mu_ini29
4.19265543 4.28925869 4.10752810 4.10957212 4.14953247
mu_ini30 mu_ini31 mu_ini32 mu_ini33 mu_ini34
4.09690570 4.34234547 4.18169575 4.01663339 4.32713905
mu_ini35 mu_ini36 mu_ini37 mu_ini38 mu_ini39
4.08121891 3.98256819 4.08658375 4.05942834 4.06988174
mu_ini40
4.05655031
This works, but normally some parameters are correlated and harder to identify from the data, so a sequential approach can be used (see Bolker et al. 2013 for more information). Here we increase the number of parameters progressively, improving the initial guess for the optimization at each phase of the calibration. In this example, the first phase estimates only alpha and beta, using as starting values the coefficients of a linear model of the growth rate on the environment. In the second phase, we use the estimates from the first optimization and add the annual deviates (gamma) as parameters. Finally, we use the estimates from the second optimization and add the initial expected values as parameters. We add the initial expected values last because we assume the initial observed values are already very close to them and a good starting guess, whereas at the start we have no clear idea about the values of the remaining parameters.
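The phase-1 starting values rely on the fact that, under the model, log(N_{i+1,j}/N_{i,j}) is roughly alpha + beta*x_{i,j} + gamma_i, so a plain regression of the observed log growth rates on the environment gives rough values for alpha and beta. A minimal sketch of that step (it mirrors the first lines of fitModel below):

r = log(n[-1,]/n[-T,])                                  # (T-1) x L observed log growth rates
guess = data.frame(rate=as.numeric(r), env=as.numeric(env[-T,]))
coef(lm(rate ~ env, data=guess))                        # intercept ~ alpha, slope ~ beta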
fit = fitModel(obs=n, env=env, T=T, L=L)
Phase 1: alpha and beta only
Optimal parameters:
alpha beta
0.18172961 0.06323379
neg-LogLikelihood: -5023687
Phase 2: alpha, beta and gamma
Optimal parameters:
alpha beta gamma1 gamma2 gamma3
0.20519928 0.06238850 -0.35908716 -0.21453015 -0.05661066
gamma4 gamma5 gamma6 gamma7 gamma8
0.18963170 0.17800563 0.34303170 0.28960181 -0.72374927
gamma9 gamma10 gamma11 gamma12 gamma13
0.28464203 0.16900331 -0.40719047 -0.01292168 -0.25535610
gamma14 gamma15 gamma16 gamma17 gamma18
0.28806711 -0.38924648 0.19224527 -0.07875934 0.10880154
gamma19
0.24518786
neg-LogLikelihood: -5041345
Phase 3: alpha, beta, gamma and mu_ini
Optimal parameters:
alpha beta gamma1 gamma2
0.1962334008 0.0545361273 -0.4298024242 -0.1984379386
gamma3 gamma4 gamma5 gamma6
0.0240318556 0.1909639571 0.1116636126 0.3465693397
gamma7 gamma8 gamma9 gamma10
0.3478695629 -0.7500599493 0.3600551021 0.1361405398
gamma11 gamma12 gamma13 gamma14
-0.3874453347 -0.0005839263 -0.2305008546 0.2819608670
gamma15 gamma16 gamma17 gamma18
-0.3615273177 0.1792020332 -0.0685681922 0.1203006457
gamma19 mu_ini1 mu_ini2 mu_ini3
0.2506129351 4.6639314468 4.7205977429 4.5802529032
mu_ini4 mu_ini5 mu_ini6 mu_ini7
4.4293994068 4.6182382472 4.7039110982 4.5668031666
mu_ini8 mu_ini9 mu_ini10 mu_ini11
4.7610910879 4.5844180026 4.7226353021 4.5823048717
mu_ini12 mu_ini13 mu_ini14 mu_ini15
4.6814189824 4.7130039559 4.6135420745 4.7100006841
mu_ini16 mu_ini17 mu_ini18 mu_ini19
4.4080115751 4.5758092977 4.7209394881 4.5150790425
mu_ini20 mu_ini21 mu_ini22 mu_ini23
4.6171948847 4.7141188899 4.8303375556 4.7392110431
mu_ini24 mu_ini25 mu_ini26 mu_ini27
4.6893526309 4.7237687961 4.8234804183 4.6333012324
mu_ini28 mu_ini29 mu_ini30 mu_ini31
4.6392335265 4.6817044754 4.6260620666 4.8713345071
mu_ini32 mu_ini33 mu_ini34 mu_ini35
4.7107116580 4.5471434622 4.8540773708 4.6129553933
mu_ini36 mu_ini37 mu_ini38 mu_ini39
4.5134108799 4.6231016082 4.5823454113 4.5969785420
mu_ini40
4.5835763300
neg-LogLikelihood: -5047251
Comparing the two calibrations of the model, we can see that the second one gives a lower value of the objective function. Also, comparing the correlation between the 'real' annual deviates and the estimated ones, we get a higher correlation for the second calibration:
cor(gamma, fit0$par$gamma)
[1] 0.8708379
cor(gamma, fit$par$gamma)
[1] 0.9871758
Looking at the outputs, we can see there are some problems in the first calibration with the estimation of the initial expected values (they are underestimated for all sites); with real data, a multi-phase calibration normally works much better:
par(mfrow=c(3,2), mar=c(3,5,1,1), oma=c(1,1,1,1))
for(i in 1:4) {
  ylim = c(0, 1.1*log(max(fit$fitted, n)))
  plot(log(fit$fitted[,i]), type="l", col="blue", ylim=ylim,
       ylab="mu (log)")
  lines(log(fit0$fitted[,i]), col="green")
  points(log(mu[,i]), col="red")
  mtext(paste("Site ", i), 3, adj=0.05, line=-2)
  if(i==3) {
    mtext(c("real", "fitModel0", "fitModel"), 1, adj=0.95,
          line=-1.5:-3.5, col=c("red", "green", "blue"), cex=0.8)
  }
}
mus = rbind(mu_ini, fit$par$mu_ini, fit0$par$mu_ini)
barplot(mus, beside=TRUE, col=c("red", "blue", "green"),
        ylab="Initial expected population",
        xlab="Sites", border=NA)
gam = rbind(gamma, fit$par$gamma, fit0$par$gamma)
barplot(gam, beside=TRUE, col=c("red", "blue", "green"),
        ylab="Annual deviates", border=NA)
Finally,
system.time(fitModel(obs=n, env=env, T=T, L=L))
user system elapsed
9.85 0.00 9.85
This is four times slower than the solution proposed by @Thierry using INLA (timing from summary(model)):
Time used:
Pre-processing Running inla Post-processing Total
0.1070 2.3131 0.0460 2.4661
However, after byte-compiling my functions, we get:
user system elapsed
7.53 0.00 7.53
That is 24% faster, and now only about 3 times slower than the INLA approach. Still, I think it is reasonable even for thousands of experiments (my own model takes 5 days for a single optimization, so maybe I am biased here), and since we are using simulated data we can also compare the reliability of the parameter estimates, in addition to the computer time.
# The functions -----------------------------------------------------------
require(compiler)
dynamics = function(par, obs, x, T, L) {
  # Simulate the expected abundances: mu[1,] comes from par$mu_ini (or, if absent,
  # from the first observations), then log(mu[t+1,]) = log(mu[t,]) + alpha + beta*x[t,] + gamma[t]
  alpha  = par$alpha
  beta   = par$beta
  gamma  = if(!is.null(par$gamma)) par$gamma else rep(0, T-1)
  mu_ini = if(!is.null(par$mu_ini)) exp(par$mu_ini) else obs[1,]
  mu = matrix(nrow=T, ncol=L)
  mu[1,] = mu_ini
  for(t in seq_len(T-1)) {
    log_mu_new = log(mu[t,]) + alpha + beta*x[t,] + gamma[t]
    mu[t+1, ] = exp(log_mu_new)
  }
  return(mu)
}
dynamics = cmpfun(dynamics)
reListPars = function(par) {
  # Rebuild the named parameter list from the flat vector used by optim
  out = list()
  out$alpha = as.numeric(par["alpha"])
  out$beta = as.numeric(par["beta"])
  if(!is.na(par["sd"])) out$sd = as.numeric(par["sd"])
  gammas = as.numeric(par[grepl("gamma", names(par))])
  if(length(gammas)>0) out$gamma = gammas
  mu_inis = as.numeric(par[grepl("mu_ini", names(par))])
  if(length(mu_inis)>0) out$mu_ini = mu_inis
  return(out)
}
reListPars = cmpfun(reListPars)
.getLogLike = function(par, obs, env, T, L) {
  # Negative penalized log-likelihood: Poisson log-likelihood of the observations
  # (without the constant log-factorial term) plus a Gaussian penalty on gamma
  par = reListPars(par)
  if(is.null(par$sd)) {
    par$sd = if(!is.null(par$gamma)) sd(par$gamma)+0.01 else 1
  }
  mu = dynamics(par=par, obs=obs, x=env, T=T, L=L)
  logLike = sum(obs*log(mu) - mu) - sum(par$gamma^2/(2*par$sd^2))
  return(-logLike)
}
.getLogLike = cmpfun(.getLogLike)
.getUpper = function(par) {
  par$alpha = 10*par$alpha + 1
  par$beta = 10*abs(par$beta) + 1
  if(!is.null(par$gamma)) {
    if(!is.null(par$sd)) sd = par$sd else sd = sd(par$gamma)
    if(sd==0) sd = 1
    par$gamma = rep(qnorm(0.999, sd=sd), length(par$gamma))
  }
  if(!is.null(par$mu_ini)) par$mu_ini = 5*par$mu_ini
  if(!is.null(par$sd)) par$sd = 10*par$sd
  par = unlist(par)
  return(par)
}
.getUpper = cmpfun(.getUpper)
.getLower = function(par) {
  par$alpha = 0 # alpha>0?
  par$beta = -10*abs(par$beta) - 1
  if(!is.null(par$gamma)) {
    if(!is.null(par$sd)) sd = par$sd else sd = sd(par$gamma)
    if(sd==0) sd = 1
    par$gamma = rep(qnorm(1-0.999, sd=sd), length(par$gamma))
  }
  if(!is.null(par$mu_ini)) par$mu_ini = 0.2*par$mu_ini
  if(!is.null(par$sd)) par$sd = 0.0001*par$sd
  par = unlist(par)
  return(par)
}
.getLower = cmpfun(.getLower)
fitModel = function(obs, env, T, L) {
  r = log(obs[-1,]/obs[-T,])
  guess = data.frame(rate=as.numeric(r), env=as.numeric(env[-T,]))
  mod1 = lm(rate ~ env, data=guess)
  output = list()
  output$par = NULL
  # Phase 1: alpha and beta only
  cat("Phase 1: alpha and beta only\n")
  par = list()
  par$alpha = as.numeric(coef(mod1)[1])
  par$beta = as.numeric(coef(mod1)[2])
  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL,
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par),
                         upp=.getUpper(par))
  output$phase1 = opt
  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")
  # Phase 2: alpha, beta and gamma
  cat("Phase 2: alpha, beta and gamma\n")
  optpar = reListPars(opt$par)
  par$alpha = optpar$alpha
  par$beta = optpar$beta
  par$gamma = rep(0, T-1)
  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL,
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par),
                         upp=.getUpper(par))
  output$phase2 = opt
  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")
  # Phase 3: alpha, beta, gamma and mu_ini
  cat("Phase 3: alpha, beta, gamma and mu_ini\n")
  optpar = reListPars(opt$par)
  par$alpha = optpar$alpha
  par$beta = optpar$beta
  par$gamma = optpar$gamma
  par$mu_ini = log(obs[1,])
  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL,
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par),
              control=list(maxit=1000))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par),
                         upp=.getUpper(par))
  output$phase3 = opt
  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")
  output$par = reListPars(opt$par)
  output$fitted = dynamics(par=output$par, obs=obs, x=env, T=T, L=L)
  output$observed = obs
  output$env = env
  return(output)
}
fitModel = cmpfun(fitModel)
fitModel0 = function(obs, env, T, L) {
  output = list()
  output$par = NULL
  par = list()
  par$alpha = 0
  par$beta = 0
  par$gamma = rep(0, T-1)
  par$mu_ini = log(obs[1,])
  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL,
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par),
                         upp=.getUpper(par))
  output$phase1 = opt
  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")
  output$par = reListPars(opt$par)
  output$fitted = dynamics(par=output$par, obs=obs, x=env, T=T, L=L)
  output$observed = obs
  output$env = env
  return(output)
}
fitModel0 = cmpfun(fitModel0)
Answer 1 (score: 0)
The model formulation is not the same as the one you gave, but from the title of your question it seems the hhh4 function in the surveillance package on CRAN might be of interest. It lets you fit Poisson autoregressive models with random effects. There are some examples at the bottom of the documentation for that function. I believe that currently the fixed effects have to be limited to an intercept, a long-term time trend, and a seasonal component for each site, but perhaps that will work for you.
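For illustration only, a very rough and untested sketch of what such an hhh4 call can look like, assuming n is the years-by-sites count matrix from the answer above; random intercepts are requested with ri() inside the component formulas (see ?hhh4 and the package vignettes for the actual interface and options):

library(surveillance)
counts <- sts(observed = n)                  # wrap the count matrix in an "sts" object
fit <- hhh4(counts, control = list(
  ar  = list(f = ~ 1),                       # autoregression on the previous year's counts
  end = list(f = ~ -1 + ri(type = "iid")),   # endemic component with a random intercept per site
  family = "Poisson"
))
summary(fit)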
Answer 2 (score: 0)
Have a look at the INLA package, http://www.r-inla.org. It is Bayesian, but it uses the integrated nested Laplace approximation, which makes the speed of a model comparable to that of frequentist models (glm, glmm). Starting from the mu and env of Ricardo Oliveros-Ramos, with L = 40, first prepare the dataset:
dataset <- data.frame(
  count = rpois(length(mu), lambda = mu),
  year = rep(seq_len(T), L),
  site = rep(seq_len(L), each = T),
  env = as.vector(env)
)
library(reshape2)
n <- as.matrix(dcast(year ~ site, data = dataset, value.var = "count")[, -1])
dataset$year2 <- dataset$year
Run the model:
library(INLA)
system.time(
  model <- inla(
    count ~
      env +
      f(year, model = "ar1", replicate = site) +
      f(year2, model = "iid"),
    data = dataset,
    family = "poisson"
  )
)
user system elapsed
0.18 0.14 3.77
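To look at the estimates rather than just the timing, the standard fields of the inla result object can be inspected (a quick sketch; summary(model) gives the full overview, including the timing shown above):

model$summary.fixed            # posterior summaries of the intercept and the env effect (beta)
model$summary.random$year2     # posterior summaries of the iid annual effects (gamma)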
Compare the speed with Ricardo's solution:
system.time(test <- fitModel(obs=n, env=env, T=T, L=L))
user system elapsed
11.06 0.00 11.06
Compare the speed with a frequentist glmm (without autocorrelation):
library(lme4)
system.time(
  m <- glmer(
    count ~ env + (1|site) + (1|year),
    data = dataset,
    family = poisson
  )
)
user system elapsed
0.44 0.00 0.44
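To check the environmental effect against the simulated value (beta = 0.05 above), the fixed and random effects of the glmer fit can be extracted (a quick sketch):

fixef(m)         # intercept and env coefficient
ranef(m)$year    # estimated annual random effects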
Speed of inla without autocorrelation:
system.time(
  model <- inla(
    count ~
      env +
      f(site, model = "iid") +
      f(year, model = "iid"),
    data = dataset,
    family = "poisson"
  )
)
user system elapsed
0.19 0.11 2.09