GAM模型预测与平均值之间的巨大差异

时间:2020-06-11 00:49:25

标签: r ggplot2 model gam

我有3个预测变量:年份(2010-2019),管理(Tabu / Open)和捕鱼方法(矛,钓线,拖钓或净钓)。我已将GAM设置为三个响应变量:工作量(小时),捕获量(重量,kg)和效率(单位工作量捕获量,CPUE)。对于所有这三个模型,管理部门都将Year设置为Gam,并且Management和Method也包含为固定效果(此组合的AIC最低,与我们的问题最相关)。这三个响应变量均存在偏差,需要进行LogX + 1转换。转换之后,权重和CPUE仍然偏斜,尽管可以接受残差直方图和残差与拟合值的关系图,也可以进行过度分散测试。

数据有很多可变性,因此模型只能解释5-10%的偏差。

使用visreg,我已经绘制了覆盖原始数据的模型。但是鉴于它的传播,这很难解释。相反,我随后使用geom_pointrange在模型上绘制了均值/ SE。

  • 对于“小时”图,模型非常适合每种年/管理类型的均值和误差。

  • 对于权重图,该模型预测的平均值远低于每年的平均值/ SE,并且Management =“ Tabu open”的斜率不正确。

  • 对于CPUE,模型形状大致正确,但远高于均值/ SE点范围。

鉴于它可以工作数小时,我猜想问题在于重量和CPUE的数据偏斜,即使模型验证是可以接受的。

为什么重量和CPUE的模型与均值/ SE数据之间的重叠很少,而小时数却没有?

还有一种更好的方法来绘制模型,而不是用1.原始数据或2.均值覆盖吗?

代码如下:

summary(reeffish)

      Year            Management       Method        Hours            Weight             CPUE        
 Min.   :2010   Always open:398   Line    :417   Min.   : 0.500   Min.   :  0.000   Min.   : 0.0000  
 1st Qu.:2011   Tabu open  :312   Net     : 49   1st Qu.: 3.000   1st Qu.:  1.196   1st Qu.: 0.2735  
 Median :2013                     Spear   :217   Median : 4.420   Median :  2.200   Median : 0.5092  
 Mean   :2013                     Trolling: 27   Mean   : 5.689   Mean   :  4.108   Mean   : 0.8665  
 3rd Qu.:2014                                    3rd Qu.: 6.567   3rd Qu.:  4.234   3rd Qu.: 0.9472  
 Max.   :2019                                    Max.   :67.330   Max.   :152.000   Max.   :14.7800                                                               
#Fit model
Hours.gam1 = gam(log(Hours+1) ~ s(Year, k=5,by=Management) + Management+Method, data= reeffish)
Weight.gam1 = gam(log(Weight+1) ~ s(Year, k=5,by=Management) + Management+Method, data= reeffish)
CPUE.gam1 = gam(log(CPUE+1) ~ s(Year, k=5,by=Management) + Management+Method,data= reeffish)
#Model Validation
#Hours
E3 <- resid(Hours.gam1); F3 <- fitted(Hours.gam1) # get the residuals and fitted values
par(mfrow=c(1,2),mar=c(5,4,3,3))  # this is setting up a multi-panel plot window
hist(E3, breaks=25,main="",xlab="Residuals")
plot(x=F3, y=E3, xlab="Fitted values", ylab="Residuals")
abline(0,0)

#Weight
E3 <- resid(Weight.gam1); F3 <- fitted(Weight.gam1) # get the residuals and fitted values
par(mfrow=c(1,2),mar=c(5,4,3,3))  # this is setting up a multi-panel plot window
hist(E3, breaks=25,main="",xlab="Residuals")
plot(x=F3, y=E3, xlab="Fitted values", ylab="Residuals")
abline(0,0)

#CPUE
E3 <- resid(CPUE.gam1); F3 <- fitted(CPUE.gam1) # get the residuals and fitted values
par(mfrow=c(1,2),mar=c(5,4,3,3))  # this is setting up a multi-panel plot window
hist(E3, breaks=25,main="",xlab="Residuals")
plot(x=F3, y=E3, xlab="Fitted values", ylab="Residuals")
abline(0,0)

Residual plots of Hours, Weight and CPUE

#Model Output
summary(Hours.gam1)
summary(Weight.gam1)
summary(CPUE.gam1)
Family: gaussian 
Link function: identity 

Formula:
log(Hours + 1) ~ s(Year, k = 5, by = Management) + Management + 
    Method

Parametric coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.637127   0.030477  53.717  < 2e-16 ***
ManagementTabu open  0.123398   0.051120   2.414 0.016040 *  
MethodNet            0.284026   0.080653   3.522 0.000457 ***
MethodSpear          0.001778   0.047798   0.037 0.970341    
MethodTrolling      -0.224008   0.103047  -2.174 0.030054 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                edf Ref.df     F p-value  
s(Year):ManagementAlways open 1.000  1.000 2.660  0.1033  
s(Year):ManagementTabu open   1.908  2.262 3.463  0.0219 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.043   Deviance explained = 5.24%
GCV = 0.27033  Scale est. = 0.2673    n = 704

Family: gaussian 
Link function: identity 

Formula:
log(Weight + 1) ~ s(Year, k = 5, by = Management) + Management + 
    Method

Parametric coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.14805    0.03944  29.107  < 2e-16 ***
ManagementTabu open -0.06480    0.06473  -1.001    0.317    
MethodNet            0.57534    0.10475   5.493 5.56e-08 ***
MethodSpear          0.31332    0.06173   5.076 4.96e-07 ***
MethodTrolling      -0.09969    0.13294  -0.750    0.454    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                edf Ref.df     F p-value
s(Year):ManagementAlways open 1.742  2.021 1.809   0.161
s(Year):ManagementTabu open   1.770  2.118 0.758   0.406

R-sq.(adj) =  0.0714   Deviance explained = 8.14%
GCV = 0.44714  Scale est. = 0.44173   n = 704

Family: gaussian 
Link function: identity 

Formula:
log(CPUE + 1) ~ s(Year, k = 5, by = Management) + Management + 
    Method

Parametric coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          0.46682    0.02180  21.418  < 2e-16 ***
ManagementTabu open -0.09232    0.03157  -2.924  0.00357 ** 
MethodNet            0.25574    0.05789   4.418 1.16e-05 ***
MethodSpear          0.18187    0.03409   5.335 1.29e-07 ***
MethodTrolling       0.04806    0.07345   0.654  0.51306    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                               edf Ref.df     F p-value
s(Year):ManagementAlways open 1.82  2.092 1.813   0.154
s(Year):ManagementTabu open   1.00  1.000 0.922   0.337

R-sq.(adj) =  0.0725   Deviance explained = 8.15%
GCV = 0.13622  Scale est. = 0.1347    n = 704
#Plot model
Hours.visreg=visreg(Hours.gam1,"Year", by="Management" , partial=FALSE, overlay=TRUE, trans=exp)
Weight.visreg=visreg(Weight.gam1,"Year", by="Management" , partial=FALSE, overlay=TRUE, trans=exp)
CPUE.visreg=visreg(CPUE.gam1,"Year", by="Management" , partial=FALSE, overlay=TRUE, trans=exp)

Model plots

visreg(Hours.gam1,"Year", by="Management" , partial=TRUE, overlay=TRUE, trans=exp)
visreg(Weight.gam1,"Year", by="Management" , partial=TRUE, overlay=TRUE, trans=exp)
visreg(CPUE.gam1,"Year", by="Management" , partial=TRUE, overlay=TRUE, trans=exp)

Model plots with Raw Data

#Plot model with Means overlaid, calculated outside of R
Means.alldata<- read.csv('reeffishMeans.csv', header= T, strip.white= T)
Means.alldata=Means.alldata%>% mutate(Year=as.integer(Year))

#Hours figure
Hours.reeffish.gam= ggplot(data=Means.alldata, aes(x=Year, y=Hours, fill=factor(Management), color=factor(Management)))+
  geom_ribbon(data= Hours.visreg$fit, aes(x=Year, ymin=visregLwr, ymax=visregUpr), alpha=0.5, linetype=1, size=0.2, colour=NA) +
  geom_line(data=Hours.visreg$fit, aes(x=Year,y=visregFit, group=Management, color=Management), lwd=1) +
  geom_pointrange(aes(ymin= Hours-HoursSE, ymax= Hours+HoursSE), width=.5, size=0.5, position=position_dodge(.05))+
  scale_colour_manual(values=c('deepskyblue4', 'coral2'))+
  scale_fill_manual(values=c('deepskyblue3', 'coral1'))+
  scale_y_continuous(name=expression('Fishing Effort (Hours)'), limits=c(2,8), breaks=c(2,4,6,8)) +
  scale_x_continuous(name=expression('Year'), limits=c(2009,2020), breaks=c(2010,2012,2014,2016,2018,2020))+
  theme_classic()+
  theme(axis.title.y= element_text(size=14), axis.text.y= element_text(size=12), axis.text.x= element_text(size=12), axis.title.x= element_text(size=14), legend.text=element_text(size=12), legend.title=element_text(size=14))

#Weight figure
Weight.reeffish.gam= ggplot(data=Means.alldata, aes(x=Year, y=Weight, fill=factor(Management), color=factor(Management)))+
  geom_ribbon(data= Weight.visreg$fit, aes(x=Year, ymin=visregLwr, ymax=visregUpr), alpha=0.5, linetype=1, size=0.2, colour=NA) +
  geom_line(data=Weight.visreg$fit, aes(x=Year,y=visregFit, group=Management, color=Management), lwd=1) +
  geom_pointrange(aes(ymin= Weight-WeightSE, ymax= Weight+WeightSE), width=.5, size=0.5, position=position_dodge(.05))+
  scale_colour_manual(values=c('deepskyblue4', 'coral2'))+
  scale_fill_manual(values=c('deepskyblue3', 'coral1'))+
  scale_y_continuous(name=expression('Catch (Kg)'), limits=c(0,8), breaks=c(2,4,6,8)) +
  scale_x_continuous(name=expression('Year'), limits=c(2009,2020), breaks=c(2010,2012,2014,2016,2018,2020))+
  theme_classic()+
  theme(axis.title.y= element_text(size=14), axis.text.y= element_text(size=12), axis.text.x= element_text(size=12), axis.title.x= element_text(size=14), legend.text=element_text(size=12), legend.title=element_text(size=14))

#CPUE figure
CPUE.reeffish.gam= ggplot(data=Means.alldata, aes(x=Year, y=CPUE, fill=factor(Management), color=factor(Management)))+
  geom_ribbon(data= CPUE.visreg$fit, aes(x=Year, ymin=visregLwr, ymax=visregUpr), alpha=0.5, linetype=1, size=0.2, colour=NA) +
  geom_line(data=CPUE.visreg$fit, aes(x=Year,y=visregFit, group=Management, color=Management), lwd=1) +
  geom_pointrange(aes(ymin= CPUE-CPUESE, ymax= CPUE+CPUESE), width=.5, size=0.5, position=position_dodge(.05))+
  scale_colour_manual(values=c('deepskyblue4', 'coral2'))+
  scale_fill_manual(values=c('deepskyblue3', 'coral1'))+
  scale_y_continuous(name=expression('CPUE'), limits=c(0,2), breaks=c(0,.5,1,1.5,2)) +
  scale_x_continuous(name=expression('Year'), limits=c(2009,2020), breaks=c(2010,2012,2014,2016,2018,2020))+
  theme_classic()+
  theme(axis.title.y= element_text(size=14), axis.text.y= element_text(size=12), axis.text.x= element_text(size=12), axis.title.x= element_text(size=14), legend.text=element_text(size=12), legend.title=element_text(size=14))

Hours.reeffish.gam
Weight.reeffish.gam
CPUE.reeffish.gam

Model plots with Means of data

1 个答案:

答案 0 :(得分:1)

此后,我发现了问题所在。由于对数据进行了log(x + 1)转换,但仅通过trans = exp进行了反向转换,因此模型预测全部归一。对于CPUE来说,这只是明显的不同,因为它们的单位要小得多。