在广义线性模型的stepAIC中,逻辑下标太长

时间:2018-10-18 17:32:57

标签: r glm

我正尝试在R中开发一个模型,该模型说明使用从软件包MASS中使用stepAIC函数选择的预测变量子集的函数所收获的(眼扭曲)成眼的数量。为此,请参见下面的数据框结构:

> str(dat)
'data.frame':   92036 obs. of  47 variables:
 $ month  : Factor w/ 7 levels "4","5","6","7",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ day    : int  1 2 13 13 13 15 16 16 16 16 ...
 $ year   : Factor w/ 29 levels "1989","1990",..: 28 28 28 28 28 28 28 28 28 28 ...
 $ area   : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ loc    : int  42 13 42 42 31 42 31 42 42 12 ...
 $ dow    : int  5 6 3 3 3 5 6 6 6 6 ...
 $ type   : Factor w/ 3 levels "1","2","3": 1 3 2 3 1 3 1 1 1 1 ...
 $ ang    : int  3 8 3 6 3 6 2 2 3 3 ...
 $ grid   : int  903 801 903 803 903 804 903 903 903 801 ...
 $ zone   : int  18 10 18 16 18 16 14 12 14 14 ...
 $ seek   : int  334 334 334 334 334 334 334 334 334 334 ...
 $ tis    : num  10 7 6 7.5 9 7 8 7.5 8 6.5 ...
 $ tif    : num  15.5 12 13.5 14 13 15.5 11.5 12 12 10.5 ...
 $ tii    : num  15.5 12.5 14 14.5 13.5 15.5 11.5 12 12 10.5 ...
 $ F1     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F2     : int  5 32 12 24 0 11 8 2 5 1 ...
 $ F3     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F4     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F5     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F6     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F7     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F8     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F9     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F10    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F11    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F12    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F13    : int  0 1 1 0 0 0 2 0 1 0 ...
 $ F14    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F15    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F16    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F17    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F18    : int  0 0 0 0 0 0 0 1 0 0 ...
 $ F19    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F20    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F21    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ F22    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ method : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 1 1 1 1 ...
 $ WDWE   : Factor w/ 2 levels "1","2": 1 2 1 1 1 1 2 2 2 2 ...
 $ anghrs : num  16.5 40 22.5 39 12 51 7 9 12 12 ...
 $ X      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ basin  : Factor w/ 3 levels "east central",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ totwal : int  5 33 13 24 0 11 10 2 6 1 ...
 $ time   : num  12.75 9.5 9.75 10.75 11 ...
 $ TOD    : Factor w/ 2 levels "AM","PM": 2 1 1 1 1 1 1 1 1 1 ...
 $ triplen: num  5.5 5 7.5 6.5 4 8.5 3.5 4.5 4 4 ...
 $ effort : num  16.5 40 22.5 39 12 51 7 9 12 12 ...
 $ WAE_POS: num  1 1 1 1 0 1 1 1 1 1 ...

我正在考虑的模型的主要影响如下:

f_main_effects <- as.formula(paste("totwal ~ year + month + TOD + ang + triplen + method + WDWE + basin + type + offset(log(effort))"))

有关模型和摘要,请参见下文:

> res_WAE_poiss <- glm(formula = f_main_effects, data = dat, family=poisson(link="log"))
Call:
glm(formula = f_main_effects, family = poisson(link = "log"), 
    data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-13.053   -2.354   -0.815    1.235   36.543  

Coefficients:
                    Estimate Std. Error  z value Pr(>|z|)    
(Intercept)       -0.7710820  0.0099014  -77.876  < 2e-16 ***
year1990          -0.3666092  0.0075276  -48.702  < 2e-16 ***
year1991          -0.5909582  0.0079259  -74.561  < 2e-16 ***
year1992          -0.2591255  0.0073414  -35.296  < 2e-16 ***
year1993          -0.0669383  0.0063155  -10.599  < 2e-16 ***
year1994          -0.3658504  0.0070433  -51.943  < 2e-16 ***
year1995          -0.2206199  0.0066573  -33.139  < 2e-16 ***
year1996          -0.1035683  0.0063993  -16.184  < 2e-16 ***
year1997          -0.4919732  0.0072490  -67.867  < 2e-16 ***
year1998           0.1029357  0.0065669   15.675  < 2e-16 ***
year1999          -0.4383308  0.0071890  -60.972  < 2e-16 ***
year2000          -0.4835465  0.0074930  -64.533  < 2e-16 ***
year2001          -0.2362845  0.0077348  -30.548  < 2e-16 ***
year2002          -0.4938362  0.0076850  -64.260  < 2e-16 ***
year2003          -0.3045370  0.0083361  -36.532  < 2e-16 ***
year2004          -0.1870754  0.0067665  -27.647  < 2e-16 ***
year2005           0.0084350  0.0074392    1.134    0.257    
year2006           0.1609717  0.0063515   25.344  < 2e-16 ***
year2007           0.0352623  0.0068963    5.113 3.17e-07 ***
year2008          -0.1079411  0.0077131  -13.995  < 2e-16 ***
year2009          -0.2403672  0.0082776  -29.038  < 2e-16 ***
year2016           0.2797565  0.0061268   45.661  < 2e-16 ***
year2017           0.7801561  0.0057861  134.834  < 2e-16 ***
month5             0.2492277  0.0071682   34.769  < 2e-16 ***
month6             0.5002337  0.0069455   72.022  < 2e-16 ***
month7             0.5930889  0.0070208   84.476  < 2e-16 ***
month8             0.4875154  0.0074904   65.086  < 2e-16 ***
month9             0.3895295  0.0086433   45.067  < 2e-16 ***
month10            0.1483472  0.0108386   13.687  < 2e-16 ***
TODPM             -0.1060913  0.0024875  -42.650  < 2e-16 ***
ang               -0.0328293  0.0008702  -37.726  < 2e-16 ***
triplen           -0.0924310  0.0006020 -153.527  < 2e-16 ***
method2            0.2659051  0.0027271   97.506  < 2e-16 ***
WDWE2             -0.0662235  0.0020664  -32.048  < 2e-16 ***
basinwest          0.3995765  0.0036221  110.316  < 2e-16 ***
basinwest central  0.1031291  0.0038891   26.518  < 2e-16 ***
type2              0.2337072  0.0032363   72.214  < 2e-16 ***
type3              0.5876307  0.0038143  154.060  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 865844  on 79916  degrees of freedom
Residual deviance: 602809  on 79879  degrees of freedom
  (12119 observations deleted due to missingness)
AIC: 863541

Number of Fisher Scoring iterations: 5

然后,我尝试使用stepAIC考虑二阶交互并限制其始终包含年份和精力来选择一组预测变量。

res_WAE_pois_step <-  stepAIC(glm(totwal ~ year + offset(log(effort)), data = dat, family=poisson(link="log")), scope=list(upper = totwal ~ (year + month + TOD + WDWE + method + basin + ang + triplen)^2 + offset(log(effort))), direction = "forward", trace=5)

这会导致以下错误:

Start:  AIC=1119511
totwal ~ year + offset(log(effort))

Error in x[good, , drop = FALSE] : (subscript) logical subscript too long

我不确定在给定情况下如何解释此错误。值得一提的是,同一脚本在共享相同结构但观测值较少的数据集上可以正常运行。有谁知道是什么原因导致此错误,或者对如何解决该问题有建议?请让我知道更多信息是否有用,但是我不能合法共享数据。

感谢您的时间和帮助。

更新: 问题已经解决。 WDWE值在2010-2015年是不适用的,那时我不应该这样做。我能够用完整的数据更新数据框,脚本再次运行良好。就是说,有人可以帮助我理解1)为什么glm函数会默默地删除那些年而不是抛出错误?和2)如何根据新信息理解从stepAIC收到的错误?这些错误是否也是这些年来默默放弃的结果?

0 个答案:

没有答案