How do I use an offset for exposure in a gbm model in R?

Asked: 2016-12-21 15:45:22

Tags: r poisson gbm offset

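I am trying to fit a gradient boosting machine (GBM) to insurance claims. The observations have unequal exposure, so I am trying to use an offset equal to the log of the exposure. I tried two different ways:

  1. Put an offset term in the formula. This resulted in nan for the train and validation deviance at every iteration.

  2. Use the offset parameter in the gbm function. This parameter is listed under gbm.more. This results in an error message about an unused argument.

I can't share my company's data, but I reproduced the problem using the Insurance data set in the MASS package. See the code and output below.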

    library(MASS)
    library(gbm)
    
    data(Insurance)
    
    # Try using offset in the formula.
    fm1 = formula(Claims ~ District + Group + Age + offset(log(Holders)))
    
    fitgbm1 = gbm(fm1, distribution = "poisson",
                  data = Insurance,
                  n.trees = 10,
                  shrinkage = 0.1,
                  verbose = TRUE)
    
    # Try using offset in the gbm statement.
    fm2 = formula(Claims ~ District + Group + Age)
    offset2 = log(Insurance$Holders)
    
    fitgbm2 = gbm(fm2, distribution = "poisson",
                  data = Insurance,
                  n.trees = 10,
                  shrinkage = 0.1,
                  offset = offset2,
                  verbose = TRUE)
    

Here is the output:

    > source('D:/Rprojects/auto_tutorial/rcode/example_gbm.R')
    Iter   TrainDeviance   ValidDeviance   StepSize   Improve
         1     -347.8959             nan     0.1000    0.0904
         2     -348.2181             nan     0.1000    0.0814
         3     -348.3845             nan     0.1000    0.0616
         4     -348.5424             nan     0.1000    0.0333
         5     -348.6732             nan     0.1000    0.0850
         6     -348.7744             nan     0.1000    0.0610
         7     -348.8795             nan     0.1000    0.0633
         8     -348.9132             nan     0.1000   -0.0109
         9     -348.9200             nan     0.1000   -0.0212
        10     -349.0271             nan     0.1000    0.0267
    
    Error in gbm(fm2, distribution = "poisson", data = Insurance, n.trees = 10,  : 
      unused argument (offset = offset2)
    

My question is: what am I doing wrong? Also, is there another way to do this? I noticed a weights parameter in the gbm function. Should I use that?
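For reference, the log-exposure offset described above follows the same convention as in an ordinary Poisson GLM, where log E[Claims] = log(Holders) + f(x). The sketch below is only a baseline on the same Insurance data and is not part of the original reproduction; the names glm_fit, expected_claims and claim_rate are illustrative.

```r
library(MASS)

data(Insurance)

# Poisson GLM with log link; offset(log(Holders)) fixes the coefficient
# of log exposure at 1, so the model describes claims per holder.
glm_fit <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
               family = poisson,
               data = Insurance)

# fitted() returns expected claim counts, with the offset included.
expected_claims <- fitted(glm_fit)

# Dividing by the exposure gives the modelled claim rate per holder.
claim_rate <- expected_claims / Insurance$Holders
```

Comparing a boosted model against a baseline like this can help confirm that the offset is being applied on the intended scale.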

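1 Answer:

Answer 0 (score: 1)

The first approach works if you specify a train.fraction less than 1. The default is 1, which means there is no validation set, so ValidDeviance is reported as nan.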

    library(MASS)
    library(gbm)

    data(Insurance)

    # Try using offset in the formula.
    fm1 = formula(Claims ~ District + Group + Age + offset(log(Holders)))

    fitgbm1 = gbm(fm1, distribution = "poisson",
                  data = Insurance,
                  n.trees = 10,
                  shrinkage = 0.1,
                  verbose = TRUE,
                  train.fraction = .75)

Result:

    Iter   TrainDeviance   ValidDeviance   StepSize   Improve
         1     -428.8293       -105.1735     0.1000    0.0888
         2     -429.0869       -105.3063     0.1000    0.0708
         3     -429.1805       -105.3941     0.1000    0.0486
         4     -429.3414       -105.4816     0.1000    0.0933
         5     -429.4934       -105.5432     0.1000    0.0566
         6     -429.6714       -105.5188     0.1000    0.1212
         7     -429.8470       -105.5200     0.1000    0.0833
         8     -429.9655       -105.6073     0.1000    0.0482
         9     -430.1367       -105.6003     0.1000    0.0473
        10     -430.2462       -105.6100     0.1000    0.0487
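One thing to watch when the offset goes in the formula: as far as I know, predict() on a gbm fit does not add the offset back to the predictions and issues a warning saying so. A minimal sketch of recovering expected claim counts, assuming the same model as above; the names fit, log_rate and expected_claims are illustrative, and the seed is only there to make the bagging reproducible.

```r
library(MASS)
library(gbm)

data(Insurance)
set.seed(1)  # gbm subsamples rows at each iteration (bag.fraction)

fit <- gbm(Claims ~ District + Group + Age + offset(log(Holders)),
           distribution = "poisson",
           data = Insurance,
           n.trees = 10,
           shrinkage = 0.1,
           train.fraction = 0.75)

# Predictions on the link (log) scale. These exclude the offset, so they
# are log claim rates per holder; the expected warning is suppressed.
log_rate <- suppressWarnings(
  predict(fit, newdata = Insurance, n.trees = 10, type = "link")
)

# Add the log exposure back by hand to get expected claim counts.
expected_claims <- exp(log_rate + log(Insurance$Holders))
```

Equivalently, exp(log_rate) * Insurance$Holders gives the same numbers.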