Question

I am using H2O 3.10.4.1.

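I am trying to fit a Bernoulli model with GBM, starting from initial predictions produced by another model, and the likelihood I get is worse than that of the starting predictions. I was able to reproduce this with the Titanic data.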

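I was able to do what I want with R's gbm. R's gbm.fit expects the offset on the link scale, where it is unrestricted and can take very large positive or very large negative values.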

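However, when I try to do the same thing in H2O GBM, it throws an error: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_model_R_1489164084643_3568. Details: ERRR on field: _offset_column: Offset cannot be larger than 1 for Bernoulli distribution.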

My Jupyter notebook is here: Github

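Update: I was able to use the offset, but only on a data frame where ProbabilityLink is less than 1, because H2O complains otherwise. See cells 65-68 in the notebook.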

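I believe this is a bug in H2O. They should remove the requirement that the offset for Bernoulli be less than 1. It can be any value, and then it should work fine.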

Answer 1

Update

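In older versions of H2O (3.10.2 or lower), the values in H2O gbm's offset_column must be less than 1 when the distribution is Bernoulli. In newer versions you can pass in any value. In your case, with a Bernoulli distribution, one way to create the offset column is to use the logit of the previous model's predictions (as you said in the comments).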
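As a rough sketch of that approach, using base R only: the probability vector `p_prev` and the clipping bound `eps` below are made up for illustration and are not taken from the notebook. The previous model's predicted probabilities can be moved onto the logit scale with `qlogis()` before they are attached as the offset column.

```r
# Hypothetical predicted probabilities from an earlier model (illustrative values)
p_prev <- c(0.02, 0.35, 0.50, 0.90, 1.00)

# Clip away from exactly 0 and 1 so the logit stays finite
# (eps is an arbitrary choice, not an H2O requirement)
eps <- 1e-6
p_clipped <- pmin(pmax(p_prev, eps), 1 - eps)

# logit(p) = log(p / (1 - p)); qlogis() is base R's logistic quantile function
offset_logit <- qlogis(p_clipped)

# Applying the inverse link (plogis) recovers the clipped probabilities
p_back <- plogis(offset_logit)

print(round(offset_logit, 3))
```

Logit values are not confined to any interval, so some of them will typically be larger than 1 in absolute value. That is the kind of offset the older versions rejected for the Bernoulli distribution.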

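Here is how the gbm offset column works: offsets are per-row "bias values" that are used during model training. For the Gaussian distribution, the offset can be seen as a simple correction to the response (y) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset correction is applied in the linearized space, before the inverse link function is applied to get the actual response values. This option is not applicable for the multinomial distribution.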
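To make the "linearized space" point concrete, here is a small numeric sketch in base R. The offsets and the tree contribution `f_x` are invented numbers, not output from H2O; the sketch only shows where the offset enters for each distribution.

```r
# Invented per-row values, for illustration only
offset <- c(-2.0, 0.0, 3.0)   # per-row offset, on the link scale
f_x    <- c( 0.5, -0.5, 1.0)  # hypothetical contribution learned by the trees

# Gaussian (identity link): the offset is added directly to the prediction
pred_gaussian <- offset + f_x

# Bernoulli (logit link): add on the logit scale, then apply the inverse link
pred_bernoulli <- plogis(offset + f_x)

print(pred_gaussian)
print(round(pred_bernoulli, 3))
```

Because the inverse link maps any real number into the interval (0, 1), an offset of 3.0 on the logit scale is perfectly meaningful for a Bernoulli model.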

Here is an example of how to use this parameter on a toy dataset:

(Bernoulli distribution example)

library(h2o)
h2o.init()

# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])

# create a new offset column holding the constant 0.5 in every row
cars["offset"] <- as.h2o(rep(.5, dim(cars)[1]))

# set the predictor names and the response column name
predictors <- c("displacement","power","weight","acceleration","year")
response <- "economy_20mpg"

# split into train and validation sets
cars.split <- h2o.splitFrame(data = cars,ratios = 0.8, seed = 1234)
train <- cars.split[[1]]
valid <- cars.split[[2]]

# try using the `offset_column` parameter:
# train on training_frame and evaluate on validation_frame
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, offset_column = "offset",
                  validation_frame = valid, seed = 1234)

# print the auc for your model
print(h2o.auc(cars_gbm, valid = TRUE))

Gaussian example (where using this option makes more sense)

library(h2o)
h2o.init()

# import the boston dataset:
# this dataset looks at features of the boston suburbs and predicts median housing prices
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Housing
boston <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")

# set the predictor names and the response column name
predictors <- colnames(boston)[1:13]
# set the response column to "medv", the median value of owner-occupied homes in $1000's
response <- "medv"

# convert the chas column to a factor (chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
boston["chas"] <- as.factor(boston["chas"])

# create a new offset column by taking the log of the response column
boston["offset"] <- log(boston["medv"])

# split into train and validation sets
boston.splits <- h2o.splitFrame(data =  boston, ratios = .8, seed = 1234)
train <- boston.splits[[1]]
valid <- boston.splits[[2]]

# try using the `offset_column` parameter:
# train your model, where you specify the offset_column
boston_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
               validation_frame = valid,
               offset_column = "offset",
               seed = 1234)

# print the mse for validation set
print(h2o.mse(boston_gbm, valid = TRUE))
