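I am using H2O 3.10.4.1.
I am trying to fit a Bernoulli model with GBM, starting from preliminary predictions that come from another model, and the likelihood I get is worse than that of the starting predictions. I was able to reproduce this with the Titanic data.
I can do what I want with R's gbm. R's gbm.fit expects the offset on the link scale, and it is unrestricted: it can be a very large positive or a very large negative value.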
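However, when I try to do the same in H2O GBM, it throws an error: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_model_R_1489164084643_3568. Details: ERRR on field: _offset_column: Offset cannot be larger than 1 for Bernoulli distribution.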
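My Jupyter notebook is here: Github
Update: I was able to use the offset, but only on the part of the data frame where ProbabilityLink is less than 1, because H2O complains otherwise. See cells 65-68 in the notebook.
I believe this is a bug in H2O. They should remove the requirement that the offset be less than 1 for Bernoulli. It can be any value, and then it should work fine.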
Answer 0 (score: 1)
Update
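For older versions of H2O (3.10.2 or lower), you have to use values less than 1 in H2O gbm's offset_column with a Bernoulli distribution. For newer versions, however, you can pass in any value. In your case, with a Bernoulli distribution, one way to create the offset column is to use the predicted logit values from the previous model (as you said in the comments).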
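The logit conversion mentioned above can be sketched in base R. This is only an illustration: the probabilities in `p_prev` are made-up values standing in for the previous model's predictions, and the clipping threshold `eps` is an arbitrary assumption to keep the logit finite.

```r
# Hypothetical predicted probabilities from an earlier model (illustrative values only)
p_prev <- c(0.02, 0.35, 0.50, 0.90, 0.999)

# Clip away from 0 and 1 so the logit stays finite (eps is an arbitrary choice)
eps <- 1e-6
p_clipped <- pmin(pmax(p_prev, eps), 1 - eps)

# Logit (log-odds) transform, log(p / (1 - p)); qlogis() is the base R logistic quantile function
offset_logit <- qlogis(p_clipped)

# The offsets are on the link scale and are not confined to [0, 1]
print(round(offset_logit, 3))

# plogis() is the inverse link and recovers the original probabilities
stopifnot(isTRUE(all.equal(plogis(offset_logit), p_clipped)))
```

The resulting vector is what would go into the offset column, for example via as.h2o() as in the examples further down.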
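Here is how the gbm offset column works: the offset is a per-row "bias value" that is used during model training. For Gaussian distributions, the offset can be seen as a simple correction to the response (y) column. Instead of learning to predict the response itself (y-row), the model learns to predict the response's (row) offset. For other distributions, the offset correction is applied in the linearized space, before the inverse link function is applied to get the actual response values. This option is not applicable to multinomial distributions.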
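As a rough sketch of the description above (not H2O's internal code), the offset can be thought of as being added to the model's contribution before the inverse link is applied. The vectors `offset` and `f_x` below are hypothetical numbers chosen only to show the arithmetic.

```r
# Hypothetical per-row offsets and per-row model contributions (illustrative values only)
offset <- c(-2.0, 0.0, 1.5)
f_x    <- c(0.3, -0.4, 0.2)

# Gaussian (identity link): the offset is a direct correction on the response scale
pred_gaussian <- offset + f_x

# Bernoulli (logit link): the offset is added in the linearized (log-odds) space,
# then the inverse link plogis() maps the sum back to a probability
pred_bernoulli <- plogis(offset + f_x)

print(pred_gaussian)
print(round(pred_bernoulli, 3))
```

Because the inverse link squashes any real number into (0, 1), an offset on the log-odds scale has no natural upper bound of 1, which is the point raised in the question.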
Here is an example of how to use this parameter on a toy dataset.
(Bernoulli distribution example)
library(h2o)
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])
# create a new offset column with a constant value (0.5 for every row), purely for illustration
cars["offset"] <- as.h2o(rep(.5, dim(cars)[1]))
# set the predictor names and the response column name
predictors <- c("displacement","power","weight","acceleration","year")
response <- "economy_20mpg"
# split into train and validation sets
cars.split <- h2o.splitFrame(data = cars,ratios = 0.8, seed = 1234)
train <- cars.split[[1]]
valid <- cars.split[[2]]
# try using the `offset_column` parameter:
# train your model, specifying offset_column along with training_frame and validation_frame
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, offset_column = "offset",
                    validation_frame = valid, seed = 1234)
# print the auc for your model
print(h2o.auc(cars_gbm, valid = TRUE))
Gaussian example (using this option makes more sense here)
library(h2o)
h2o.init()
# import the boston dataset:
# this dataset looks at features of the boston suburbs and predicts median housing prices
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Housing
boston <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
# set the predictor names and the response column name
predictors <- colnames(boston)[1:13]
# set the response column to "medv", the median value of owner-occupied homes in $1000's
response <- "medv"
# convert the chas column to a factor (chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
boston["chas"] <- as.factor(boston["chas"])
# create a new offset column by taking the log of the response column
boston["offset"] <- log(boston["medv"])
# split into train and validation sets
boston.splits <- h2o.splitFrame(data = boston, ratios = .8, seed = 1234)
train <- boston.splits[[1]]
valid <- boston.splits[[2]]
# try using the `offset_column` parameter:
# train your model, where you specify the offset_column
boston_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
                      validation_frame = valid,
                      offset_column = "offset",
                      seed = 1234)
# print the mse for validation set
print(h2o.mse(boston_gbm, valid = TRUE))