I have a fairly small dataset: 15 columns, 3500 rows, and I consistently see that XGBoost in H2O trains better models than H2O AutoML. I am using H2O 3.26.0.2 and the Flow UI.
H2O XGBoost finishes in just a few seconds, while AutoML takes much longer (20 minutes) and always gives me worse performance.
I admit the dataset may not be perfect, but I expected AutoML with grid search to do as well as (or better than) plain H2O XGBoost. My thinking was that AutoML trains multiple XGBoost models and grid-searches their hyperparameters, so the results should be similar, right?
I use the same training dataset and the same response column for both AutoML and XGBoost.
The code for the XGBoost experiment is:
import csv
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator
h2o_frame = h2o.import_file(path="myFile.csv")
feature_columns = h2o_frame.columns
label_column = "responseColumn"
feature_columns.remove(label_column)
xgb = H2OXGBoostEstimator(nfolds=10, seed=1)
xgb.train(x=feature_columns, y=label_column, training_frame=h2o_frame)
# now export metrics to file
MRD = xgb.mean_residual_deviance()
RMSE = xgb.rmse()
MSE = xgb.mse()
MAE = xgb.mae()
RMSLE = xgb.rmsle()
header = ['model','mean_residual_deviance','rmse','mse','mae','rmsle']
with open('metrics.out', mode='w') as result_file:
    writer = csv.writer(result_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(header)
    writer.writerow(['H2O_XGBoost', MRD, RMSE, MSE, MAE, RMSLE])
The code for the AutoML experiment is:
import h2o
from h2o.automl import H2OAutoML
h2o_frame = h2o.import_file(path="myFile.csv")
feature_columns = h2o_frame.columns
label_column = "responseColumn"
feature_columns.remove(label_column)
aml = H2OAutoML(seed=1, nfolds=10, exclude_algos=["StackedEnsemble"], max_models=20)
aml.train(x=feature_columns, y=label_column, training_frame=h2o_frame)
# now export metrics to file
h2o.export_file(aml.leaderboard, "metrics.out", force = True, parts = 1)
I have tried different nfolds for AutoML, more models, and a larger number of early-stopping rounds. I also tried excluding every algorithm except XGBoost from AutoML, but I still get the same result.
Here are the differences in the results:
H2O XGBoost:
model xgboost-5a8f9766-940c-4e5c-b57d-62b186f4c058
model_checksum 7409831159060775248
frame train_set_v01.hex
frame_checksum 6864971999838167226
description ·
model_category Regression
scoring_time 1566296468447
predictions ·
MSE 252.265021
RMSE 15.882853
nobs 3476
custom_metric_name ·
custom_metric_value 0
r2 0.726871
mean_residual_deviance 252.265021
mae 10.709369
rmsle NaN
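As a quick sanity check on these numbers: for regression, RMSE is just the square root of MSE, and with the default Gaussian distribution the mean residual deviance equals the MSE, which is consistent with the values listed above:

```python
import math

mse = 252.265021   # MSE reported above
rmse = 15.882853   # RMSE reported above

# RMSE is the square root of MSE; the two reported values agree
print(abs(math.sqrt(mse) - rmse) < 1e-4)
```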
XGBoost native parameters for xgboost-5a8f9766-940c-4e5c-b57d-62b186f4c058:
name value
silent true
eta 0.3
colsample_bylevel 1
objective reg:linear
min_child_weight 1
nthread 8
seed -1058380797
max_depth 6
colsample_bytree 1
lambda 1
gamma 0
alpha 0
booster gbtree
grow_policy depthwise
nround 50
subsample 1
max_delta_step 0
tree_method auto
H2O AutoML (winning model):
model StackedEnsemble_AllModels_AutoML_20190819_235446
model_checksum -6727284429527535576
frame automl_training_train_set_v01.hex
frame_checksum 6864971999838167226
description ·
model_category Regression
scoring_time 1566256209073
predictions ·
MSE 332.146239
RMSE 18.224880
nobs 3476
custom_metric_name ·
custom_metric_value 0
r2 0.640383
mean_residual_deviance 332.146239
mae 12.927023
rmsle 1.225650
residual_deviance 1154540.326762
null_deviance 3210476.302359
AIC 30070.640602
null_degrees_of_freedom 3475
residual_degrees_of_freedom 3464
And the best-scoring XGBoost model from the same AutoML run (third on the leaderboard):
model XGBoost_grid_1_AutoML_20190819_235446_model_5
model_checksum 8047828446507408480
frame automl_training_train_set_v01.hex
frame_checksum 6864971999838167226
description ·
model_category Regression
scoring_time 1566255442068
predictions ·
MSE 616.910151
RMSE 24.837676
nobs 3476
custom_metric_name ·
custom_metric_value 0
r2 0.332068
mean_residual_deviance 616.910151
mae 17.442629
rmsle 1.325149
XGBoost native parameters (for XGBoost_grid_1_AutoML_20190819_235446_model_5 from AutoML):
name value
silent true
normalize_type tree
eta 0.05
objective reg:linear
colsample_bylevel 0.8
nthread 8
seed 940795529
min_child_weight 15
rate_drop 0
one_drop 0
sample_type uniform
max_depth 20
colsample_bytree 1
lambda 100
gamma 0
alpha 0.1
booster dart
grow_policy depthwise
skip_drop 0
nround 120
subsample 0.8
max_delta_step 0
tree_method auto
Answer (score: 1)
The problem here is that you're comparing the training metrics of your XGBoost model against the cross-validation (CV) metrics of the AutoML models.
The code you posted for your manual XGBoost model reports training metrics. If you want a fair comparison against the model performance from AutoML, you need to grab the CV metrics instead (the AutoML leaderboard reports CV metrics by default, and that's what your code exports).
Change this:
# now export metrics to file
MRD = xgb.mean_residual_deviance()
RMSE = xgb.rmse()
MSE = xgb.mse()
MAE = xgb.mae()
RMSLE = xgb.rmsle()
To this:
# now export metrics to file
MRD = xgb.mean_residual_deviance(xval=True)
RMSE = xgb.rmse(xval=True)
MSE = xgb.mse(xval=True)
MAE = xgb.mae(xval=True)
RMSLE = xgb.rmsle(xval=True)
Descriptions of these metric methods and what they return are in the Python module docs.
After making this change, you should see the issue resolved, with comparable performance between your manual XGBoost model and the AutoML models.
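The gap you observed is exactly what happens when a flexible model is scored on its own training data. A minimal, H2O-free sketch of the effect, using hypothetical toy data and a memorizing 1-nearest-neighbour "model": its training error is zero, while its cross-validated error is not.

```python
import random

random.seed(1)
# toy regression data: y = x + Gaussian noise (hypothetical, for illustration only)
data = [(float(i), float(i) + random.gauss(0, 1.0)) for i in range(30)]

def predict_1nn(train, x):
    # 1-nearest-neighbour "model": just returns the y of the closest training x
    return min(train, key=lambda p: abs(p[0] - x))[1]

# training metric: score on the same points the model memorised
train_rmse = (sum((predict_1nn(data, x) - y) ** 2 for x, y in data) / len(data)) ** 0.5

# leave-one-out cross-validation metric: score each point with it held out
loo_sq = [(predict_1nn(data[:i] + data[i + 1:], data[i][0]) - data[i][1]) ** 2
          for i in range(len(data))]
cv_rmse = (sum(loo_sq) / len(loo_sq)) ** 0.5

print(train_rmse)  # 0.0 -- the model memorised the training set
print(cv_rmse)     # strictly positive -- the honest error estimate
```

Comparing `train_rmse` from one model against `cv_rmse` from another makes the first model look artificially better, which is the same apples-to-oranges comparison as between the manual XGBoost training metrics and the AutoML leaderboard.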