I'm trying to train models in batches to reduce memory usage. Here is my example of incremental training for gradient boosting in xgboost: it uses xgb_model to train in batches. However, a model trained this way performs only as well as a model trained on a single batch. How can I reduce the error introduced by incremental training?

My incremental training:
import numpy as np
import xgboost as xgb


def xgb_native_batch(batch_size=100):
    """Train in batches that update the same model"""
    batches = int(np.ceil(len(y_train) / batch_size))
    dtrain = xgb.DMatrix(data=X_train, label=y_train)

    if XGB_MODEL_FILE:
        # Create an initial model (zero boosting rounds) and save it to disk
        bst = xgb.train(
            params=xgb_train_params,
            dtrain=dtrain,
            num_boost_round=0
        )  # type: Booster
        bst.save_model(XGB_MODEL_FILE)
    else:
        # OR just keep the Booster in memory, starting from scratch
        bst = None

    for i in range(batches):
        start = i * batch_size
        end = start + batch_size
        dtrain = xgb.DMatrix(X_train[start:end, :], y_train[start:end])
        # Continue boosting from the previous model (file or in-memory Booster)
        bst = xgb.train(
            dtrain=dtrain,
            params=xgb_train_params,
            xgb_model=XGB_MODEL_FILE or bst
        )  # type: Booster
        if XGB_MODEL_FILE:
            bst.save_model(XGB_MODEL_FILE)

    dtest = xgb.DMatrix(data=X_test, label=y_test)
    pr_y_test_hat = bst.predict(dtest)
    return pr_y_test_hat
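The snippet depends on module-level names (X_train, y_train, X_test, y_test, xgb_train_params, XGB_MODEL_FILE) that are defined elsewhere. A minimal, hypothetical setup that makes it runnable, with dataset and parameter values that are my own assumptions, could look like this:

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical data and parameters; only the names match the snippet above.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
xgb_train_params = {"objective": "binary:logistic", "eval_metric": "auc"}
XGB_MODEL_FILE = None  # or e.g. "incremental_model.json" to round-trip via disk

pr_y_test_hat = xgb_native_batch(batch_size=100)
print("AUC:", roc_auc_score(y_test, pr_y_test_hat))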
Tests were run on four datasets. I created these models:

- xgb_native_bulk is the reference model, trained on all the data at once.
- xgb_native_bulk_<N> is a model trained on a single subsample of size N.
- xgb_native_batch_<N> is a model trained sequentially on all the data, split into mini-batches of size N (continual learning through model updates); a sketch of how the bulk baselines might be set up follows the metric tables.

The metrics:
make_classification: binary, N=3750
========================================
accuracy_score aurocc
algorithm
xgb_native_bulk 0.8624 0.933398
xgb_native_bulk_100 0.6192 0.669542
xgb_native_batch_100 0.6368 0.689123
xgb_native_bulk_500 0.7440 0.837590
xgb_native_batch_500 0.7528 0.829661
xgb_native_bulk_1000 0.7944 0.880586
xgb_native_batch_1000 0.8048 0.886607
load_breast_cancer: binary, N=426
========================================
accuracy_score aurocc
algorithm
xgb_native_bulk 0.958042 0.994902
xgb_native_bulk_100 0.930070 0.986037
xgb_native_batch_100 0.965035 0.989805
xgb_native_bulk_500 0.958042 0.994902
xgb_native_batch_500 0.958042 0.994902
xgb_native_bulk_1000 0.958042 0.994902
xgb_native_batch_1000 0.958042 0.994902
make_regression: reg, N=3750
========================================
mse
algorithm
xgb_native_bulk 5.513056e+04
xgb_native_bulk_100 1.209782e+05
xgb_native_batch_100 7.872892e+07
xgb_native_bulk_500 8.694831e+04
xgb_native_batch_500 1.150160e+05
xgb_native_bulk_1000 6.953936e+04
xgb_native_batch_1000 5.060867e+04
load_boston: reg, N=379
========================================
mse
algorithm
xgb_native_bulk 15.910990
xgb_native_bulk_100 25.160251
xgb_native_batch_100 16.931899
xgb_native_bulk_500 15.910990
xgb_native_batch_500 15.910990
xgb_native_bulk_1000 15.910990
xgb_native_batch_1000 15.910990
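The bulk baselines in the tables above are trained with a single xgb.train call. A minimal sketch, using the same module-level names as before (the function name, subsample handling, and round count here are illustrative assumptions, not the exact code behind the tables):

def xgb_native_bulk(subsample_size=None):
    """Train one model in a single call, on all data or on a subsample of size N."""
    if subsample_size:
        X_sub, y_sub = X_train[:subsample_size, :], y_train[:subsample_size]
    else:
        X_sub, y_sub = X_train, y_train
    dtrain = xgb.DMatrix(data=X_sub, label=y_sub)
    bst = xgb.train(params=xgb_train_params, dtrain=dtrain)  # default 10 rounds
    dtest = xgb.DMatrix(data=X_test, label=y_test)
    return bst.predict(dtest)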
The problem is that incremental learning performs poorly on long and wide datasets. For example, on the classification problem:
accuracy_score aurocc
algorithm
xgb_native_bulk 0.8624 0.933398
xgb_native_bulk_100 0.6192 0.669542
xgb_native_batch_100 0.6368 0.689123
There is no difference between a model trained on 100 rows at once and a model trained on all 3750 rows in batches of 100. Both are far behind the reference model trained on the full 3750 rows at once.
Answer (score: 0)
"Continued training" in XGBoost means continuing the boosting rounds, as demonstrated, for example, in the unit tests. Even when xgb_model is specified, those tests use the entire dataset, so the error of the "full" model ends up equal to the error of the incrementally trained one. When the model is instead updated on only a subset of the data, it can be about as bad as if the previous training rounds had never happened.
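To illustrate, here is a minimal sketch (dataset, split, and parameters are assumptions, not taken from the test suite) comparing a bulk model, continued training on the same full data, and continued training where the extra rounds only see a small subset:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dtest = xgb.DMatrix(X_te, label=y_te)
params = {"objective": "binary:logistic", "eval_metric": "auc"}

# Reference: 20 boosting rounds in a single call on the full data.
bulk = xgb.train(params, dtrain, num_boost_round=20)

# Continued training on the SAME full data: 10 rounds, then 10 more.
# This is the scenario the unit tests cover.
cont = xgb.train(params, dtrain, num_boost_round=10)
cont = xgb.train(params, dtrain, num_boost_round=10, xgb_model=cont)

# Continued training where the extra rounds only see 100 rows:
# the added trees are fit to that subset alone.
dsub = xgb.DMatrix(X_tr[:100, :], label=y_tr[:100])
part = xgb.train(params, dtrain, num_boost_round=10)
part = xgb.train(params, dsub, num_boost_round=10, xgb_model=part)

for name, bst in [("bulk", bulk), ("continued-full", cont), ("continued-subset", part)]:
    print(name, roc_auc_score(y_te, bst.predict(dtest)))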
Memory-saving incremental training is discussed under the name "external memory". More generally, the FAQ covers working with large datasets here.
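For completeness, a minimal sketch of the external-memory mode mentioned above; the file name is a placeholder, and the exact API depends on the xgboost version (older releases enable it via a cache suffix on the data path, newer releases via xgb.DataIter):

import xgboost as xgb

# Placeholder path: a libsvm-format training file too large to load at once.
# The '#dtrain.cache' suffix asks xgboost to stream the data through an
# on-disk cache instead of holding it all in memory (pre-1.5 style API).
dtrain = xgb.DMatrix("train.libsvm?format=libsvm#dtrain.cache")

params = {"objective": "binary:logistic", "tree_method": "hist"}
bst = xgb.train(params, dtrain, num_boost_round=50)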