Reduce error rates from incremental training in xgboost in Python

Date: 2018-01-21 11:42:25

Tags: python memory machine-learning xgboost

I'm trying to train models in batches to improve memory usage.

Here's my example of incremental training for gradient boosting in xgboost:

It trains in batches using xgb_model. However, a model produced by this incremental training performs only about as well as a model trained on a single batch.

How can I reduce the error introduced by incremental training?

Details

My incremental training:

import numpy as np
import xgboost as xgb


def xgb_native_batch(batch_size=100):
    """Train in batches that update the same model"""

    batches = int(np.ceil(len(y_train) / batch_size))

    dtrain = xgb.DMatrix(data=X_train, label=y_train)
    if XGB_MODEL_FILE:
        # Initialise an empty model (0 boosting rounds) and save it,
        # so every batch can resume training from the file
        bst = xgb.train(
            params=xgb_train_params,
            dtrain=dtrain,
            num_boost_round=0
        )  # type: Booster
        bst.save_model(XGB_MODEL_FILE)
    else:
        # OR start with no model and keep the Booster in memory between batches
        bst = None

    for i in range(batches):

        start = i * batch_size
        end = start + batch_size
        dtrain = xgb.DMatrix(X_train[start:end, :], y_train[start:end])

        # Continue boosting from the model trained on the previous batches
        bst = xgb.train(
            dtrain=dtrain,
            params=xgb_train_params,
            xgb_model=XGB_MODEL_FILE or bst
        )  # type: Booster

        if XGB_MODEL_FILE:
            bst.save_model(XGB_MODEL_FILE)

    dtest = xgb.DMatrix(data=X_test, label=y_test)
    pr_y_test_hat = bst.predict(dtest)

    return pr_y_test_hat
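
The snippet relies on globals (X_train, y_train, X_test, y_test, xgb_train_params, XGB_MODEL_FILE) that the post does not show. A minimal, hypothetical setup that would make it runnable might look like this; the dataset, split and parameters are placeholders, not the ones used in the tests below:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data and parameters (assumed, not from the original post)
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

xgb_train_params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.3}
XGB_MODEL_FILE = None  # or e.g. 'incremental.model' to resume from disk between batches

pr_y_test_hat = xgb_native_batch(batch_size=100)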

Tests

Tests were run on four datasets. I built these models:

  • xgb_native_bulk is the reference model, trained on all of the data at once.
  • xgb_native_bulk_<N> is a model trained on a single subsample of size N.
  • xgb_native_batch_<N> is a model trained sequentially on all of the data, split into batches of size N (continued learning via model updates); a sketch of the comparison follows this list.
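
A rough sketch of how the bulk variants might be built for that comparison; xgb_native_bulk here is a hypothetical counterpart to the batch function (the original post only shows the latter), reusing the same globals:

def xgb_native_bulk(n_rows=None):
    """Train once on the first n_rows of the training data (all rows if None)."""
    X = X_train if n_rows is None else X_train[:n_rows, :]
    y = y_train if n_rows is None else y_train[:n_rows]
    dtrain = xgb.DMatrix(data=X, label=y)
    bst = xgb.train(params=xgb_train_params, dtrain=dtrain)
    dtest = xgb.DMatrix(data=X_test, label=y_test)
    return bst.predict(dtest)

# Reference, subsample and incremental predictions for each size N
predictions = {'xgb_native_bulk': xgb_native_bulk()}
for n in (100, 500, 1000):
    predictions['xgb_native_bulk_%d' % n] = xgb_native_bulk(n_rows=n)
    predictions['xgb_native_batch_%d' % n] = xgb_native_batch(batch_size=n)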

Metrics:

make_classification: binary, N=3750
========================================
                       accuracy_score    aurocc
algorithm                                      
xgb_native_bulk                0.8624  0.933398
xgb_native_bulk_100            0.6192  0.669542
xgb_native_batch_100           0.6368  0.689123
xgb_native_bulk_500            0.7440  0.837590
xgb_native_batch_500           0.7528  0.829661
xgb_native_bulk_1000           0.7944  0.880586
xgb_native_batch_1000          0.8048  0.886607

load_breast_cancer: binary, N=426
========================================
                       accuracy_score    aurocc
algorithm                                      
xgb_native_bulk              0.958042  0.994902
xgb_native_bulk_100          0.930070  0.986037
xgb_native_batch_100         0.965035  0.989805
xgb_native_bulk_500          0.958042  0.994902
xgb_native_batch_500         0.958042  0.994902
xgb_native_bulk_1000         0.958042  0.994902
xgb_native_batch_1000        0.958042  0.994902

make_regression: reg, N=3750
========================================
                                mse
algorithm                          
xgb_native_bulk        5.513056e+04
xgb_native_bulk_100    1.209782e+05
xgb_native_batch_100   7.872892e+07
xgb_native_bulk_500    8.694831e+04
xgb_native_batch_500   1.150160e+05
xgb_native_bulk_1000   6.953936e+04
xgb_native_batch_1000  5.060867e+04

load_boston: reg, N=379
========================================
                             mse
algorithm                       
xgb_native_bulk        15.910990
xgb_native_bulk_100    25.160251
xgb_native_batch_100   16.931899
xgb_native_bulk_500    15.910990
xgb_native_batch_500   15.910990
xgb_native_bulk_1000   15.910990
xgb_native_batch_1000  15.910990

The problem is that incremental learning performs poorly on the long and wide datasets. For example, the classification problem:

                       accuracy_score    aurocc
algorithm                                      
xgb_native_bulk                0.8624  0.933398
xgb_native_bulk_100            0.6192  0.669542
xgb_native_batch_100           0.6368  0.689123

There is no difference between a model trained on 100 rows at once and a model trained on 3750 rows in batches of 100. Both fall far short of the reference model trained on all 3750 rows at once.


1 answer:

Answer 0: (score: 0)

XGBoost needs the whole dataset for continued training

"Continued training" in XGBoost means continuing, for example, the boosting rounds, as shown in its unit tests:

Even when xgb_model is specified, those tests use the entire dataset, and the error rate of the "full" model then equals that of the incrementally trained one.
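
A minimal sketch of that pattern (not the actual test code, which the original answer linked to), reusing the globals assumed above: call xgb.train again with xgb_model set, on the same full training matrix, to add more boosting rounds.

dtrain_full = xgb.DMatrix(data=X_train, label=y_train)

# First 10 boosting rounds
bst = xgb.train(params=xgb_train_params, dtrain=dtrain_full, num_boost_round=10)

# 10 more rounds, continuing from bst but still on the FULL data;
# this matches training 20 rounds in one go, which is what the tests check
bst = xgb.train(params=xgb_train_params, dtrain=dtrain_full,
                num_boost_round=10, xgb_model=bst)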

When the model is updated on only a subset of the data, it ends up as bad as if the earlier training rounds had never happened.

Memory-saving incremental training is discussed under the name "external memory". More generally, the FAQ covers the problems of large datasets here.
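
For reference, a sketch of external-memory usage as documented at the time: the data sits in a libsvm-format file, and appending "#dtrain.cache" to the path tells XGBoost to stream it from disk through a cache instead of holding it all in RAM (the file name below is a placeholder):

# Placeholder file in libsvm format; '#dtrain.cache' requests on-disk caching
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')
bst = xgb.train(params=xgb_train_params, dtrain=dtrain, num_boost_round=100)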