Question

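I am trying to run xgboost, using Python, on a classification problem. My data is in a numpy matrix X (rows = observations & columns = features) and the labels are in a numpy array y. Because my data is sparse, I would like to run it on a sparse version of X, but I seem to be missing something, because an error occurs.

Here is what I do: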

# Library import

import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from scipy.sparse import csr_matrix

# Converting to sparse data and running xgboost

X_csr = csr_matrix(X)
xgb1 = XGBClassifier()
xgtrain = xgb.DMatrix(X_csr, label = y )      #to work with the xgb format
xgtest = xgb.DMatrix(Xtest_csr)
xgb1.fit(xgtrain, y, eval_metric='auc')
dtrain_predictions = xgb1.predict(xgtest)

etc.

Now I get an error when trying to fit the classifier:

File ".../xgboost/python-package/xgboost/sklearn.py", line 432, in fit
self._features_count = X.shape[1]

AttributeError: 'DMatrix' object has no attribute 'shape'

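I have spent a while looking for where this could come from, and I believe it has to do with the sparse format I want to use. But what exactly it is, and how I can fix it, I have no idea.

Any help or comments are welcome! Thank you very much.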

Answer 1

You are using the xgboost scikit-learn API (http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn), so you do not need to convert your data to a DMatrix to fit the XGBClassifier(). Just removing the line

xgtrain = xgb.DMatrix(X_csr, label = y )

should work:

type(X_csr) #scipy.sparse.csr.csr_matrix
type(y) #numpy.ndarray
xgb1 = xgb.XGBClassifier()
xgb1.fit(X_csr, y)

Output:

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
   gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
   min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
   objective='binary:logistic', reg_alpha=0, reg_lambda=1,
   scale_pos_weight=1, seed=0, silent=True, subsample=1)

Answer 2

I prefer to use the native XGBoost training wrapper rather than the XGBoost sklearn wrapper. You can create the classifier as follows:

params = {
    # I'm assuming you are doing binary classification
    'objective': 'binary:logistic',
    # xgb.train reads the evaluation metric from params;
    # the metrics= keyword belongs to xgb.cv, not xgb.train
    'eval_metric': 'auc',
    # any other training params here
    # full parameter list here https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
}
booster = xgb.train(params, xgtrain)

This API also has built-in cross-validation, xgb.cv, which works much better with XGBoost.

https://xgboost.readthedocs.io/en/latest/get_started/index.html#python

More examples here: https://github.com/dmlc/xgboost/tree/master/demo/guide-python

Hope this helps.

Answer 3

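X_csr = csr_matrix(X) has many of the same attributes as X, including .shape. But it is not a subclass of the numpy array, and it is not a drop-in replacement. Code has to be sparse-aware. sklearn qualifies; in fact, it adds a number of its own fast sparse utility functions.

But I do not know how well xgb handles sparse matrices, nor how it plays with sklearn.

Assuming the problem lies with xgtrain, you need to look at its type and attributes. How does it compare with the one made with xgb.DMatrix(X, label = y )?

If you want help from someone who is not an xgboost user, you will have to provide more information about the objects in your code.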
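The kind of inspection suggested above can be sketched with numpy and scipy alone. The small matrix below is made up for illustration; the same checks (type, shape, subclass test) can be run on any object in the question's code.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

X = np.array([[0.0, 1.5, 0.0],
              [2.0, 0.0, 0.0]])
X_csr = csr_matrix(X)

# The sparse matrix mirrors the dense array's shape...
same_shape = X_csr.shape == X.shape
# ...but it is not an ndarray, so code must handle it explicitly.
is_ndarray = isinstance(X_csr, np.ndarray)
is_sparse = issparse(X_csr)
# Converting back to dense recovers the original values.
round_trip = np.array_equal(X_csr.toarray(), X)

print(type(X_csr).__name__, X_csr.shape, X_csr.nnz)
print(same_shape, is_ndarray, is_sparse, round_trip)
```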

Answer 4

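The problem occurs because DMatrix.num_col() only returns the number of non-zero columns in a sparse matrix.
Use scipy.sparse.coo_matrix.tocsc to convert the matrix to Compressed Sparse Column format.
See http://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543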
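The conversion mentioned above can be sketched with scipy alone. This example uses a made-up matrix whose last column is entirely zero, and shows that the CSC matrix keeps its declared shape even though one column stores no entries. It does not call xgboost, so it makes no claim about what DMatrix.num_col() reports in any particular version.

```python
import numpy as np
from scipy.sparse import coo_matrix

# A 3x4 matrix whose last column has no stored entries.
rows = np.array([0, 1, 2])
cols = np.array([0, 1, 2])
data = np.array([1.0, 2.0, 3.0])
X_coo = coo_matrix((data, (rows, cols)), shape=(3, 4))

# Convert to Compressed Sparse Column format.
X_csc = X_coo.tocsc()

# indptr has one slot per column; consecutive differences give the
# number of stored entries in each column.
entries_per_column = np.diff(X_csc.indptr)
n_nonempty_columns = int((entries_per_column > 0).sum())

print(X_csc.shape, entries_per_column, n_nonempty_columns)
```

Comparing the declared column count with the count of non-empty columns is a quick way to spot the kind of mismatch this answer describes between training and test matrices.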

XGBoost and sparse matrices