Question

我正在尝试在python（sklearn版本）中使用xgboost执行多类文本分类，但有时它会错误地告诉我功能名称不匹配。奇怪的是，它确实有效（可能是4次中的1次），但不确定性使我现在很难依赖这个解决方案，即使它显示出令人鼓舞的结果，甚至没有做任何真正的预处理处理

我在代码中提供了一些类似于我正在使用的示例性示例数据。我目前的代码如下：

更新了反映maxymoo建议的代码

import xgboost as xgb
import numpy as np
from sklearn.cross_validation import KFold, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer

rng = np.random.RandomState(31337)    

y = np.array([0, 1, 2, 1, 0, 3, 1, 2, 3, 0])
X = np.array(['milk honey bear bear honey tigger',
          'tom jerry cartoon mouse cat cat WB',
          'peppa pig mommy daddy george peppa pig pig',
          'cartoon jerry tom silly',
          'bear honey hundred year woods',
          'ben holly elves fairies gaston fairy fairies castle king',
          'tom and jerry mouse WB',
          'peppa pig daddy pig rebecca rabit',
          'elves ben holly little kingdom king big people',
          'pot pot pot pot jar winnie pooh disney tigger bear'])

xgb_model = make_pipeline(CountVectorizer(), xgb.XGBClassifier())

kf = KFold(y.shape[0], n_folds=2, shuffle=True, random_state=rng)
for train_index, test_index in kf:
    xgb_model.fit(X[train_index],y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    accuracy = accuracy_score(actuals, predictions)
    print accuracy

我倾向于得到的错误如下：

Traceback (most recent call last):
  File "main.py", line 95, in <module>
    predictions = xgb_model.predict(X[test_index])
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/sklearn.py", line 465, in predict
    ntree_limit=ntree_limit)
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 939, in predict
    self._validate_features(data)
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 1179, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24']
expected f26, f25 in input data

任何指针都会非常感激！

Answer 1

您需要确保仅使用已经过训练的功能对模型进行评分。通常的方法是使用Pipeline将矢量化器和模型打包在一起。这样，它们将同时进行训练，如果在测试数据中遇到新特征，矢量化器将忽略它（同时请注意，您不需要在交叉的每个阶段重新创建模型 - 验证，你只需初始化一次，然后在每次折叠时重新设置）：

from sklearn.pipeline import make_pipeline    

xgb_model = make_pipeline(CountVectoriser(), xgb.XGBClassifier())
for train_index, test_index in kf:
    xgb_model.fit(X[train_index],y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    accuracy = accuracy_score(actuals, predictions)
    print accuracy

xgboost sklearn中的feature_names不匹配在多类文本分类期间

1 个答案: