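I am trying to do multiclass text classification with xgboost in Python (the sklearn wrapper), but it sometimes fails with an error saying the feature names do not match. The odd thing is that it does work some of the time (maybe 1 run in 4), but that unpredictability makes it hard to rely on this solution right now, even though it shows encouraging results without any real preprocessing.
I have included some illustrative sample data in the code, similar to what I am using. My current code is as follows:
Updated the code to reflect maxymoo's suggestion: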
import xgboost as xgb
import numpy as np
# sklearn.cross_validation exists only in scikit-learn < 0.20
from sklearn.cross_validation import KFold, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(31337)

y = np.array([0, 1, 2, 1, 0, 3, 1, 2, 3, 0])
X = np.array(['milk honey bear bear honey tigger',
              'tom jerry cartoon mouse cat cat WB',
              'peppa pig mommy daddy george peppa pig pig',
              'cartoon jerry tom silly',
              'bear honey hundred year woods',
              'ben holly elves fairies gaston fairy fairies castle king',
              'tom and jerry mouse WB',
              'peppa pig daddy pig rebecca rabit',
              'elves ben holly little kingdom king big people',
              'pot pot pot pot jar winnie pooh disney tigger bear'])

# The vectorizer and the classifier are fit together on each training fold
xgb_model = make_pipeline(CountVectorizer(), xgb.XGBClassifier())

kf = KFold(y.shape[0], n_folds=2, shuffle=True, random_state=rng)
for train_index, test_index in kf:
    xgb_model.fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    accuracy = accuracy_score(actuals, predictions)
    print(accuracy)
The error I tend to get is as follows:
Traceback (most recent call last):
  File "main.py", line 95, in <module>
    predictions = xgb_model.predict(X[test_index])
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/sklearn.py", line 465, in predict
    ntree_limit=ntree_limit)
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 939, in predict
    self._validate_features(data)
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 1179, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24']
expected f26, f25 in input data
Any pointers would be greatly appreciated!
Answer 0: (score: 0)
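You need to make sure that you only score the model on features it was trained on. The usual way to do this is to use a Pipeline to package the vectorizer and the model together. That way they are trained at the same time, and if a new feature shows up in the test data, the vectorizer simply ignores it. Also note that you don't need to recreate the model at each stage of cross-validation; you can initialise it once and refit it on each fold: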
from sklearn.pipeline import make_pipeline

# xgb, CountVectorizer, kf, X, y and accuracy_score are as defined in the question
xgb_model = make_pipeline(CountVectorizer(), xgb.XGBClassifier())

for train_index, test_index in kf:
    xgb_model.fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    accuracy = accuracy_score(actuals, predictions)
    print(accuracy)
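
To see what the vectorizer part of this is doing, here is a minimal sketch that leaves xgboost out entirely. It assumes one possible source of a column mismatch, namely a vectorizer fit separately on the training and test text, which may or may not be what happened in the question. The two small document lists are shortened from the sample data and are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative splits, shortened from the sample data in the question
train_docs = ["tom jerry cartoon mouse cat", "peppa pig mommy daddy george"]
test_docs = ["tom and jerry mouse WB", "peppa pig rebecca rabit"]

# Fitting a separate vectorizer on each split gives each split its own
# vocabulary, so the two matrices need not have the same number of columns.
n_train_cols = CountVectorizer().fit_transform(train_docs).shape[1]
n_test_cols = CountVectorizer().fit_transform(test_docs).shape[1]

# Fitting once on the training text and reusing that vectorizer keeps the
# columns fixed; words unseen during fit ("and", "wb", ...) are dropped.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

print(n_train_cols, n_test_cols)     # column counts differ between the splits
print(X_train.shape, X_test.shape)   # same number of columns in both
print("rebecca" in vectorizer.vocabulary_)  # unseen test word is not a feature
```

A pipeline does the second variant for you: `fit` calls `fit_transform` on the vectorizer and `predict` calls only `transform`, so the classifier always receives the column layout it was trained on.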