XGBoost test features do not match training features

Time: 2017-01-10 22:22:30

Tags: python machine-learning scikit-learn xgboost

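I seem to be misunderstanding how XGBoost is supposed to be used. Training my model appears to go fine, but when I apply it to a holdout set, the code complains that it is being given new variables.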

I am working on the now-closed Kaggle competition "Santander Customer Satisfaction". You can view the dataset here:

https://www.kaggle.com/c/santander-customer-satisfaction/data

Here is what my script prints when I run it:

[0] validation_0-auc:0.795057   validation_1-auc:0.791364
[1] validation_0-auc:0.804837   validation_1-auc:0.801891
[2] validation_0-auc:0.8113 validation_1-auc:0.809618
[3] validation_0-auc:0.813582   validation_1-auc:0.812916
[4] validation_0-auc:0.81944    validation_1-auc:0.817799
[5] validation_0-auc:0.822929   validation_1-auc:0.821458
[6] validation_0-auc:0.826112   validation_1-auc:0.82617
[7] validation_0-auc:0.830241   validation_1-auc:0.829682
[8] validation_0-auc:0.832868   validation_1-auc:0.832509
[9] validation_0-auc:0.835153   validation_1-auc:0.835
[10]    validation_0-auc:0.836704   validation_1-auc:0.834465
[11]    validation_0-auc:0.837821   validation_1-auc:0.834192
[12]    validation_0-auc:0.839605   validation_1-auc:0.834907
[13]    validation_0-auc:0.841186   validation_1-auc:0.836206
[14]    validation_0-auc:0.842604   validation_1-auc:0.836774
[15]    validation_0-auc:0.843251   validation_1-auc:0.837425
[16]    validation_0-auc:0.844243   validation_1-auc:0.837063
[17]    validation_0-auc:0.844695   validation_1-auc:0.837344
[18]    validation_0-auc:0.845865   validation_1-auc:0.838409
[19]    validation_0-auc:0.846729   validation_1-auc:0.837329
[20]    validation_0-auc:0.847179   validation_1-auc:0.836978
[21]    validation_0-auc:0.847953   validation_1-auc:0.83628
[22]    validation_0-auc:0.848195   validation_1-auc:0.836093
[23]    validation_0-auc:0.848502   validation_1-auc:0.836251
[24]    validation_0-auc:0.848797   validation_1-auc:0.83647
Traceback (most recent call last):
  File "so-script.py", line 74, in <module>
    probs = model.predict_proba(test_data)
  File "/usr/local/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/sklearn.py", line 477, in predict_proba
    ntree_limit=ntree_limit)
  File "/usr/local/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 941, in predict
    self._validate_features(data)
  File "/usr/local/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 1181, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2',... long list of effs...

', 'f366'] [u'var3', u'var15', u'im ...long list of named variables...
expected f169, f168, f161, f160, ... long list of effs ....

... , f71, f72, f73 in input data
training data did not have the following fields: imp_ent_var16_ult1,      num_aport_var13_hace3, ... long list of named variables


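From what I can tell, the error is naming every variable, which suggests that I have misapplied the library, but I cannot figure out what I did wrong. Other scripts tend to preprocess the data more heavily before applying the algorithm, but I do not see why that would matter here, since I leave the train and test sets in the same state.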

Can anyone point out what I am doing wrong?
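One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: the model side lists positional names ('f0' through 'f366'), while the input side lists real column names (u'var3', u'var15', ...). That pattern would be consistent with training on a bare array, which carries no column names, and predicting on a table that still has its headers. The snippet below is a simplified, standard-library-only imitation of that kind of name check. It is not xgboost's actual code, and the helper names `default_feature_names` and `validate_features` are hypothetical.

```python
# Simplified illustration only; this is NOT xgboost's real implementation.
# Helper names are hypothetical and exist just for this sketch.

def default_feature_names(n_features):
    """Positional names used when the training matrix has no column names."""
    return ["f{}".format(i) for i in range(n_features)]


def validate_features(trained_names, input_names):
    """Raise ValueError if the input's feature names differ from the trained ones."""
    if list(trained_names) != list(input_names):
        missing = sorted(set(trained_names) - set(input_names))
        unexpected = sorted(set(input_names) - set(trained_names))
        raise ValueError(
            "feature_names mismatch: expected {} in input data; "
            "training data did not have the following fields: {}".format(
                ", ".join(missing), ", ".join(unexpected)
            )
        )


# Training on a bare array: only positional names are known.
trained = default_feature_names(3)

# Predicting on a table that still carries its column headers
# (column names taken from the traceback in the question).
named_input = ["var3", "var15", "imp_ent_var16_ult1"]

try:
    validate_features(trained, named_input)
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)

# Supplying the test data in the same form as the training data
# (here: positional names only) passes the check without raising.
validate_features(trained, default_feature_names(len(named_input)))
```

If this reading applies, the thing to try would be passing the holdout set in the same form as the training set: either both as plain arrays (for example via the DataFrame's `.values`) or both as DataFrames with identical columns in the same order.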

0 answers:

No answers