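I seem to be misunderstanding how XGBoost is meant to be used. My model training appears to go fine, but when I apply the model to a holdout set, the code complains that it is being given new variables.
I'm working on the now-closed Kaggle competition "Santander Customer Satisfaction". You can see the dataset here: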
https://www.kaggle.com/c/santander-customer-satisfaction/data
Running my script produces the following:
[0] validation_0-auc:0.795057 validation_1-auc:0.791364
[1] validation_0-auc:0.804837 validation_1-auc:0.801891
[2] validation_0-auc:0.8113 validation_1-auc:0.809618
[3] validation_0-auc:0.813582 validation_1-auc:0.812916
[4] validation_0-auc:0.81944 validation_1-auc:0.817799
[5] validation_0-auc:0.822929 validation_1-auc:0.821458
[6] validation_0-auc:0.826112 validation_1-auc:0.82617
[7] validation_0-auc:0.830241 validation_1-auc:0.829682
[8] validation_0-auc:0.832868 validation_1-auc:0.832509
[9] validation_0-auc:0.835153 validation_1-auc:0.835
[10] validation_0-auc:0.836704 validation_1-auc:0.834465
[11] validation_0-auc:0.837821 validation_1-auc:0.834192
[12] validation_0-auc:0.839605 validation_1-auc:0.834907
[13] validation_0-auc:0.841186 validation_1-auc:0.836206
[14] validation_0-auc:0.842604 validation_1-auc:0.836774
[15] validation_0-auc:0.843251 validation_1-auc:0.837425
[16] validation_0-auc:0.844243 validation_1-auc:0.837063
[17] validation_0-auc:0.844695 validation_1-auc:0.837344
[18] validation_0-auc:0.845865 validation_1-auc:0.838409
[19] validation_0-auc:0.846729 validation_1-auc:0.837329
[20] validation_0-auc:0.847179 validation_1-auc:0.836978
[21] validation_0-auc:0.847953 validation_1-auc:0.83628
[22] validation_0-auc:0.848195 validation_1-auc:0.836093
[23] validation_0-auc:0.848502 validation_1-auc:0.836251
[24] validation_0-auc:0.848797 validation_1-auc:0.83647
Traceback (most recent call last):
  File "so-script.py", line 74, in <module>
    probs = model.predict_proba(test_data)
  File "/usr/local/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/sklearn.py", line 477, in predict_proba
    ntree_limit=ntree_limit)
  File "/usr/local/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 941, in predict
    self._validate_features(data)
  File "/usr/local/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 1181, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2',... long list of effs...
', 'f366'] [u'var3', u'var15', u'im ...long list of named variables...
expected f169, f168, f161, f160, ... long list of effs ....
... , f71, f72, f73 in input data
training data did not have the following fields: imp_ent_var16_ult1, num_aport_var13_hace3, ... long list of named variables
The above is what the script outputs; the traceback comes from the `model.predict_proba(test_data)` call.
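As far as I can tell, it is naming all of the variables, which suggests I have misapplied the library, but I can't work out what I've done wrong. Other scripts tend to apply more processing to the data before applying the algorithm, but I don't see why that would be a problem, since I'm leaving train and test in the same state.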
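To make my reading of the traceback concrete, here is a minimal sketch of the situation it seems to describe. It assumes the model was fit on input without column names (so the features were named `f0`, `f1`, ...), while `test_data` still carries its column names. The helpers `default_feature_names` and `validate_features` are hypothetical stand-ins written for illustration, not XGBoost functions, and the three column names are simply taken from the error message.

```python
def default_feature_names(n_features):
    """Names given to features when the input has no column names (assumed: f0, f1, ...)."""
    return ["f%d" % i for i in range(n_features)]


def validate_features(trained_names, input_names):
    """Hypothetical stand-in for the name check: raise if the two name lists differ."""
    if trained_names != input_names:
        missing = sorted(set(trained_names) - set(input_names))
        unexpected = sorted(set(input_names) - set(trained_names))
        raise ValueError(
            "feature_names mismatch: expected %s in input data; "
            "training data did not have the following fields: %s"
            % (", ".join(missing), ", ".join(unexpected))
        )


# Training on unnamed input with 3 columns: the names fall back to f0, f1, f2.
trained_names = default_feature_names(3)

# Predicting on a table that still has its column names (taken from the traceback).
test_columns = ["var3", "var15", "imp_ent_var16_ult1"]

try:
    validate_features(trained_names, test_columns)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# The same columns with their names stripped pass the check without raising.
validate_features(trained_names, default_feature_names(len(test_columns)))
```

If this reading is right, the data itself could be identical and only the names would differ, so giving the fit and `predict_proba` calls the same kind of input (both with column names, or both without) would presumably avoid the mismatch.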
Can anyone point out what I'm doing wrong?