I have been learning how to use scikit-learn, using the Santander Customer Satisfaction competition:
https://www.kaggle.com/c/santander-customer-satisfaction
I have run a grid search to tune the parameters of an XGBoost model and obtained a roc_auc score of 0.83. When I test the winning model against the hold-out set, however, it appears to have no predictive power at all and scores 0.50. I must be making a mistake somewhere in my script, but I cannot find where it goes wrong, nor do I know where to look.
My training script is as follows:
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import roc_auc_score
# reproducibility
seed = 342
np.random.seed(seed)
train_data = pd.read_csv("./train.csv")
test_data = pd.read_csv("./test.csv")
array = train_data.values
# Split-out validation dataset
X = array[:,0:369].astype(float)
Y = array[:,370].astype(int)
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}
# params fixed
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1
}
# grid search (fixed params are unpacked as keyword arguments)
grid_search = GridSearchCV(
    estimator=XGBClassifier(seed=seed, nthread=-1, **params_fixed),
    param_grid=params_grid,
    cv=10,
    verbose=1,
    scoring='roc_auc'
)
grid_search.fit(X_train, Y_train)
print grid_search.grid_scores_
print grid_search.best_score_
print grid_search.best_estimator_
This gives the following output (I have omitted the very long list of models):
0.83303461644
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
gamma=0, learning_rate=0.5, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=25, nthread=-1,
objective='binary:logistic', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=7, silent=True, subsample=1)
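As I understand it, the best_score_ value of 0.83 is the mean ROC AUC over the 10 cross-validation folds of X_train only (X_validation is never touched here), i.e. roughly equivalent to the following sanity check (my own sketch, not part of the script above):
from sklearn.model_selection import cross_val_score

# mean 10-fold cross-validated ROC AUC of the winning parameters on the training split
cv_scores = cross_val_score(grid_search.best_estimator_, X_train, Y_train,
                            cv=10, scoring='roc_auc')
print(cv_scores.mean())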
Here is the script used to calculate the score on the hold-out data:
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import roc_auc_score
# reproducibility
seed = 342
np.random.seed(seed)
train_data = pd.read_csv("./train.csv")
test_data = pd.read_csv("./test.csv")
array = train_data.values
# Split-out validation dataset
X = array[:,0:369].astype(float)
Y = array[:,370].astype(int)
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
watchlist = [(X_train, Y_train), (X_validation, Y_validation)]
model = XGBClassifier(
    base_score=0.5,
    colsample_bylevel=1,
    colsample_bytree=1,
    gamma=0,
    learning_rate=0.5,
    max_delta_step=0,
    max_depth=3,
    min_child_weight=1,
    missing=None,
    n_estimators=25,
    nthread=-1,
    objective='binary:logistic',
    reg_alpha=0,
    reg_lambda=1,
    scale_pos_weight=1,
    seed=7,
    silent=True,
    subsample=1
)
# single fit, keeping the watchlist evaluation history for the learning-curve plots below
model.fit(X_train, Y_train, eval_metric="auc", eval_set=watchlist, verbose=True)
predictions = model.predict(X_validation)
print roc_auc_score(Y_validation, predictions)
This outputs 0.503777444213, which is not what I expected: I anticipated a score somewhat lower than, but reasonably close to, 0.83.
Can anyone spot where I have gone wrong?
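One thing I was unsure about while debugging is whether re-typing the printed parameters into the second script by hand could introduce a mismatch. For reference, a sketch of how the fitted best_estimator_ (which GridSearchCV refits on all of X_train by default) could be carried over directly instead; the file name is just a placeholder and I have not confirmed this changes the result:
import joblib

# in the training script, after grid_search.fit(...)
joblib.dump(grid_search.best_estimator_, "xgb_best_estimator.joblib")

# in the hold-out script, instead of re-declaring XGBClassifier(...) by hand
model = joblib.load("xgb_best_estimator.joblib")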
Update, following the suggestion to plot learning curves
Plotting the learning curves (assuming I have interpreted this correctly) produces the following chart:
[learning-curve plot: train and validation AUC per boosting round, taken from the watchlist]
The values come from the watchlist defined above; I have edited the code above to show where I added it. I had omitted the plotting code at first.
As far as I can tell, this suggests that overfitting is not the culprit here. I suspect the mistake has something to do with how the validation score is being calculated in the first place, but that is only a gut feeling; I do not yet fully understand what I am doing.
For completeness, here is the code used to plot the learning curves, for better or worse. I realise I should have rewritten some of the labels from the place I found the code:
from matplotlib import pyplot

# AUC values recorded for the watchlist during training
results = model.evals_result()
epochs = len(results['validation_0']['auc'])
x_axis = range(0, epochs)
y_axis = range(0, 1)
# plot log loss
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['auc'], label='Train')
ax.plot(x_axis, results['validation_1']['auc'], label='Test')
ax.legend()
pyplot.ylabel('auc')
pyplot.show()
# plot classification error
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['auc'], label='Train')
ax.plot(x_axis, results['validation_1']['auc'], label='Test')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
pyplot.show()
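For what it is worth, I am also aware that scikit-learn ships its own learning_curve utility, which is a different technique from the watchlist-based plot above (cross-validated score against training-set size rather than AUC per boosting round). A sketch of how I believe it would be used here, with roc_auc as the metric:
from sklearn.model_selection import learning_curve
import numpy as np
from matplotlib import pyplot

# cross-validated ROC AUC as a function of training-set size, using the winning parameters
train_sizes, train_scores, valid_scores = learning_curve(
    model, X_train, Y_train, cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 5))
pyplot.plot(train_sizes, train_scores.mean(axis=1), label='Train')
pyplot.plot(train_sizes, valid_scores.mean(axis=1), label='CV')
pyplot.xlabel('Training examples')
pyplot.ylabel('roc_auc')
pyplot.legend()
pyplot.show()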