I have been learning how to use scikit-learn, using the Santander Customer Satisfaction competition:
https://www.kaggle.com/c/santander-customer-satisfaction
I have run a grid search to tune the parameters of an XGBoost model and obtained a roc_auc score of 0.83. When I test the winning model against the hold-out set, however, it appears to have no predictive power at all and scores 0.50. I must be making a mistake somewhere in my script, but I cannot find where it goes wrong, nor do I know where to look.
My training script is as follows:
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import roc_auc_score
# reproducibility
seed = 342
np.random.seed(seed)
train_data = pd.read_csv("./train.csv")
test_data = pd.read_csv("./test.csv")
array = train_data.values
# Split-out validation dataset
X = array[:,0:369].astype(float)
Y = array[:,370].astype(int)
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}
# params fixed
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1
}
# grid search (fixed params are unpacked as keyword arguments)
grid_search = GridSearchCV(
    estimator=XGBClassifier(seed=seed, nthread=-1, **params_fixed),
    param_grid=params_grid,
    cv=10,
    verbose=1,
    scoring='roc_auc'
)
grid_search.fit(X_train, Y_train)
print grid_search.grid_scores_
print grid_search.best_score_
print grid_search.best_estimator_
This gives the following output (I have omitted the very long list of models):
0.83303461644
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
gamma=0, learning_rate=0.5, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=25, nthread=-1,
objective='binary:logistic', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=7, silent=True, subsample=1)
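As I understand it, the best_score_ value of 0.83 is the mean ROC AUC over the 10 cross-validation folds of X_train only (X_validation is never touched here), i.e. roughly equivalent to the following sanity check (my own sketch, not part of the script above):
from sklearn.model_selection import cross_val_score

# mean 10-fold cross-validated ROC AUC of the winning parameters on the training split
cv_scores = cross_val_score(grid_search.best_estimator_, X_train, Y_train,
                            cv=10, scoring='roc_auc')
print(cv_scores.mean())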
Here is the script used to calculate the score on the hold-out data:
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import roc_auc_score
# reproducibility
seed = 342
np.random.seed(seed)
train_data = pd.read_csv("./train.csv")
test_data = pd.read_csv("./test.csv")
array = train_data.values
# Split-out validation dataset
X = array[:,0:369].astype(float)
Y = array[:,370].astype(int)
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
watchlist = [(X_train, Y_train), (X_validation, Y_validation)]
model = XGBClassifier(
    base_score=0.5,
    colsample_bylevel=1,
    colsample_bytree=1,
    gamma=0,
    learning_rate=0.5,
    max_delta_step=0,
    max_depth=3,
    min_child_weight=1,
    missing=None,
    n_estimators=25,
    nthread=-1,
    objective='binary:logistic',
    reg_alpha=0,
    reg_lambda=1,
    scale_pos_weight=1,
    seed=7,
    silent=True,
    subsample=1
)
# single fit, keeping the watchlist evaluation history for the learning-curve plots below
model.fit(X_train, Y_train, eval_metric="auc", eval_set=watchlist, verbose=True)
predictions = model.predict(X_validation)
print roc_auc_score(Y_validation, predictions)
This outputs 0.503777444213, which is not what I expected: I anticipated a score somewhat lower than, but reasonably close to, 0.83.
Can anyone spot where I have gone wrong?
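One thing I was unsure about while debugging is whether re-typing the printed parameters into the second script by hand could introduce a mismatch. For reference, a sketch of how the fitted best_estimator_ (which GridSearchCV refits on all of X_train by default) could be carried over directly instead; the file name is just a placeholder and I have not confirmed this changes the result:
import joblib

# in the training script, after grid_search.fit(...)
joblib.dump(grid_search.best_estimator_, "xgb_best_estimator.joblib")

# in the hold-out script, instead of re-declaring XGBClassifier(...) by hand
model = joblib.load("xgb_best_estimator.joblib")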
Update, following the suggestion to plot learning curves
Plotting the learning curves (assuming I have interpreted this correctly) produces the following chart:
[learning-curve plot: train and validation AUC per boosting round, taken from the watchlist]
The values come from the watchlist defined above; I have edited the code above to show where I added it. I had omitted the plotting code at first.
As far as I can tell, this suggests that overfitting is not the culprit here. I suspect the mistake has something to do with how the validation score is being calculated in the first place, but that is only a gut feeling; I do not yet fully understand what I am doing.
For completeness, here is the code used to plot the learning curves, for better or worse. I realise I should have rewritten some of the labels from the place I found the code:
from matplotlib import pyplot

# AUC values recorded for the watchlist during training
results = model.evals_result()
epochs = len(results['validation_0']['auc'])
x_axis = range(0, epochs)
y_axis = range(0, 1)
# plot log loss
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['auc'], label='Train')
ax.plot(x_axis, results['validation_1']['auc'], label='Test')
ax.legend()
pyplot.ylabel('auc')
pyplot.show()
# plot classification error
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['auc'], label='Train')
ax.plot(x_axis, results['validation_1']['auc'], label='Test')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
pyplot.show()
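For what it is worth, I am also aware that scikit-learn ships its own learning_curve utility, which is a different technique from the watchlist-based plot above (cross-validated score against training-set size rather than AUC per boosting round). A sketch of how I believe it would be used here, with roc_auc as the metric:
from sklearn.model_selection import learning_curve
import numpy as np
from matplotlib import pyplot

# cross-validated ROC AUC as a function of training-set size, using the winning parameters
train_sizes, train_scores, valid_scores = learning_curve(
    model, X_train, Y_train, cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 5))
pyplot.plot(train_sizes, train_scores.mean(axis=1), label='Train')
pyplot.plot(train_sizes, valid_scores.mean(axis=1), label='CV')
pyplot.xlabel('Training examples')
pyplot.ylabel('roc_auc')
pyplot.legend()
pyplot.show()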