What is the difference between AUC and GridSearchCV AUC?

Asked: 2017-10-25 19:07:50

Tags: python scikit-learn cross-validation grid-search auc

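I am building an MLPClassifier model in scikit-learn. I use GridSearchCV with roc_auc to score the model. The mean train and test scores are around 0.76, which is not bad. The output of cv_results_ is: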

Train set AUC:  0.553465272412
Grid best score (AUC):  0.757236688092
Grid best parameter (max. AUC):  {'hidden_layer_sizes': 10}

{   'mean_fit_time': array([63.54, 136.37, 136.32, 119.23, 121.38, 124.03]),
    'mean_score_time': array([ 0.04,  0.04,  0.04,  0.05,  0.05,  0.06]),
    'mean_test_score': array([ 0.76,  0.74,  0.75,  0.76,  0.76,  0.76]),
    'mean_train_score': array([ 0.76,  0.76,  0.76,  0.77,  0.77,  0.77]),
    'param_hidden_layer_sizes': masked_array(data = [5 (5, 5) (5, 10) 10 (10, 5) (10, 10)],
             mask = [False False False False False False],
       fill_value = ?)
,
    'params': [   {'hidden_layer_sizes': 5},
                  {'hidden_layer_sizes': (5, 5)},
                  {'hidden_layer_sizes': (5, 10)},
                  {'hidden_layer_sizes': 10},
                  {'hidden_layer_sizes': (10, 5)},
                  {'hidden_layer_sizes': (10, 10)}],
    'rank_test_score': array([   2,    6,    5,    1,    4,    3]),
    'split0_test_score': array([ 0.76,  0.75,  0.75,  0.76,  0.76,  0.76]),
    'split0_train_score': array([ 0.76,  0.75,  0.75,  0.76,  0.76,  0.76]),
    'split1_test_score': array([ 0.77,  0.76,  0.76,  0.77,  0.76,  0.76]),
    'split1_train_score': array([ 0.76,  0.75,  0.75,  0.76,  0.76,  0.76]),
    'split2_test_score': array([ 0.74,  0.72,  0.73,  0.74,  0.74,  0.75]),
    'split2_train_score': array([ 0.77,  0.77,  0.77,  0.77,  0.77,  0.77]),
    'std_fit_time': array([47.59,  1.29,  1.86,  3.43,  2.49,  9.22]),
    'std_score_time': array([ 0.01,  0.01,  0.01,  0.00,  0.00,  0.01]),
    'std_test_score': array([ 0.01,  0.01,  0.01,  0.01,  0.01,  0.01]),
    'std_train_score': array([ 0.01,  0.01,  0.01,  0.01,  0.01,  0.00])}

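As you can see, I used a KFold of 3. Interestingly, the manually computed roc_auc_score for the training set is reported as 0.55, whereas the mean train score is reported as ~0.76. The code that generated this output is: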

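import pprint
import time

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def model_mlp (X_train, y_train, verbose=True, random_state = 42):
    t = time.time()  # start time, used by the runtime report at the end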
    grid_values = {'hidden_layer_sizes': [(5), (5,5), (5, 10),
                                          (10), (10, 5), (10, 10)]}

    # MLP requires scaling of all predictors
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)

    mlp = MLPClassifier(solver='adam', learning_rate_init=1e-4,
                        max_iter=200,
                        verbose=False,
                        random_state=random_state)
    # perform the grid search
    grid_auc = GridSearchCV(mlp, 
                            param_grid=grid_values,
                            scoring='roc_auc', 
                            verbose=2, n_jobs=-1)
    grid_auc.fit(X_train, y_train)
    y_hat = grid_auc.predict(X_train)

    # print out the results
    if verbose:
        print('Train set AUC: ', roc_auc_score(y_train, y_hat))
        print('Grid best score (AUC): ', grid_auc.best_score_)
        print('Grid best parameter (max. AUC): ', grid_auc.best_params_)
        print('')
        pp = pprint.PrettyPrinter(indent=4)
        pp.pprint (grid_auc.cv_results_)
        print ('MLPClassifier fitted, {:.2f} seconds used'.format (time.time () - t))

    return grid_auc.best_estimator_
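
One detail in the code above may be worth checking. The built-in 'roc_auc' scorer used by GridSearchCV evaluates continuous scores (decision_function or predict_proba), whereas the manual line passes the hard labels from predict to roc_auc_score. The two generally give different numbers. The sketch below illustrates this on synthetic data; the dataset, the LogisticRegression stand-in for the MLP, and all variable names are assumptions for illustration, and whether this accounts for the gap reported here cannot be confirmed from the post alone.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, roc_auc_score

# Synthetic, imbalanced binary problem (an assumption, not the asker's data).
X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           weights=[0.9, 0.1], flip_y=0.05, random_state=42)

# LogisticRegression is only a fast stand-in for MLPClassifier here.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# AUC computed from hard 0/1 predictions, as in the manual calculation.
auc_from_labels = roc_auc_score(y, clf.predict(X))

# AUC computed from the predicted probability of the positive class.
auc_from_scores = roc_auc_score(y, clf.predict_proba(X)[:, 1])

# AUC as computed by the scorer that scoring='roc_auc' refers to.
auc_from_scorer = get_scorer('roc_auc')(clf, X, y)

print('AUC from predict():       ', auc_from_labels)
print('AUC from predict_proba(): ', auc_from_scores)
print("AUC from 'roc_auc' scorer:", auc_from_scorer)
```

If the scorer value matches the predict_proba value rather than the predict value, the manual calculation and GridSearchCV are not measuring the same quantity, independent of how the data is split.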

Because of this discrepancy I decided to 'emulate' the GridSearchCV routine and got the following results:

Shape X_train: (107119, 15)
Shape y_train: (107119,)
Shape X_val: (52761, 15)
Shape y_val: (52761,)
       layers    roc-auc
  Seq  l1  l2  train   test iters runtime
    1   5   0 0.5522 0.5488    85   20.54
    2   5   5 0.5542 0.5513    80   27.10
    3   5  10 0.5544 0.5521    83   28.56
    4  10   0 0.5532 0.5516    61   15.24
    5  10   5 0.5540 0.5518    54   19.86
    6  10  10 0.5507 0.5474    56   21.09

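The scores are all around 0.55, consistent with the manual calculation in the code above. What surprises me is the lack of variation in the results. It looks as if I have made a mistake somewhere, but I cannot find one. Here is the code: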

def simple_mlp (X, y, verbose=True, random_state = 42):
    def do_mlp (X_t, X_v, y_t, y_v, n, l1, l2=None):
        if l2 is None:
            layers = (l1)
            l2 = 0
        else:
            layers = (l1, l2)

        t = time.time ()
        mlp = MLPClassifier(solver='adam', learning_rate_init=1e-4,
                            hidden_layer_sizes=layers,
                            max_iter=200,
                            verbose=False,
                            random_state=random_state)
        mlp.fit(X_t, y_t)
        y_hat_train = mlp.predict(X_t)
        y_hat_val = mlp.predict(X_v)
        if verbose:
            av = 'samples'
            acc_trn = roc_auc_score(y_train, y_hat_train, average=av)
            acc_tst = roc_auc_score(y_val, y_hat_val, average=av)
            print ("{:5d}{:4d}{:4d}{:7.4f}{:7.4f}{:9d}{:8.2f}"
                   .format(n, l1, l2, acc_trn, acc_tst,  mlp.n_iter_, time.time() - t))
        return mlp, n + 1

    X_train, X_val, y_train, y_val = train_test_split (X, y, test_size=0.33, random_state=random_state)
    if verbose:
        print('Shape X_train:', X_train.shape)
        print('Shape y_train:', y_train.shape)
        print('Shape X_val:', X_val.shape)
        print('Shape y_val:', y_val.shape)

    # MLP requires scaling of all predictors
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)

    n = 1
    layers1 = [5, 10]
    layers2 = [5, 10]
    if verbose:
        print ("       layers    roc-auc")
        print ("  Seq  l1  l2  train validation iters runtime")
    for l1 in layers1:
        mlp, n = do_mlp (X_train, X_val, y_train, y_val, n, l1)
        for l2 in layers2:
            mlp, n = do_mlp (X_train, X_val, y_train, y_val, n, l1, l2)

    return mlp

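In both cases I use exactly the same data (159880 observations and 15 predictors). I use GridSearchCV with cv=3 (the default) and a validation set of the same proportion in my hand-written code. While searching for a possible answer I found this post on SO, which describes the same problem. It has no answer. Perhaps someone understands what is going on?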

Thank you for your time.

Edit

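I checked the code of GridSearchCV and KFold, as @Mohammed Kashif suggested, and indeed found an explicit comment that KFold does not shuffle the data. So I added the following code to model_mlp, before the scaler: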

np.random.seed (random_state)
index = np.random.permutation (len(X_train))
X_train = X_train.iloc[index]

In simple_mlp I replaced train_test_split with:

np.random.seed (random_state)
index = np.random.permutation (len(X))
X = X.iloc[index]
y = y.iloc[index]
train_size = int (2 * len(X) / 3.0) # sample of 2 third
X_train = X[:train_size]
X_val = X[train_size:]
y_train = y[:train_size]
y_val = y[train_size:]

This produced the following output:

Train set AUC:  0.5
Grid best score (AUC):  0.501410198106
Grid best parameter (max. AUC):  {'hidden_layer_sizes': (5, 10)}

{   'mean_fit_time': array([28.62, 46.00, 54.44, 46.74, 55.25, 53.33]),
    'mean_score_time': array([ 0.04,  0.05,  0.05,  0.05,  0.05,  0.06]),
    'mean_test_score': array([ 0.50,  0.50,  0.50,  0.50,  0.50,  0.50]),
    'mean_train_score': array([ 0.50,  0.51,  0.51,  0.51,  0.50,  0.51]),
    'param_hidden_layer_sizes': masked_array(data = [5 (5, 5) (5, 10) 10 (10, 5) (10, 10)],
             mask = [False False False False False False],
       fill_value = ?)
,
    'params': [   {'hidden_layer_sizes': 5},
                  {'hidden_layer_sizes': (5, 5)},
                  {'hidden_layer_sizes': (5, 10)},
                  {'hidden_layer_sizes': 10},
                  {'hidden_layer_sizes': (10, 5)},
                  {'hidden_layer_sizes': (10, 10)}],
    'rank_test_score': array([   6,    2,    1,    4,    5,    3]),
    'split0_test_score': array([ 0.50,  0.50,  0.51,  0.50,  0.50,  0.50]),
    'split0_train_score': array([ 0.50,  0.51,  0.50,  0.51,  0.50,  0.51]),
    'split1_test_score': array([ 0.50,  0.50,  0.50,  0.50,  0.49,  0.50]),
    'split1_train_score': array([ 0.50,  0.50,  0.51,  0.50,  0.51,  0.51]),
    'split2_test_score': array([ 0.49,  0.50,  0.49,  0.50,  0.50,  0.50]),
    'split2_train_score': array([ 0.51,  0.51,  0.51,  0.51,  0.50,  0.51]),
    'std_fit_time': array([19.74, 19.33,  0.55,  0.64,  2.36,  0.65]),
    'std_score_time': array([ 0.01,  0.01,  0.00,  0.01,  0.00,  0.01]),
    'std_test_score': array([ 0.01,  0.00,  0.01,  0.00,  0.00,  0.00]),
    'std_train_score': array([ 0.00,  0.00,  0.00,  0.00,  0.00,  0.00])}

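This seems to confirm Mohammed's remark. I must say I was skeptical at first, because I could not imagine randomization having such a strong effect on a large dataset that does not appear to be ordered.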

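However, I still have some doubts. In the original setup GridSearchCV was consistently too high by about 0.20; now it is consistently too low by about 0.05. That is an improvement, since the deviation between the two methods has shrunk by a factor of 4. Is there an explanation for this last finding, or is the remaining deviation of about 0.05 between the two methods just noise? I decided to mark this as the correct answer, but I hope someone can shed some light on my remaining doubt.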
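
A note on the model_mlp snippet in the edit: as posted, it reorders X_train only, whereas the simple_mlp snippet reorders both X and y. If y_train was not permuted with the same index, rows and labels would no longer correspond, which would be consistent with scores of about 0.50. The post does not show enough to confirm that this is what happened. A minimal sketch of permuting predictors and labels together, on toy data with hypothetical names:

```python
import numpy as np
import pandas as pd

random_state = 42

# Toy stand-ins for the asker's data (assumed shapes and values).
X_train = pd.DataFrame({'a': range(6), 'b': range(10, 16)})
y_train = pd.Series([0, 1, 0, 1, 0, 1])

# One permutation, applied to both objects so each row keeps its label.
rng = np.random.RandomState(random_state)
index = rng.permutation(len(X_train))

X_shuffled = X_train.iloc[index]
y_shuffled = y_train.iloc[index]

print(X_shuffled.assign(label=y_shuffled))
```

sklearn.utils.shuffle(X_train, y_train, random_state=random_state) does the same in one call.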

1 Answer:

Answer 0 (score: 1)

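The difference in scores is mainly due to the different ways in which GridSearchCV and the function emulating it split the dataset. Think of it this way. Suppose you have 9 data points in your dataset. With 3 folds in GridSearchCV, suppose the distribution is like this: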

train_cv_fold1_indices : 1 2 3 4 5 6 
test_cv_fold1_indices  : 7 8 9


train_cv_fold2_indices : 1 2 3 7 8 9 
test_cv_fold2_indices  : 4 5 6


train_cv_fold3_indices : 4 5 6 7 8 9 
test_cv_fold3_indices  : 1 2 3

However, the function that emulates GridSearchCV may split the data differently, for example:

train_indices : 1 3 5 7 8 9
test_indices  : 2 4 6

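Now, as you can see, this splits the dataset differently, so a classifier trained on it may behave quite differently. (It may even behave the same; it all depends on the data points and various other factors, such as how correlated they are and whether they help to discriminate between data points.)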
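
To make the splitting concrete, here is a small sketch (not part of the original answer) that prints the folds KFold produces for 9 points, with and without shuffling. Indices are 0-based here, unlike the 1-based illustration above, and the seed 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(-1, 1)  # 9 toy data points

# Default KFold: no shuffling, so each test fold is a contiguous block.
plain = [(tr.tolist(), te.tolist())
         for tr, te in KFold(n_splits=3).split(X)]

# Shuffled KFold: same fold sizes, but membership depends on the seed.
shuffled = [(tr.tolist(), te.tolist())
            for tr, te in KFold(n_splits=3, shuffle=True,
                                random_state=42).split(X)]

for name, folds in (('plain', plain), ('shuffled', shuffled)):
    for i, (tr, te) in enumerate(folds, start=1):
        print('{} fold {}: train={} test={}'.format(name, i, tr, te))
```

If the rows of a dataset are ordered in some way, the contiguous folds of the unshuffled variant can differ systematically from a random train/validation split.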

So, to emulate GridSearchCV exactly, you need to perform the split in the same way.

Check the GridSearchCV source and you will find that at line 592, to perform the CV, it calls another function, check_cv. That function in turn invokes either KFold CV or stratified CV.

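So, based on your experiment, I suggest explicitly performing CV on the dataset with a fixed random seed, using the functions mentioned above (KFold CV or stratified CV). Then use the same CV object in your emulation function to get a more comparable analysis. You may then obtain values that agree more closely.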
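
A minimal sketch of that suggestion, under stated assumptions: synthetic data in place of the asker's, a single grid point, StratifiedKFold as the shared CV object, and scaling omitted for brevity. The manual loop scores predict_proba output so that it measures the same quantity as scoring='roc_auc'.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=600, n_features=15, n_informative=5,
                           random_state=42)

# One CV object with a fixed seed, shared by both procedures.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500, random_state=42)

# Grid search with the shared splitter.
grid = GridSearchCV(mlp, param_grid={'hidden_layer_sizes': [(5,)]},
                    scoring='roc_auc', cv=cv)
grid.fit(X, y)
grid_scores = [grid.cv_results_['split{}_test_score'.format(i)][0]
               for i in range(cv.get_n_splits())]

# Manual emulation on exactly the same folds.
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    model = clone(mlp).fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]  # positive-class score
    manual_scores.append(roc_auc_score(y[test_idx], proba))

print('GridSearchCV per-fold AUC:', np.round(grid_scores, 4))
print('Manual per-fold AUC:      ', np.round(manual_scores, 4))
```

With identical folds, identical estimator settings, and the same kind of score, the two lists should agree; any remaining gap then points at something other than the split.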