Why do cross_val_score and StratifiedShuffleSplit give very different results?

Asked: 2016-01-24 06:35:20

Tags: scikit-learn

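I'm trying to cross-validate my scores with scikit-learn, and I've run into a strange problem: "manually" looping over a StratifiedShuffleSplit gives a much better score than the built-in cross_val_score.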

import pandas as pd
import numpy as np
import cPickle

import helper_functions

from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import StratifiedShuffleSplit

from sklearn.cross_validation import cross_val_score
from sklearn.metrics import make_scorer

rf_clf = RandomForestRegressor(n_estimators=5)

with open("../../stashed_dims.pkl", 'rb') as fout:
    [TRAIN_X, TRAIN_Y, TEST_X, test_index] = cPickle.load(fout)


N_CV = 1
sss = StratifiedShuffleSplit(TRAIN_Y, N_CV, test_size=0.25, random_state=0)

for iterations, [local_train_index, local_test_index] in enumerate(sss):
    X_train, X_test = TRAIN_X[local_train_index], TRAIN_X[local_test_index]
    y_train, y_test = TRAIN_Y[local_train_index], TRAIN_Y[local_test_index]

    rf_clf.fit(X_train, y_train)
    pred = rf_clf.predict(X_test)

    print("Stratified Shuffle Split method 1")
    print(helper_functions.get_score(pred, y_test))

scorer = make_scorer(helper_functions.get_score)
scores = cross_val_score(rf_clf, TRAIN_X, TRAIN_Y, cv = sss, scoring = scorer, verbose = 10)
print("Stratified Shuffle Split method 2")
print(scores)

I can't tell what differs between the two approaches. Any ideas?

StratifiedShuffleSplit documentation
cross_val_score documentation

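2 answers:

Answer 0 (score: 1)

It's hard to say without the full code (which wasn't provided), but at least from this snippet it looks like you are not using the same scoring function in both cases.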

Explicit:

print(helper_functions.get_score(pred, y_test))

Implicit:

scores = cross_val_score(... scoring = scorer ...)
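To make the explicit/implicit distinction concrete, here is a minimal sketch of how a scorer built with make_scorer invokes the underlying function. The metric signed_bias and the toy data are hypothetical stand-ins (the real helper_functions.get_score is not shown in the question), and DummyRegressor is used only so that the predictions are known in advance. The point is that the scorer calls the function as score_func(y_true, y_pred), with the true values first.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def signed_bias(y_true, y_pred):
    """Hypothetical metric whose sign flips when the arguments are swapped."""
    return float(np.mean(np.asarray(y_pred) - np.asarray(y_true)))


# Toy data (an assumption for illustration only).
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicts 5.0, so predictions exceed the mean target (2.5) by 2.5.
model = DummyRegressor(strategy="constant", constant=[5.0]).fit(X, y)
pred = model.predict(X)

scorer = make_scorer(signed_bias)

via_scorer = scorer(model, X, y)    # implicit: scorer passes (y_true, y_pred)
truth_first = signed_bias(y, pred)  # explicit call, same argument order
pred_first = signed_bias(pred, y)   # explicit call, swapped order

print(via_scorer, truth_first, pred_first)
```

If the explicit call in the manual loop passes the predictions first, as get_score(pred, y_test) does in the question, it matches pred_first rather than what the scorer computes.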

Answer 1 (score: 0)

Found my answer: argument order matters for my scoring function.

For this scoring function, foo(y_true, y_pred) != foo(y_pred, y_true).
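As an illustration of why the order can matter, here is a small pure-Python sketch. The metric relative_error is a hypothetical example of an asymmetric score (it divides by the first argument), not the asker's actual get_score, and the numbers are made up.

```python
def relative_error(y_true, y_pred):
    """Hypothetical asymmetric metric: mean of |y_true - y_pred| / |y_true|."""
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Made-up values for illustration only.
y_test = [1.0, 2.0, 4.0]
pred = [2.0, 2.0, 3.0]

truth_first = relative_error(y_test, pred)  # order used by a make_scorer scorer
pred_first = relative_error(pred, y_test)   # order used in the manual loop

print(truth_first, pred_first)
```

Because the denominator comes from whichever list is passed first, the two calls disagree on the same data. Under that assumption, calling get_score(y_test, pred) in the manual loop should bring it in line with what cross_val_score reports.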