I'm trying to cross-validate my scores with scikit-learn, and I've run into a strange problem: a "manually" written StratifiedShuffleSplit loop scores much better than the built-in cross_val_score.
import pandas as pd
import numpy as np
import cPickle
import helper_functions
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import make_scorer
rf_clf = RandomForestRegressor(n_estimators=5)
with open("../../stashed_dims.pkl", 'rb') as fout:
    [TRAIN_X, TRAIN_Y, TEST_X, test_index] = cPickle.load(fout)
N_CV = 1
sss = StratifiedShuffleSplit(TRAIN_Y, N_CV, test_size=0.25, random_state=0)
for iterations, [local_train_index, local_test_index] in enumerate(sss):
    X_train, X_test = TRAIN_X[local_train_index], TRAIN_X[local_test_index]
    y_train, y_test = TRAIN_Y[local_train_index], TRAIN_Y[local_test_index]
    rf_clf.fit(X_train, y_train)
    pred = rf_clf.predict(X_test)
    print("Stratified Shuffle Split method 1")
    print(helper_functions.get_score(pred, y_test))
scorer = make_scorer(helper_functions.get_score)
scores = cross_val_score(rf_clf, TRAIN_X, TRAIN_Y, cv = sss, scoring = scorer, verbose = 10)
print("Stratified Shuffle Split method 2")
print(scores)
I can't see what the difference between these two methods is. Any ideas?
StratifiedShuffleSplit documentation
cross_val_score documentation
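Answer 0 (score: 1)
It's hard to say without the complete code (which wasn't provided), but at least from this snippet, you don't appear to be calling the scoring function the same way in both cases.
Explicit: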
print(helper_functions.get_score(pred, y_test))
Implicit:
scores = cross_val_score(... scoring = scorer ...)
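Answer 1 (score: 0)
Found my answer: the argument order matters for my scoring function.
foo(y_true, y_pred) != foo(y_pred, y_true) for this scoring function. The manual loop calls get_score(pred, y_test), whereas a scorer built with make_scorer calls the function as get_score(y_true, y_pred).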
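To illustrate why the order matters, here is a minimal sketch using only the standard library. The real helper_functions.get_score is not shown in the question, so the `mape` function below is a hypothetical stand-in: a mean absolute percentage error, which divides by its first argument and is therefore not symmetric. The sample values are made up for the demonstration.

```python
def mape(y_true, y_pred):
    """Hypothetical asymmetric metric (stand-in for get_score).

    Mean absolute percentage error: each error is divided by the
    corresponding y_true value, so swapping the arguments changes the result.
    """
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Made-up targets and predictions, for illustration only.
y_true = [1.0, 2.0, 4.0]
y_pred = [2.0, 2.0, 3.0]

# Order that a make_scorer-built scorer uses: (y_true, y_pred).
score_correct = mape(y_true, y_pred)

# Order used in the manual loop from the question: (pred, y_test).
score_swapped = mape(y_pred, y_true)

print("metric(y_true, y_pred):", score_correct)
print("metric(y_pred, y_true):", score_swapped)
```

If get_score behaves like this, changing the manual loop to call helper_functions.get_score(y_test, pred) should make both methods report comparable scores, since the split and the estimator settings are otherwise the same.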