SciKit-Learn: very different results from cross-validation

Time: 2017-07-17 15:35:16

Tags: python scikit-learn cross-validation

I am using SciKit-Learn 0.18.1 with Python 2.7 for some basic machine learning, and I am trying to evaluate how good my model is with cross-validation. When I do this:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score, KFold

cv = KFold(n=5, random_state = 100)

clf = RandomForestRegressor(n_estimators=400, max_features = 0.5, verbose = 2, max_depth=30, min_samples_leaf=3)
score = cross_val_score(estimator = clf, X = X, y = y, cv = cv, n_jobs = -1, 
                        scoring = "neg_mean_squared_error")
avg_score = np.mean([np.sqrt(-x) for x in score])
std_dev = y.std()
print "avg_score: {}, std_dev: {}, avg_score/std_dev: {}".format(avg_score, std_dev, avg_score/std_dev)

I get a low avg_score (~9K).

What worries me is that, even though I specified 5 folds, my score array contains only 3 items. By contrast, when I use this import instead:

from sklearn.model_selection import KFold, cross_val_score

and run the same code (except that n becomes n_splits), my RMSE is much worse (~24K).

Any idea what is going on here?

Thanks!

1 Answer:

Answer 0: (score: 1)

cv = KFold(n=5, random_state = 100)

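According to http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html#sklearn.model_selection.KFold, n is the total number of samples and n_folds (default 3) is the number of CV folds. So it looks like you ran a 3-fold CV over only 5 samples, which would explain both the 3 scores and the difference in RMSE. Try passing the fold count as n_folds instead (n then has to be the total number of samples, e.g. len(y)).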
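To make the n versus n_folds mix-up concrete, here is a minimal sketch in plain Python that mimics how the old sklearn.cross_validation.KFold lays out folds when shuffling is off. The helper legacy_kfold_indices is a hypothetical stand-in written for illustration, not the library's code, and the sample count of 1000 is made up since the question does not say how large the data set is.

```python
def legacy_kfold_indices(n, n_folds=3):
    """Hypothetical stand-in for the old cross_validation.KFold (shuffle off).

    Yields (train, test) index lists over range(n), using consecutive blocks.
    """
    indices = list(range(n))
    # The first n % n_folds folds each get one extra sample.
    fold_sizes = [n // n_folds + (1 if i < n % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# What KFold(n=5) asks for: 5 samples, default of 3 folds.
accidental = list(legacy_kfold_indices(n=5))

# What was probably intended: every sample, 5 folds (1000 is an assumed size).
intended = list(legacy_kfold_indices(n=1000, n_folds=5))

print(len(accidental))                    # 3
print([test for _, test in accidental])   # [[0, 1], [2, 3], [4]]
print(len(intended))                      # 5
print(len(intended[0][1]))                # 200
```

In the accidental case every train and test index is below 5, so each model is fit and scored on a handful of the first rows only. The ~9K figure is therefore not comparable with the ~24K obtained from sklearn.model_selection with n_splits=5, which splits the full data set.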