Cross-validation and model selection

Asked: 2016-02-15 14:23:34

Tags: python numpy machine-learning scikit-learn cross-validation

I am using scikit-learn to train an SVM. I use cross-validation to evaluate the estimator and to avoid overfitting the model.

I split the data into two parts: training data and test data. Here is the code:

import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()  # load the iris dataset used below
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_validation.cross_val_score(clf, X_train, y_train, cv=5)
print(scores)

# Now I need to evaluate the estimator *clf* on X_test.
clf.score(X_test,y_test)
# Here I get an error saying that the model has not been fitted with fit().
# But isn't the model fitted inside cross_val_score? What is the problem?

1 Answer:

Answer 0 (score: 7)

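cross_val_score is basically a convenience wrapper around the sklearn cross-validation iterators. You give it a classifier and your whole (training + validation) dataset, and it automatically performs one or more rounds of cross-validation: it splits your data into training/validation folds, fits on the training part of each fold, and computes the score on the validation part. See the documentation for an example and further explanation.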
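To make the description above concrete, here is a minimal sketch of roughly what cross_val_score does for a classifier. It is not the library's actual implementation. It assumes a newer scikit-learn release in which the cross-validation utilities live in sklearn.model_selection (the question uses the older sklearn.cross_validation module), and the names fold_clf and manual_scores are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC(kernel="linear", C=1)

# Assumption: 5 stratified folds without shuffling, passed explicitly
# so that the manual loop and cross_val_score use identical splits.
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    fold_clf = clone(clf)                     # fresh, unfitted copy for each fold
    fold_clf.fit(X[train_idx], y[train_idx])  # fit on the training part only
    manual_scores.append(fold_clf.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# The convenience wrapper, run on the same folds for comparison.
library_scores = cross_val_score(clf, X, y, cv=cv)
```

Because every fold works on a clone, the loop never calls fit on clf itself.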

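The reason clf.score(X_test, y_test) raises an exception is that cross_val_score performs the fitting on a copy of the estimator rather than on the original (see the use of clone(estimator) in the source code). Because of this, clf remains unchanged outside the function call, so it is still unfitted when you call clf.score.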
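One way to get a score on the held-out test set is to fit clf yourself on the training data after cross-validating. The sketch below assumes a newer scikit-learn release (sklearn.model_selection and sklearn.exceptions instead of the sklearn.cross_validation module from the question); fitted_after_cv and test_score are illustrative names.

```python
from sklearn import datasets, svm
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)

clf = svm.SVC(kernel="linear", C=1)

# Cross-validation fits clones of clf, so clf itself stays unfitted.
scores = cross_val_score(clf, X_train, y_train, cv=5)

try:
    clf.score(X_test, y_test)
    fitted_after_cv = True
except NotFittedError:
    fitted_after_cv = False  # expected: this reproduces the error from the question

# Fit the original estimator explicitly, then evaluate on the test set.
clf.fit(X_train, y_train)
test_score = clf.score(X_test, y_test)
print(scores.mean(), test_score)
```

The cross-validation scores estimate how well the model generalizes using only the training split, while the final fit and score give a separate check on data that played no part in that estimate.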