sklearn cross_val_score and learning_curve give very different results

Time: 2018-09-19 12:27:20

Tags: machine-learning scikit-learn

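sklearn's cross_val_score and learning_curve give very different results. My own CV scoring code (which iterates over the training-set sizes itself, uses StratifiedKFold for the CV splits, and collects the DecisionTree scores) agrees closely with cross_val_score. My learning curve compared with sklearn's: (image not available)

Scores from my custom code (last point):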

[0.771, 0.829, 0.786, 0.838, 0.794, 0.779, 0.809, 0.882, 0.868, 0.809] mean 0.817 std 0.035

sklearn cross_val_score:

[0.800, 0.814, 0.857, 0.824, 0.838, 0.809, 0.838, 0.809, 0.853, 0.779] mean 0.822 std 0.025

sklearn learning_curve:

[0.789, 0.644, 0.944, 0.955, 0.843, 0.618, 0.697, 0.921, 0.584, 0.864] mean 0.786 std 0.133
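The three summaries above can be rechecked with a few lines of standard-library Python. This is only a sketch: the fold scores are copied from the question, and the use of the population standard deviation (what numpy.std computes by default) is an assumption about how the original figures were produced, so small rounding differences are possible.

```python
from statistics import mean, pstdev

# Fold scores copied verbatim from the question.
scores = {
    "custom code": [0.771, 0.829, 0.786, 0.838, 0.794, 0.779, 0.809, 0.882, 0.868, 0.809],
    "cross_val_score": [0.800, 0.814, 0.857, 0.824, 0.838, 0.809, 0.838, 0.809, 0.853, 0.779],
    "learning_curve": [0.789, 0.644, 0.944, 0.955, 0.843, 0.618, 0.697, 0.921, 0.584, 0.864],
}

# Assumption: population standard deviation, not the sample (n-1) version.
summary = {name: (mean(vals), pstdev(vals)) for name, vals in scores.items()}

for name, (m, s) in summary.items():
    lo, hi = min(scores[name]), max(scores[name])
    print(f"{name:16s} mean {m:.3f}  std {s:.3f}  range [{lo:.3f}, {hi:.3f}]")
```

The spread of the learning_curve folds is several times that of the other two lists, which is the symptom the question is about.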

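I can accept that I may have messed up my own code, but the learning_curve scores are all over the place, with a lower mean and a much larger spread than the cross_val_score results. Admittedly, the curve looks more like a smooth textbook one…

I pass the same DecisionTree, with a fixed random_state and max_leaf_nodes=7, to all of them. In my learning-curve function I use StratifiedKFold with a random_state to ensure reproducibility.

Does anyone know why the two sklearn functions disagree?

Edited to show the code, as requested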

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

dt = DecisionTreeClassifier(splitter='random', random_state=31, max_leaf_nodes=7)
# cross_val_score
scores = cross_val_score(estimator=dt, X=X_train, y=y_train, cv=10, n_jobs=1)

# using sklearn's learning_curve
from sklearn.model_selection import learning_curve
z = len(X_train)
sz = [int(z*i/100) for i in range(5, 90, 3)]
# was this (incorrect: it passes the full X, y rather than the training split):
#    sz, tr, va = learning_curve(dt, X, y, train_sizes=sz, cv=10, n_jobs=1)

sz, tr, va = learning_curve(dt, X_train, y_train, train_sizes=sz, cv=10, n_jobs=1)


# homebrew solution
from sklearn.model_selection import StratifiedKFold, train_test_split

def dt_learning(X, y, dt, rs=0):
    # generates a learning curve for training sizes from 5% to 77% of the data X
    # (X and y are assumed to be NumPy arrays, since they are indexed by position)
    tr_scores = {}
    va_scores = {}
    te_scores = []  # never filled; kept so the return signature is unchanged
    size = []
    for i in range(5, 80, 3):
        # new training set holding i% of the data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=1-(i/100), random_state=rs, stratify=y)
        sz = len(y_train)
        size.append(sz)
        # new CV sets; shuffle=True is added here because current scikit-learn
        # rejects random_state when shuffle is False
        kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=rs)
        cv_sets = list(kfold.split(X_train, y_train))

        # train classifier using CV and record accuracy scores
        tr_scores[sz] = []
        va_scores[sz] = []
        for train, val in cv_sets:
            dt.fit(X_train[train], y_train[train])
            tr_scores[sz].append(dt.score(X_train[train], y_train[train]))
            va_scores[sz].append(dt.score(X_train[val], y_train[val]))
    return size, tr_scores, va_scores, te_scores

# run homebrew learning curve
size, tr_scores, va_scores, te_scores = dt_learning(X, y, dt, rs=7)

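1 Answer:

Answer 0 (score: -1)

Ah, I was passing all of the X, y data (training + test sets) to learning_curve, while the other two scores only looked at X_train, y_train. So it was comparing different datasets, which would explain the different results.

This is now fixed (the code added to the original question shows the incorrect command commented out and replaced by a line that passes X_train, y_train).

The learning_curve plot now shows scores and variance similar to the other methods.

So, it was just a coding error!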
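To illustrate the point, here is a rough sketch showing that, when both functions receive the same data and the same cv=10, the last point of learning_curve (train_sizes=[1.0]) should line up with cross_val_score. The dataset is a synthetic stand-in generated with make_classification, which is an assumption since the original data is not shown, and the variable names va_same and va_full are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, learning_curve, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the question's dataset is not available).
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

dt = DecisionTreeClassifier(splitter='random', random_state=31, max_leaf_nodes=7)

# Ten validation scores from cross_val_score on the training split.
cv_scores = cross_val_score(dt, X_train, y_train, cv=10)

# Same data, same folds: train_sizes=[1.0] uses the whole training part of each fold.
_, _, va_same = learning_curve(dt, X_train, y_train, train_sizes=[1.0], cv=10)

# The original mistake: the full X, y, so the folds are cut from a different dataset.
_, _, va_full = learning_curve(dt, X, y, train_sizes=[1.0], cv=10)

print("cross_val_score        :", np.round(cv_scores, 3))
print("learning_curve, X_train:", np.round(va_same[0], 3))
print("learning_curve, full X :", np.round(va_full[0], 3))
```

On this synthetic data the third row will not necessarily look as erratic as in the question. The sketch only shows that the first two rows agree when the inputs agree, so a large gap between the two functions is a hint to check which arrays were actually passed.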