The results I get from sklearn cross_val_score and sklearn learning_curve are wildly different. My own CV-scoring code (which iterates over training sizes itself, uses StratifiedKFold for the CV splits, and collects the DecisionTree scores) agrees closely with cross_val_score.
Comparing my learning curve with sklearn's:
My custom code's scores (last point):
[0.771, 0.829, 0.786, 0.838, 0.794, 0.779, 0.809, 0.882, 0.868, 0.809] mean 0.817 std 0.035
sklearn cross_val_score:
[0.800, 0.814, 0.857, 0.824, 0.838, 0.809, 0.838, 0.809, 0.853, 0.779] mean 0.822 std 0.025
sklearn learning_curve:
[0.789, 0.644, 0.944, 0.955, 0.843, 0.618, 0.697, 0.921, 0.584, 0.864] mean 0.786 std 0.133
I can accept that I may have messed up my own code, but the learning_curve scores are all over the place: lower than the cross_val_score results and with far more variance. Admittedly, the resulting curve does look more textbook-smooth...
I pass the same DecisionTree, with a fixed random_state and max_leaf_nodes=7, to all of them. Inside my learning-curve function I use StratifiedKFold with a random_state to ensure reproducibility; the same pinning can be applied to the sklearn calls themselves, as sketched below.
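A sketch (not the code I actually ran) of passing one pinned splitter to both sklearn functions so they see identical, reproducible folds; the lc_* names are just placeholders:

from sklearn.model_selection import StratifiedKFold, cross_val_score, learning_curve
# pin the folds so both functions evaluate on the same reproducible splits
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=31)
scores = cross_val_score(dt, X_train, y_train, cv=cv)
lc_sizes, lc_tr, lc_va = learning_curve(dt, X_train, y_train, cv=cv)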
Does anyone know why the two sklearn functions disagree?
Edit: code added as requested.
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(splitter='random', random_state=31, max_leaf_nodes=7)
# cross_val_score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=dt, X=X_train, y=y_train, cv=10, n_jobs=1)
# sklearn learning_curve
from sklearn.model_selection import learning_curve
z = len(X_train)
sz = [int(z * i / 100) for i in range(5, 90, 3)]
# was this (the bug: passing the full dataset instead of the training set):
# sz, tr, va = learning_curve(dt, X, y, train_sizes=sz, cv=10, n_jobs=1)
sz, tr, va = learning_curve(dt, X_train, y_train, train_sizes=sz, cv=10, n_jobs=1)
# homebrew solution
from sklearn.model_selection import train_test_split, StratifiedKFold

def dt_learning(X, y, dt, rs=0):
    # generates a learning curve for training sizes from 5% to 79% of the data X
    # (assumes X and y are numpy arrays, since fold indices are used for slicing)
    tr_scores = {}
    va_scores = {}
    te_scores = []  # returned for interface compatibility; not filled in here
    size = []
    for i in range(5, 80, 3):
        # new training set, from 5% to 79% of the data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=1 - (i / 100), random_state=rs, stratify=y)
        sz = len(y_train)
        size.append(sz)
        # new CV sets (shuffle=True is required for random_state to take effect)
        kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=rs).split(X_train, y_train)
        cv_sets = [[train, val] for train, val in kfold]
        # train the classifier on each CV split and record accuracy scores
        tr_scores[sz] = []
        va_scores[sz] = []
        for k, (train, val) in enumerate(cv_sets):
            dt.fit(X_train[train], y_train[train])
            tr_scores[sz].append(dt.score(X_train[train], y_train[train]))
            va_scores[sz].append(dt.score(X_train[val], y_train[val]))
    return size, tr_scores, va_scores, te_scores
# run homebrew learning curve
size, tr_scores, va_scores, te_scores = dt_learning(X, y, dt, rs=7)
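The score lists quoted at the top compare the fold scores at the largest training size of each method; roughly this summary (a sketch using the variables defined above; 'biggest' is just a convenience name):

import numpy as np
# fold scores at the largest training size of each method
print("cross_val_score:", scores.mean(), scores.std())
print("learning_curve :", va[-1].mean(), va[-1].std())  # va has shape (n_sizes, n_folds)
biggest = size[-1]
print("homebrew       :", np.mean(va_scores[biggest]), np.std(va_scores[biggest]))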
Answer 0 (score: -1):
Ah, I was passing all of the X, y data (training + test sets) to learning_curve, while the other two scores only ever looked at X_train, y_train. So it was comparing different datasets, which explains the different results.
Now fixed (in the code added to the original question, the incorrect command is shown commented out, replaced by the line supplying X_train, y_train).
The learning_curve scores and their spread are now similar to the other methods.
So it was just a coding error!
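For completeness, a minimal plotting sketch for the corrected curve (my addition, assuming the sz, tr, va variables from the question's code):

import matplotlib.pyplot as plt
# mean train/validation accuracy per training size, with a +/- 1 std band
plt.plot(sz, tr.mean(axis=1), label='train')
plt.plot(sz, va.mean(axis=1), label='validation')
plt.fill_between(sz, va.mean(axis=1) - va.std(axis=1),
                 va.mean(axis=1) + va.std(axis=1), alpha=0.2)
plt.xlabel('training set size')
plt.ylabel('accuracy')
plt.legend()
plt.show()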