Question

I am using sklearn's tree package to analyze the training error and validation error of my decision tree model.

import numpy as np

# compute the misclassification rate (fraction of wrong predictions); this is not an RMS error
def compute_error(x, y, model):
    yfit = model.predict(x.toarray())
    return np.mean(y != yfit)

def drawLearningCurve(model, xTrain, yTrain, xTest, yTest):
    sizes = np.linspace(2, 25000, 50).astype(int)
    train_error = np.zeros(sizes.shape)
    crossval_error = np.zeros(sizes.shape)

    for i, size in enumerate(sizes):
        # fit on the first `size` training rows only
        model = model.fit(xTrain[:size, :].toarray(), yTrain[:size])

        # compute the validation error
        crossval_error[i] = compute_error(xTest, yTest, model)

        # compute the training error on the same rows used for fitting
        train_error[i] = compute_error(xTrain[:size, :], yTrain[:size], model)

from sklearn import tree
clf = tree.DecisionTreeClassifier()
drawLearningCurve(clf, xtr, ytr, xte, yte)

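The problem (and I don't know whether it really is one) is that when I pass a decision tree as the model to drawLearningCurve, I get a training error of 0.0 in every iteration of the loop. Is this related to the nature of my dataset, or to sklearn's tree package? Or is something else wrong?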

PS: Other models, such as naive Bayes, kNN, or ANN, definitely do not give a training error of 0.0.

Answer 1

The comments have already given some very useful pointers. I would add that the parameter you probably want to tune is called max_depth.
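As a rough illustration of why max_depth matters, here is a minimal sketch on synthetic data (the data, the seed, and the helper name `training_error` are assumptions made for this example, not part of the question). With default settings, a DecisionTreeClassifier keeps splitting until every leaf is pure, so it can memorize even random labels; capping the depth prevents that.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))       # synthetic continuous features (hypothetical data)
y = rng.integers(0, 2, size=200)    # random labels: there is nothing real to learn

def training_error(model, X, y):
    """Fit the model, then return the fraction of training samples it misclassifies."""
    return float(np.mean(model.fit(X, y).predict(X) != y))

# Unconstrained tree: grows until all leaves are pure
full_err = training_error(DecisionTreeClassifier(random_state=0), X, y)

# Depth-limited tree: at most 4 leaves, so it cannot memorize 200 random labels
shallow_err = training_error(DecisionTreeClassifier(max_depth=2, random_state=0), X, y)

print(full_err, shallow_err)
```

In this sketch the unconstrained tree reaches a training error of 0.0 while the depth-limited one does not, which suggests that a zero training error from a fully grown tree is expected behavior (overfitting) rather than a bug in the package.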

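What worries me more is that your compute_error function looks odd. The fact that you get an error of 0 suggests that your classifier makes no mistakes on the training set. However, if it did make mistakes, your error function would not necessarily tell you how many. When both operands are plain Python lists, as in the examples below, != compares the lists as a whole and yields a single boolean, so the result is always either 0.0 or 1.0: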

import numpy as np
np.mean([0,0,0,0] != [0,0,0,0]) # perfect match, error is 0
0.0

np.mean([0,0,0,0] != [1, 1, 1, 1]) # 100% wrong answers
1.0

np.mean([0,0,0,0] != [1, 1, 1, 0]) # 75% wrong answers
1.0

np.mean([0,0,0,0] != [1, 1, 0, 0]) # 50% wrong answers
1.0

np.mean([0,0,0,0] != [1, 1, 2, 2]) # 50% wrong answers
1.0

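What you want is np.sum(y != yfit) on NumPy arrays, so that the comparison is elementwise and counts the mismatches, or, better still, one of the metric functions that ship with sklearn, such as accuracy_score.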
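The difference between the list comparison and an elementwise comparison can be sketched as follows. The toy labels are made up for illustration; `accuracy_score` and `zero_one_loss` are both functions in `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, zero_one_loss

y_true = [0, 0, 0, 0]
y_pred = [1, 1, 1, 0]   # 3 of the 4 predictions are wrong

# list != list is a single bool (True here), so the mean is 1.0
list_result = float(np.mean(y_true != y_pred))

# Converting to arrays makes the comparison elementwise: 3 of 4 differ
array_result = float(np.mean(np.array(y_true) != np.array(y_pred)))

# Error rate via sklearn: 1 - accuracy, or zero_one_loss directly
error_from_accuracy = 1.0 - accuracy_score(y_true, y_pred)
error_from_loss = zero_one_loss(y_true, y_pred)

print(list_result, array_result, error_from_accuracy, error_from_loss)
```

Note that `predict` returns a NumPy array, so it is worth checking the types of `y` and `yfit` in compute_error to see which of these cases actually applies to your data.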

SkLearn Decision Trees: Overfitting or Bug?
