Decision tree always recovers perfect accuracy

Date: 2016-01-30 22:12:51

Tags: python machine-learning decision-tree cross-validation

I'm using a decision tree with cross-validation: I rebuild the tree n times searching for the best depth, but at every depth level (1–19) I get back 100% accuracy, even though I split the training data via cross-validation and vary the tree depth to try to avoid overfitting. The code is below; the data is at https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score

features = ['birad','age','Shape','margin','density','severity']

df = pd.read_csv('mammographic_masses.data', header=None, names=features)

# drop rows with missing values, which appear as '?' in this dataset
df = df[df.birad != '?']
df = df[df.age != '?']
df = df[df.Shape != '?']
df = df[df.margin != '?']
df = df[df.density != '?']

x = df[features][:-1]
y = df['severity'][:-1]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

depth = []
best_depth = 3
best_score = 0
for i in range(1, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    clf = clf.fit(x_train, y_train)
    scores = cross_val_score(clf, x_train, y_train, cv=10)
    ascore = clf.score(x_test, y_test)
    # print("DEPTH =", i)
    # print("PSCORES =", sum(scores) / float(len(scores)))
    # print("ASCORE =", ascore)
    depth.append((i, ascore))
    if ascore > best_score:
        best_score, best_depth = ascore, i
print(best_depth, ' ', best_score)
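As an aside, the manual depth loop can also be expressed with scikit-learn's GridSearchCV, which runs the same cross-validated search over max_depth. A minimal sketch on synthetic stand-in data (the real file and column names come from the question; the random integers here are only placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the mammographic-mass data: 5 predictors, binary target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(1, 6, size=(200, 5)),
                 columns=['birad', 'age', 'Shape', 'margin', 'density'])
y = rng.integers(0, 2, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# 10-fold CV over max_depth, mirroring the manual loop
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': list(range(1, 20))},
                      cv=10)
search.fit(x_train, y_train)

print(search.best_params_['max_depth'])
print(search.score(x_test, y_test))
```

On this random data the held-out score hovers around chance; on the real dataset (with the leak below fixed) it reflects genuine generalization.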

1 Answer:

Answer 0 (score: 0)

I found a bug in my code that resolves this. What I wrote was

x = df[features][:-1]

when what I should have written is

x = df[features[:-1]] 

In other words, my inputs included the target/outcome column, which is why I was getting perfect accuracy.
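The difference is subtle: `df[features][:-1]` selects all six columns and then drops the last row, whereas `df[features[:-1]]` slices the *list of column names*, excluding the 'severity' target while keeping every row. A tiny illustration (the three data rows below are made up purely for demonstration):

```python
import pandas as pd

features = ['birad', 'age', 'Shape', 'margin', 'density', 'severity']
df = pd.DataFrame([[5, 67, 3, 5, 3, 1],
                   [4, 43, 1, 1, 3, 1],
                   [5, 58, 4, 5, 3, 0]], columns=features)

# Bug: keeps ALL six columns (target included), merely drops the last ROW
leaky = df[features][:-1]

# Fix: slices the list of names first, so the target column is excluded
correct = df[features[:-1]]

print(leaky.shape)    # (2, 6) -> 'severity' still among the features
print(correct.shape)  # (3, 5) -> 'severity' removed
```

With the target leaked into the features, the tree only has to split on 'severity' itself to classify every sample, which is why the accuracy was always 100%.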