I am using a decision tree with cross-validation; I rebuild the tree n times to search for the best depth, but at every depth level (1-20) I still get 100% accuracy, even though I split the training data via cross-validation and vary the tree depth to try to avoid overfitting. The code is below, and the data is here: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import cross_validation
from sklearn.cross_validation import KFold
features = ['birad','age','Shape','margin','density','severity']
df = pd.read_csv('mammographic_masses.data',header=None,names=features)
df= df[df.birad != '?']
df= df[df.age != '?']
df= df[df.Shape != '?']
df= df[df.margin != '?']
df= df[df.density != '?']
x = df[features][:-1]
y = df['severity'][:-1]
x_train,x_test,y_train,y_test = cross_validation.train_test_split(x,y,test_size=0.4,random_state=0)
depth = []
best_depth = 3
best_score = 0
for i in range(1, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    clf = clf.fit(x_train, y_train)
    scores = cross_validation.cross_val_score(clf, x_train, y_train, cv=10)
    ascore = clf.score(x_test, y_test)
    #print "DEPTH = ", i
    #print "PSCORES = ", sum(scores)/float(len(scores))
    #print "ASCORE = ", ascore
    #print
    depth.append((i, clf.score(x_test, y_test)))
    if ascore > best_score:
        best_score, best_depth = ascore, i
print best_depth, ' ', best_score
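As an aside for readers on current scikit-learn: the sklearn.cross_validation module used above was deprecated and later removed in favor of sklearn.model_selection. A minimal sketch of the same depth search with the modern imports, using synthetic stand-in data rather than the mammographic-mass file (the array shapes and the range of depths are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Toy stand-ins for the features and the 'severity' target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Pick the depth with the best mean 10-fold CV score on the
# training split, rather than scoring against the test split.
best_depth, best_score = None, 0.0
for depth in range(1, 20):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_mean = cross_val_score(clf, X_train, y_train, cv=10).mean()
    if cv_mean > best_score:
        best_score, best_depth = cv_mean, depth

print(best_depth, best_score)
```

Note that selecting the depth by the cross-validation mean (instead of by the test-set score, as in the question) keeps the test split untouched for a final evaluation.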
Answer 0 (score: 0)
I found a bug in my code that resolved the problem. I had written
x = df[features][:-1]
when I should have written
x = df[features[:-1]]
In other words, my input included the target/outcome column, which is why I was getting perfect accuracy.
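The difference between the two lines is where the slice applies. A small self-contained demo with a hypothetical two-row frame mirroring the question's columns:

```python
import pandas as pd

# Hypothetical mini-frame with the same column names as the question.
features = ['birad', 'age', 'Shape', 'margin', 'density', 'severity']
df = pd.DataFrame([[4, 50, 1, 1, 3, 0],
                   [5, 60, 2, 2, 3, 1]], columns=features)

# df[features][:-1] selects ALL columns (including the target
# 'severity') and then drops the last ROW -> target leakage.
leaky = df[features][:-1]
print(list(leaky.columns))  # 'severity' is still present

# df[features[:-1]] slices the LIST of column names first, dropping
# 'severity', then selects only those feature columns.
clean = df[features[:-1]]
print(list(clean.columns))  # no 'severity'
```

Because the leaky version feeds the target to the classifier as a feature, the tree can trivially reach 100% accuracy at any depth.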