Is my decision tree classifier's performance too good?

Time: 2019-07-10 14:38:11

Tags: machine-learning scikit-learn classification jupyter decision-tree

I have a road-accident database. I need to build a classification decision tree in order to find interesting relationships between the features. I have 3 categorical features - "Severity", "Shift", "Day" - but when I use one of these features as the target variable, I get a tree with numeric classes. I would like the tree to handle these categorical features without numbers, or at least without float values. I one-hot encoded my categorical features before fitting the tree.

1. Setting up my features

  

    features = ["SK_Tik_Teuna", "Hour", "Year", "Month", "DriversInvolved", "Jewish",
                "UnknownReligon", "NotJewish", "UnknownCar", "Else", "Empty",
                "Distric", "Lighing", "Urban_NotUrban", "Crossroads_NotCrossroads",
                "Coordinates", "Area", "AccSeverity_A", "AccSeverity_B", "AccSeverity_C",
                "Day_D1", "Day_D2", "Day_D3", "Day_D4", "Day_D5", "Day_D6", "Day_D7",
                "Shift_A", "Shift_B", "Shift_C"]

2. Building the decision tree
Decision-tree classification

    print("Training size: {}; Test size: {}".format(len(train), len(test)))
    # result from the line above:
    # Training size: 2024; Test size: 998

    # define the tree parameters
    c = DecisionTreeClassifier(criterion='gini',
                               min_samples_leaf=5,
                               min_samples_split=5,
                               max_depth=None,
                               random_state=0)
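For reference, a split like the one above (roughly 2:1) could be produced with scikit-learn's `train_test_split`; a minimal sketch on a stand-in frame (the `test_size` value and the toy columns are assumptions, not the question's actual code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the encoded accident data (the real frame isn't available here)
data = pd.DataFrame({"Hour": range(3022), "AccSeverity_A": [1] * 3022})

# Holding out ~33% of 3022 rows reproduces a 2024/998 split
train, test = train_test_split(data, test_size=0.33, random_state=0)
print("Training size: {}; Test size: {}".format(len(train), len(test)))
# Training size: 2024; Test size: 998
```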
3. One-hot encoding
    A one-hot encoder to fix the categorical features

    data = data.copy()
    data = pd.get_dummies(data, columns=['AccSeverity'], prefix = ['AccSeverity'])
    data = pd.get_dummies(data, columns=['Day'], prefix = ['Day'])
    data = pd.get_dummies(data, columns=['Shift'], prefix = ['Shift'])
    print(data.head())
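As an aside, `pd.get_dummies` accepts several columns in one call, which keeps the encoding step shorter; a minimal sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

# Toy frame with two categorical columns (illustrative values only)
toy = pd.DataFrame({"Day": ["D1", "D2", "D1"], "Shift": ["A", "B", "A"]})

# One call encodes both columns at once, with per-column prefixes
encoded = pd.get_dummies(toy, columns=["Day", "Shift"], prefix=["Day", "Shift"])
print(sorted(encoded.columns))  # ['Day_D1', 'Day_D2', 'Shift_A', 'Shift_B']
```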

4. Checking the data after encoding
  

    data.info()

    RangeIndex: 3022 entries, 0 to 3021
    Data columns (total 40 columns):
    SK_Tik_Teuna                3022 non-null int64
    Hour                        3022 non-null int64
    Year                        3022 non-null int64
    Month                       3022 non-null int64
    DriversInvolved             3022 non-null int64
    Jewish                      3022 non-null int64
    UnknownReligon              3022 non-null int64
    NotJewish                   3022 non-null int64
    UnknownCar                  3022 non-null int64
    Else                        3022 non-null int64
    Two_Third_Wheel             3022 non-null int64
    Tender                      3022 non-null int64
    Tractor                     3022 non-null int64
    ATV                         3022 non-null int64
    Unknown                     3022 non-null int64
    Cab                         3022 non-null int64
    CommercialVehicle           3022 non-null int64
    Truck                       3022 non-null int64
    PrivateCar                  3022 non-null int64
    PublicVehicle               3022 non-null int64
    Empty                       3022 non-null int64
    Distric                     3022 non-null int64
    Lighing                     3022 non-null int64
    Urban_NotUrban              3022 non-null int64
    Crossroads_NotCrossroads    3022 non-null int64
    Coordinates                 3022 non-null int64
    Area                        3022 non-null int64
    AccSeverity_A               3022 non-null uint8
    AccSeverity_B               3022 non-null uint8
    AccSeverity_C               3022 non-null uint8
    Day_D1                      3022 non-null uint8
    Day_D2                      3022 non-null uint8
    Day_D3                      3022 non-null uint8
    Day_D4                      3022 non-null uint8
    Day_D5                      3022 non-null uint8
    Day_D6                      3022 non-null uint8
    Day_D7                      3022 non-null uint8
    Shift_A                     3022 non-null uint8
    Shift_B                     3022 non-null uint8
    Shift_C                     3022 non-null uint8
    dtypes: int64(27), uint8(13)

5. Setting the target variable

    x_train = train[features]
    y_train = train["AccSeverity_A"]
    x_test = test[features]
    y_test = test["AccSeverity_A"]

    # train
    dt = c.fit(x_train, y_train)
6. Tree output: building the tree
    import io
    import pydotplus
    import matplotlib.pyplot as plt
    from scipy import misc
    from sklearn.tree import export_graphviz

    def show_tree(tree, features, path):
        # export the fitted tree to Graphviz DOT, render it to PNG, then display it
        f = io.StringIO()
        export_graphviz(tree, out_file=f, feature_names=features)
        pydotplus.graph_from_dot_data(f.getvalue()).write_png(path)
        img = misc.imread(path)
        plt.rcParams["figure.figsize"] = (20, 20)
        plt.imshow(img)
  

    # show the tree
    # show_tree(dt, features, 'dec_tree_01.png')
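Note that `scipy.misc.imread`, used in `show_tree` above, was removed from recent SciPy releases. A sketch of an alternative that needs neither Graphviz nor pydotplus, using `sklearn.tree.plot_tree` (available since scikit-learn 0.21); the iris data here is only a stand-in for the accident frame:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe outside Jupyter
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in model; the question's accident data isn't available here
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Render the tree directly with matplotlib and save it to disk
fig, ax = plt.subplots(figsize=(20, 20))
plot_tree(clf, feature_names=iris.feature_names, filled=True, ax=ax)
fig.savefig("dec_tree_01.png")
```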

#predict
y_pred=c.predict(x_test)
# result after running the line above
y_pred
array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 1, 1], dtype=uint8)

    # accuracy
    from sklearn.metrics import accuracy_score
    score = accuracy_score(y_test, y_pred) * 100
    print("Accuracy using Decision Tree:", round(score, 1), "%")

    # result
    # Accuracy using Decision Tree: 100.0 %
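Whenever a classifier scores 100%, it is worth comparing against a majority-class baseline first: the predictions above are mostly 1, so even a trivial model would score high. A sketch with scikit-learn's `DummyClassifier` on synthetic, imbalanced labels (the 80/20 split is an assumption for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # features carry no signal at all
y = np.array([1] * 80 + [0] * 20)       # imbalanced labels, like the output above

# Always predict the most frequent class, ignoring the features
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
score = accuracy_score(y, baseline.predict(X)) * 100
print("Baseline accuracy:", round(score, 1), "%")  # 80.0 % just by predicting 1
```

Any model should be judged against this floor, not against 0%.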

1 Answer:

Answer 0 (score: 0)

Your target variable is part of your input.

Of course you get 100% if the answer is already in the features...

Your tree probably consists of a single node equivalent to:

    return AccSeverity_A
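In other words, every one-hot column derived from the target must be dropped from the feature list before fitting. A sketch of that filter (the shortened `features` list is hypothetical, for illustration only):

```python
# Hypothetical, shortened feature list for illustration
features = ["Hour", "Year", "AccSeverity_A", "AccSeverity_B",
            "AccSeverity_C", "Day_D1"]

# Remove every column produced by one-hot encoding the target
target_prefix = "AccSeverity"
clean_features = [f for f in features if not f.startswith(target_prefix)]
print(clean_features)  # ['Hour', 'Year', 'Day_D1']
```

With the leakage removed, the resulting accuracy will be lower but meaningful.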