如何在决策树中找到导致错误分类的规则

时间:2017-12-16 07:12:49

标签: machine-learning scikit-learn classification random-forest decision-tree

我构建了一个二元决策树分类器。从混淆矩阵中我发现类0被错误分类495次而类1被错误分类134次。我想找出决策树中哪些规则实际上导致记录错误分类。

简而言之,哪个记录在哪个树节点失败

是否有机器学习方法可用于在决策树中查找导致其错误分类的规则

混淆矩阵

[[14226   495]
 [  134  3271]]

安装决策树并绘制

cv = CountVectorizer( max_features = 200,analyzer='word',ngram_range=(1, 3)) 
cv_addr = cv.fit_transform(data.pop('Clean_addr'))

for i, col in enumerate(cv.get_feature_names()):
    data[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)

train = data.drop(['Resi], axis=1)
Y = data['Resi']
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.3,random_state =8)
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(X_train, y_train)

    dt=DecisionTreeClassifier(class_weight="balanced", min_samples_leaf=30)
    fit_decision=dt.fit(X_train_res,y_train_res)
    from sklearn.externals.six import StringIO  
    from IPython.display import Image  
    from sklearn.tree import export_graphviz
    import pydotplus
    dot_data = StringIO()
    export_graphviz(fit_decision, out_file=dot_data,  
                    filled=True, rounded=True,
                    special_characters=True,feature_names=train.columns)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
    Image(graph.create_png())from sklearn.externals.six import StringIO  
    from IPython.display import Image  
    from sklearn.tree import export_graphviz
    import pydotplus
    dot_data = StringIO()
    export_graphviz(fit_decision, out_file=dot_data,  
                    filled=True, rounded=True,
                    special_characters=True,feature_names=train.columns)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
    Image(graph.create_png())

感谢任何帮助。

Dtree Plot

Decision Tree Image

数据集

https://drive.google.com/open?id=1NhXfwBIB640wJ30AyPKFnbIECCdmpyi5

Resi是目标列。使用我想要预测的其他数据列,我已经计算了Clean_addr列的数据。

0 个答案:

没有答案