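Training a decision tree with the ID3 algorithm in sklearn

Date: 2018-01-28 09:14:41

Tags: scikit-learn python-3.5 decision-tree cross-validation confusion-matrix

I am trying to train a decision tree using the ID3 algorithm. The goal is to get the indices of the selected features, estimate the occurrence, and build a total confusion matrix.

The algorithm should split the dataset into a training set and a test set, and use 4-fold cross-validation.

I am new to this topic. I have read the sklearn tutorials and the theory behind the learning process, but I am still very confused.

What I have tried: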

from sklearn import tree
from sklearn.model_selection import (cross_val_predict, KFold, cross_val_score,
                                     train_test_split, learning_curve)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix


X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train,y_train)
results = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=4)
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))
y_pred = cross_val_predict(estimator=clf, X=x, y=y, cv=4)
conf_mat = confusion_matrix(y,y_pred)
print(conf_mat)
dot_data = tree.export_graphviz(clf, out_file='tree.dot') 

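I have a few questions:

  1. How can I get the list of indices of the features used in training? Do I have to walk through the tree in clf? I could not find any API method to retrieve them.

  2. Do I have to use 'fit', 'cross_val_score', and 'cross_val_predict'? All of them seem to run some kind of learning process, but I could not get the fitted clf, the accuracy, and the confusion matrix from just one of them.

  3. Do I have to use the test set for the estimation, or the partitions of the dataset folds?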

1 Answer:

Answer 0: (score: 3)

  1. To retrieve the list of features used in the training process, you can simply get the columns from x like this:

    feature_list = x.columns

    As you may know, not every feature is useful for prediction. After training the model, you can see this with

    clf.feature_importances_

    The index of a feature in feature_list is the same as its index in the feature_importances_ array.

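  2. If you use cross-validation, you cannot retrieve the scores immediately. cross_val_score does the job, but a better way to get the scores is cross_validate. It works the same way as cross_val_score, but you can build each score you need with make_scorer and pass several of them at once. Here is an example: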

    import math
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split, cross_validate
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, make_scorer, recall_score
    import pandas as pd, numpy as np
    
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    dtc = DecisionTreeClassifier()
    dtc_fit = dtc.fit(x_train, y_train)
    
    def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
    def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
    def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
    def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]
    
    scoring = {
        'tp' : make_scorer(tp), 
        'tn' : make_scorer(tn), 
        'fp' : make_scorer(fp), 
        'fn' : make_scorer(fn), 
        'accuracy' : make_scorer(accuracy_score),
        'precision': make_scorer(precision_score),
        'f1_score' : make_scorer(f1_score),
        'recall'   : make_scorer(recall_score)
    }
    
    sc = cross_validate(dtc_fit, x_train, y_train, cv=5, scoring=scoring)
    
    print("Accuracy: %0.2f (+/- %0.2f)" % (sc['test_accuracy'].mean(), sc['test_accuracy'].std() * 2))
    print("Precision: %0.2f (+/- %0.2f)" % (sc['test_precision'].mean(), sc['test_precision'].std() * 2))
    print("f1_score: %0.2f (+/- %0.2f)" % (sc['test_f1_score'].mean(), sc['test_f1_score'].std() * 2))
    print("Recall: %0.2f (+/- %0.2f)" % (sc['test_recall'].mean(), sc['test_recall'].std() * 2), "\n")
    
    stp = math.ceil(sc['test_tp'].mean())
    stn = math.ceil(sc['test_tn'].mean())
    sfp = math.ceil(sc['test_fp'].mean())
    sfn = math.ceil(sc['test_fn'].mean())
    
    # Use a separate name so the imported confusion_matrix function is not shadowed
    conf_m = pd.DataFrame(
        [[stn, sfp], [sfn, stp]],
        columns=['Predicted 0', 'Predicted 1'],
        index=['True 0', 'True 1']
    )
    print(conf_m)
    
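  3. When you use the cross_val functions, the function itself creates the folds for testing and training. If you want to manage the train folds and test folds yourself, you can do so with the KFold class. If you need to keep the classes balanced, which is needed to get good scores with DecisionTreeClassifier, you have to use StratifiedKFold. If you want to randomly shuffle the values included in the folds, you can use StratifiedShuffleSplit. Here is an example: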

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, make_scorer, recall_score
    import pandas as pd, numpy as np
    
    precision = []; recall = []; f1score = []; accuracy = []
    
    sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)    
    dtc = DecisionTreeClassifier()
    
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
        dtc.fit(X_train, y_train)
        pred = dtc.predict(X_test)
    
        precision.append(precision_score(y_test, pred))
        recall.append(recall_score(y_test, pred))
        f1score.append(f1_score(y_test, pred))
        accuracy.append(accuracy_score(y_test, pred))   
    
    print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(accuracy),np.std(accuracy) * 2))
    print("Precision: %0.2f (+/- %0.2f)" % (np.mean(precision),np.std(precision) * 2))
    print("f1_score: %0.2f (+/- %0.2f)" % (np.mean(f1score),np.std(f1score) * 2))
    print("Recall: %0.2f (+/- %0.2f)" % (np.mean(recall),np.std(recall) * 2))
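
The per-fold loop above can also accumulate one confusion matrix over all folds, which is one way to get the "total" confusion matrix the question asks about. This is only a minimal sketch under assumptions: the data is a synthetic stand-in generated with make_classification, StratifiedKFold with 4 folds replaces StratifiedShuffleSplit so that every sample is tested exactly once, and the helper name pooled_confusion_matrix is made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): replace with your own X and y.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)


def pooled_confusion_matrix(X, y, n_splits=4, random_state=0):
    """Sum the per-fold confusion matrices of an entropy-based tree.

    Hypothetical helper, not part of scikit-learn.
    """
    labels = np.unique(y)
    total = np.zeros((len(labels), len(labels)), dtype=int)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True,
                          random_state=random_state)
    for train_index, test_index in skf.split(X, y):
        clf = DecisionTreeClassifier(criterion='entropy',
                                     random_state=random_state)
        clf.fit(X[train_index], y[train_index])
        pred = clf.predict(X[test_index])
        # labels= keeps the matrix shape fixed even if a fold misses a class
        total += confusion_matrix(y[test_index], pred, labels=labels)
    return total


conf_total = pooled_confusion_matrix(X, y)
print(conf_total)
```

Because each sample lands in exactly one test fold, the entries of the summed matrix add up to the number of samples, and no rounding of per-fold averages is needed.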
    
  4. I hope I have answered everything you need!
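
Going back to point 1: if what you need are the indices of the features the fitted tree actually splits on, rather than all columns of x, one option is to read them from the fitted estimator's tree_ attribute. The following is a rough sketch under assumptions: the iris dataset stands in for the asker's x and y, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the iris dataset replaces the asker's x and y.
data = load_iris()
x, y = data.data, data.target

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(x, y)

# tree_.feature has one entry per node: the feature index tested at a
# split node, or a negative placeholder value at a leaf.
node_features = clf.tree_.feature
used_indices = np.unique(node_features[node_features >= 0])

# Features with non-zero importance; always a subset of the split features.
important_indices = np.flatnonzero(clf.feature_importances_ > 0)

print("Feature indices used in splits:", used_indices)
print("Feature names:", [data.feature_names[i] for i in used_indices])
print("Indices with non-zero importance:", important_indices)
```

The two index sets usually coincide, but a feature can appear in a split that brings no impurity decrease, in which case it shows up in tree_.feature with zero importance.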