How do I decode the features of a decision tree to see which features are important?

Asked: 2018-11-27 20:47:33

Tags: python scikit-learn decision-tree encoder

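I have a dataset whose categorical features I am converting to numeric features for a decision tree. The conversion is applied to the entire dataframe and consists of the following lines: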

le = LE()
df = df.apply(le.fit_transform)

Later I take the encoded data and split it into training and testing sets like this:

target = ['label']
df_y = df['label']
df_x = df.drop(target, axis=1)

# Split into training and testing data
train_x, test_x, train_y, test_y = tts(df_x, df_y, test_size=0.3, random_state=42)  

I then pass it to a method that trains the decision tree:

def Decision_Tree_Classifier(train_x, train_y, test_x, test_y, le):
    print " - Candidate: Decision Tree Classifier"
    dec_tree_classifier = DecisionTreeClassifier(random_state=0) # Load Module
    dec_tree_classifier.fit(train_x, train_y) # Fit
    accuracy = dec_tree_classifier.score(test_x, test_y) # Acc
    predicted = dec_tree_classifier.predict(test_x)
    mse = mean_squared_error(test_y, predicted)

    tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))
    print "Tree Features:"
    print tree_feat
    print "Tree Thresholds:"
    print dec_tree_classifier.tree_.threshold

    scores = cross_val_score(dec_tree_classifier, test_x, test_y.values.ravel(), cv=10)
    return (accuracy, mse, scores.mean(), scores.std())

In the method above, I pass in the LabelEncoder object that was originally used to encode the dataframe. I have the line

tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))

which tries to convert the features back to their original categorical representation, but I keep getting this stack trace:

  File "<ipython-input-6-c2005f8661bc>", line 1, in <module>
    runfile('main.py', wdir='/Users/mydir)

  File "/Users/me/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)

  File "/Users/me/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 100, in execfile
    builtins.execfile(filename, *where)

  File "/Users/me/mydir/main.py", line 125, in <module>
    main()  # Run main routine

  File "candidates.py", line 175, in get_baseline
    dec_tre_acc = Decision_Tree_Classifier(train_x, train_y, test_x, test_y, le)

  File "candidates.py", line 40, in Decision_Tree_Classifier
    tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))

  File "/Users/me/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 281, in inverse_transform
    "y contains previously unseen labels: %s" % str(diff))

ValueError: y contains previously unseen labels: [-2]

What do I need to change to be able to see the actual features?

1 Answer:

Answer 0 (score: 1)

When you do this:

df = df.apply(le.fit_transform)

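you are using a single LabelEncoder instance for all columns. Every time fit() or fit_transform() is called, le forgets the previous data and learns only the current data. So the le you end up with stores information only about the last column it saw, not about all columns.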
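To illustrate, here is a minimal sketch on a small made-up dataframe (the column names and values are assumptions for illustration, not taken from the question). After `df.apply(le.fit_transform)`, `le.classes_` should contain only the categories of the last column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; "color" and "size" are illustrative names.
df = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["S", "M", "L"],
})

le = LabelEncoder()
# apply() calls le.fit_transform once per column, refitting le each time.
encoded = df.apply(le.fit_transform)

# Only the categories of the last column ("size") are still stored.
last_column_classes = list(le.classes_)
print(last_column_classes)
```

Calling `le.inverse_transform` on codes from the "color" column would therefore decode them with the "size" categories, or fail if a code is out of range.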

There are several ways to address this:

  1. You can maintain multiple LabelEncoder objects (one per column). See this excellent answer here:

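  2. If you want to keep a single object that handles all columns, and you have a recent version of scikit-learn installed (0.20 or later), you can use OrdinalEncoder: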

    from sklearn.preprocessing import OrdinalEncoder
    enc = OrdinalEncoder()
    
    df = enc.fit_transform(df)
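
The first option (one LabelEncoder per column) could look roughly like the sketch below. The toy dataframe and the `encoders` dictionary are assumptions made for illustration; with the question's data you would loop over the real columns instead.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the question's dataframe.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "L", "S"],
})

encoders = {}                          # column name -> fitted LabelEncoder
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(df[col])
    encoders[col] = enc                # keep the encoder so the column can be decoded later

# Decode each column with the encoder that was fitted on it.
restored = pd.DataFrame(
    {col: encoders[col].inverse_transform(encoded[col]) for col in encoded.columns},
    index=df.index,
)
print(restored)
```

Keeping the encoders keyed by column name makes it clear which mapping applies to which column when you later interpret the tree.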
    

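But this still will not fix the error, because tree_.feature does not hold feature values. It holds the index of the feature (the column in df) that is used for the split at each node. So if your data has 3 features (columns), then regardless of the values in those columns, tree_.feature can contain the following values: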

  • 0,1,2,-2

  • -2 is a special placeholder value indicating that the node is a leaf, so no feature is used to split anything there.
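
So instead of calling inverse_transform on tree_.feature, you can look the indices up in the list of column names and treat negative values as leaves. A minimal sketch, assuming made-up feature names and a tiny dataset (with the question's data, `feature_names` would be something like `list(train_x.columns)`):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical column names and already-encoded toy data.
feature_names = ["color", "size", "shape"]
X = np.array([
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
])
y = np.array([0, 1, 0, 1])  # determined entirely by the first column

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# tree_.feature holds one column index per node; negative values (-2) mark leaves.
node_features = [
    feature_names[i] if i >= 0 else "leaf"
    for i in clf.tree_.feature
]
print(node_features)
```

If the goal is simply to rank features, `clf.feature_importances_` gives one importance value per column in the same order as the columns, so it can be paired with the column names in the same way.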

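tree_.threshold holds values that correspond to the values in your data. They are floats, however, so you will have to map them back yourself according to your category-to-number conversion.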
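
One way to do that mapping, sketched under the assumption that the columns were encoded with OrdinalEncoder and that the toy data below stands in for the real dataframe: a node sends samples with `value <= threshold` to the left child, so the categories whose integer code is at or below the threshold go left and the rest go right.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the label is 0 only when color is "blue".
df_x = pd.DataFrame({
    "color": ["blue", "green", "red", "blue", "green", "red"],
    "size": ["L", "S", "L", "S", "L", "S"],
})
y = [0, 1, 1, 0, 1, 1]

enc = OrdinalEncoder()
X = enc.fit_transform(df_x)  # enc.categories_ keeps one category array per column
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

tree = clf.tree_
splits = []
for node in range(tree.node_count):
    idx = tree.feature[node]
    if idx < 0:              # leaf node, nothing to decode
        continue
    cats = enc.categories_[idx]
    codes = np.arange(len(cats))
    goes_left = list(cats[codes <= tree.threshold[node]])
    goes_right = list(cats[codes > tree.threshold[node]])
    splits.append((df_x.columns[idx], goes_left, goes_right))

for name, left, right in splits:
    print(name, "left:", left, "right:", right)
```

Keep in mind that the integer codes follow the sorted order of the categories, so a threshold split groups categories by that artificial order rather than by any real relationship between them.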

See the following example for a detailed look at the structure of the tree: