Question

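I'm a beginner when it comes to machine learning, and I'm having trouble interpreting some of the results from my first program. Here is the setup:

I have a dataset of book reviews. Each book can be tagged with any number of qualifiers from a set of about 1600. The people reviewing the books can also tag themselves with these qualifiers, to indicate that they like to read things with that tag.

The dataset has one column per qualifier. For each review, a value of 1 is recorded if a given qualifier tags both the book and the reviewer. If there is no "match" for a given qualifier on a given review, a value of 0 is recorded.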
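To make the encoding concrete, here is a minimal sketch of how one review row could be built. The qualifier names, the `match_row` helper, and the tag sets are all hypothetical and only illustrate the 0/1 "match" rule described above.

```python
# Hypothetical qualifier names; the real dataset has roughly 1600 of them.
qualifiers = ["dragons", "romance", "mystery"]


def match_row(book_tags, reviewer_tags, score):
    """Build one review row: 1 where a qualifier tags both book and reviewer."""
    row = {q: int(q in book_tags and q in reviewer_tags) for q in qualifiers}
    row["Score"] = score
    return row


# The book is tagged dragons + mystery, the reviewer likes dragons + romance,
# so only "dragons" counts as a match.
row = match_row({"dragons", "mystery"}, {"dragons", "romance"}, score=4)
print(row)
```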

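There is also a "Score" column, which holds an integer from 1 to 5 for each review (that review's "star rating"). My goal is to determine which features are most important for getting a high score.

Here is the code I have now (https://gist.github.com/souldeux/99f71087c712c48e50b7):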

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier


def determine_feature_importance(df):
    # Determines the importance of individual features within a dataframe.
    # Grab headers for all feature columns, excluding score & ids
    features_list = df.columns.values[4:]
    print("Features List: \n", features_list)

    # Set X equal to all feature values, excluding Score & ID fields
    X = df.values[:, 4:]

    # Set y equal to all Score values (first column); cast to int so that
    # scikit-learn treats the 1-5 scores as class labels
    y = df.values[:, 0].astype(int)

    # Fit a random forest with near-default parameters to determine feature importance
    print('\nCreating Random Forest Classifier...\n')
    forest = RandomForestClassifier(oob_score=True, n_estimators=10000)
    print('\nFitting Random Forest Classifier...\n')
    forest.fit(X, y)
    feature_importance = forest.feature_importances_
    print(feature_importance)

    # Make importances relative to maximum importance
    print("\nMaximum feature importance is currently: ", feature_importance.max())
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    print("\nNormalized feature importance: \n", feature_importance)
    print("\nNormalized maximum feature importance: \n", feature_importance.max())
    print("\nTo do: set fi_threshold == max?")
    print("\nTesting: setting fi_threshold == 1")
    fi_threshold = 1

    # Get indices of all features over fi_threshold
    important_idx = np.where(feature_importance > fi_threshold)[0]
    print("\nRetrieved important_idx: ", important_idx)

    # Create an array of all feature names above fi_threshold
    important_features = features_list[important_idx]
    print("\n", important_features.shape[0], "Important features (>", fi_threshold,
          "% of max importance):\n", important_features)

    # Get sorted indices of important features (descending importance)
    sorted_idx = np.argsort(feature_importance[important_idx])[::-1]
    print("\nFeatures sorted by importance (DESC):\n", important_features[sorted_idx])

    # Generate a horizontal bar plot, most important feature at the top
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[important_idx][sorted_idx[::-1]], align='center')
    plt.yticks(pos, important_features[sorted_idx[::-1]])
    plt.xlabel('Relative importance')
    plt.ylabel('Variable importance')
    plt.show()

    # Reduce X to the important features, ordered by importance (currently unused)
    X = X[:, important_idx][:, sorted_idx]

    return "Feature importance determined"

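I'm successfully generating a plot, but I'm honestly not sure what the plot means. As I understand it, it shows how strongly any given feature impacts the score variable. But, and I realize this must be a stupid question, how do I know whether the impact is positive or negative?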

Answer 1

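In short, you don't. Decision trees (the building blocks of a random forest) do not work that way. With a linear model there is a simple distinction between a "positive" and a "negative" feature, because the only effect a feature can have on the final result is to be added in, multiplied by its weight. Nothing more. An ensemble of decision trees, however, can learn arbitrarily complex rules for each feature, for example "if the book has a red cover and more than 100 pages, then it gets a high score if it contains dragons" but "if the book has a blue cover and more than 100 pages, then it gets a low score if it contains dragons", and so on.

Feature importance only tells you which features contribute to the decision, not "which way", because sometimes a feature pushes the result one way and sometimes the other.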
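The dragons example can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the three 0/1 columns (`red_cover`, `dragons`, `hardcover`) and the binary `high_score` target are made up, and an ordinary least-squares fit stands in for "a linear model". Because dragons help red-covered books exactly as much as they hurt blue-covered ones, the linear coefficient for `dragons` comes out at essentially zero, while the forest still gives that column far more importance than the irrelevant `hardcover` column.

```python
import itertools

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Hypothetical features, all 0/1: red_cover, dragons, hardcover.
# Every combination appears equally often, so the design is balanced.
X = np.array(list(itertools.product([0, 1], repeat=3)) * 50)
red_cover, dragons = X[:, 0], X[:, 1]

# Dragons help red-covered books and hurt blue-covered ones.
high_score = (red_cover == dragons).astype(int)

linear = LinearRegression().fit(X, high_score)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, high_score)

# The signed linear weights cancel out; the unsigned importances do not.
print("linear coefficients:", linear.coef_)
print("forest importances: ", forest.feature_importances_)
```

The importance value says that `dragons` matters, but nothing in it records that the direction of the effect flips with the cover color.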

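What can you do? You can make an extreme simplification: assume that you are only interested in a feature in complete isolation from the others. Then, once you know which features are important, you can count how many times each feature occurs in each class (each score, in your case). This gives you the distribution

P(gets score X|has feature Y)

which will show you, more or less, whether the feature has a positive or negative impact (after marginalizing over the other features).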
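A minimal sketch of that count with pandas follows. The `reviews` frame, the `dragons` column, and the helper name are hypothetical; the real data would have one such column per qualifier.

```python
import pandas as pd

# Hypothetical toy data: one row per review, a 0/1 match column and the star score.
reviews = pd.DataFrame({
    "score":   [5, 4, 5, 1, 2, 5, 3, 1],
    "dragons": [1, 1, 1, 0, 0, 1, 0, 0],
})


def score_distribution_given_feature(df, feature, score_col="score"):
    """Return P(score = X | feature = 1) as a Series indexed by score."""
    has_feature = df[df[feature] == 1]
    return has_feature[score_col].value_counts(normalize=True).sort_index()


dist_with = score_distribution_given_feature(reviews, "dragons")
baseline = reviews["score"].value_counts(normalize=True).sort_index()
print("P(score | dragons):\n", dist_with)
print("P(score) overall:\n", baseline)

# A single-number summary of the direction: mean score with and without the feature.
mean_with = reviews.loc[reviews["dragons"] == 1, "score"].mean()
mean_without = reviews.loc[reviews["dragons"] == 0, "score"].mean()
print(mean_with, mean_without)
```

Comparing the conditional distribution with the overall one (or the two means) gives a rough sign for the feature, subject to the simplification that all other features are ignored.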

Answer 2

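A random forest can measure the relative importance of any feature in a classification task.

Usually, we measure the loss that would be incurred if we lost the true values of that feature. The values of one feature at a time are scrambled, and the loss in predictive accuracy is measured.

Because each tree is built anew and a random forest is made up of many trees, these values are reliable.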
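The scrambling procedure described here is known as permutation importance. Note that in scikit-learn the `feature_importances_` attribute of a forest is impurity-based, while the shuffle-and-measure variant is available separately as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data in which, by assumption, only the first column determines the label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 3))  # three hypothetical 0/1 match columns
y = X[:, 0]                            # the label depends on column 0 only

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the accuracy drop.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
print(result.importances_mean)
```

Like the impurity-based values, these numbers are unsigned: they say how much accuracy is lost without a feature, not in which direction the feature moves the score.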

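Have a look at this page.

Higher numbers returned by forest.feature_importances_ mean that the features are more important in this classification task.

However, in your case this isn't suitable. I recommend trying a Multinomial Naive Bayes classifier and inspecting feature_log_prob_ after training. This way you can see the probability of each feature given a class, P(x_i | y).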
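A minimal sketch of that suggestion, on made-up data with two hypothetical qualifiers and only two score classes. One caveat worth knowing: for `MultinomialNB`, each row of `feature_log_prob_` is the (smoothed) share of a feature among all feature counts in that class, so the probabilities in a row sum to 1 rather than being independent per-feature rates.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

feature_names = ["dragons", "romance"]  # hypothetical qualifiers
X = np.array([[1, 0], [1, 0], [1, 1], [0, 1],   # reviews scored 5
              [0, 1], [0, 1], [1, 0], [0, 1]])  # reviews scored 1
y = np.array([5, 5, 5, 5, 1, 1, 1, 1])

nb = MultinomialNB().fit(X, y)

# feature_log_prob_ has shape (n_classes, n_features); rows follow nb.classes_.
probs = np.exp(nb.feature_log_prob_)
for cls, row in zip(nb.classes_, probs):
    print(cls, dict(zip(feature_names, row.round(3))))
```

A feature whose probability is higher in the high-score rows than in the low-score rows (here `dragons`) is associated with better scores, which gives the direction that the forest importances lack.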

Interpreting feature importance values from a RandomForestClassifier
