Question

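I have a Naive Bayes classifier that I wrote in Python using a Pandas data frame, and now I need it in PySpark. My problem is that I need the feature importance of each column. I looked through the PySpark ML documentation and could not find anything on this.

Does anyone know whether I can get feature importances from Naive Bayes in Spark MLlib?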

Below is the code in Python. The feature importances are retrieved using .coef_

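from sklearn.naive_bayes import BernoulliNB

# Replace nulls with 0 and collect the Spark DataFrame into pandas
df = df.fillna(0).toPandas()

X_df = df.drop(['NOT_OPEN', 'unique_id'], axis=1)
X = X_df.values
Y = df['NOT_OPEN'].values  # 1-D label array, as scikit-learn expects

mnb = BernoulliNB(fit_prior=True)
estimator = mnb.fit(X, Y)  # fit once and reuse the fitted model
y_pred = estimator.predict(X)

# coef_: for a binary classification problem this is the log of the estimated probability of a feature given the positive class. Higher values mean more important features for the positive class.
# Note: recent scikit-learn releases no longer expose coef_ on naive Bayes models; feature_log_prob_[1] holds the same values for the positive class.

feature_names = X_df.columns
coefs_with_fns = sorted(zip(estimator.coef_[0], feature_names))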

Answer 1

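If you are interested in an equivalent of coef_, the property you are looking for is NaiveBayesModel.theta

Log of class conditional probabilities.

New in version 2.0.0.

i.e.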

model = ...  # type: NaiveBayesModel

model.theta.toArray()  # type: numpy.ndarray

The resulting array has shape (number-of-classes, number-of-features), and the rows correspond to successive labels.
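To mirror the sorted(zip(...)) step from the question, the rows of that array can be paired with the feature names. The following is only a minimal sketch: the theta values and the feature names are made up for illustration and stand in for model.theta.toArray() and the real column list, and rank_features is a hypothetical helper, not part of PySpark.

```python
import numpy as np

# Hypothetical stand-in for model.theta.toArray(): 2 classes x 3 features.
theta = np.log(np.array([
    [0.2, 0.5, 0.3],   # row 0: label 0
    [0.6, 0.1, 0.3],   # row 1: label 1
]))
feature_names = ["f_a", "f_b", "f_c"]  # hypothetical column names


def rank_features(theta, feature_names, class_index):
    """Return (log_prob, name) pairs for one class, highest log-probability first."""
    row = theta[class_index]
    return sorted(zip(row.tolist(), feature_names), reverse=True)


# Rank features by log P(feature | label 1)
ranked = rank_features(theta, feature_names, class_index=1)
top_names = [name for _, name in ranked]
```

The feature names must be in the same order as the columns that were assembled into the feature vector, otherwise the pairing is meaningless.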

Answer 2

It is better to use the difference
log(P(feature_X | positive)) - log(P(feature_X | negative)) as the feature importance.

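This is because we are interested in the discriminative power of each feature_X (granted, NB is a generative model). An extreme example: some feature_X1 has the same value in all + and - samples, so it has no discriminative power. The probability of that feature value is therefore high for both + and - samples, yet the difference of the log probabilities is 0.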
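The difference described above could be computed from a theta-style array roughly as follows. This is a sketch under assumptions: the probabilities and feature names are invented, row 0 is taken to be the negative class and row 1 the positive class, and log_prob_difference is a hypothetical helper name.

```python
import numpy as np

# Hypothetical log class-conditional probabilities, shape (classes, features).
theta = np.log(np.array([
    [0.9, 0.1, 0.4],   # row 0: negative class
    [0.9, 0.6, 0.1],   # row 1: positive class
]))
feature_names = ["f_same", "f_pos", "f_neg"]  # hypothetical column names


def log_prob_difference(theta, pos_index=1, neg_index=0):
    """log P(feature | positive) - log P(feature | negative), per feature."""
    return theta[pos_index] - theta[neg_index]


diff = log_prob_difference(theta)

# Rank by absolute difference: large magnitude means strong discriminative
# power, the sign says which class the feature points towards.
by_strength = sorted(zip(np.abs(diff).tolist(), feature_names), reverse=True)
```

Here f_same is highly probable in both classes, so its raw log-probability is high, but its difference is 0 and it ends up last in the ranking, which matches the extreme example in the answer.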

Getting feature importances from a PySpark Naive Bayes classifier

2 answers: