Question

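I am trying to run a Multinomial Naive Bayes classifier on a dataset and to get the most important features for the positive and negative classes.

Here is what I have done so far.

The data is already split into train, CV, and test sets.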

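One-hot encoded the categorical features and normalized the numerical features.
Applied bag-of-words encoding to several other features.
Stacked all of these with hstack, so now we have x_train, x_cv, x_test, with data matrix shapes (78743, 22402) (78743,) (38785, 22402) (38785,) (57888, 22402) (57888,).
Used GridSearchCV and plotted the ROC curve to find the best alpha.
Then plotted some error plots and obtained the confusion matrix, etc.
Now I am trying to extract the top 10 features for the positive and the negative class.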

I extracted them with the following code:

class_labels = nb.classes_
feature_names = vectorizer_bow.get_feature_names()
top_neg_class = sorted(zip(nb.feature_count_[0], feature_names), reverse=True)[:10]
top_pos_class = sorted(zip(nb.feature_count_[1], feature_names), reverse=True)[:10]

print("Important words in positive reviews")
for coef, feature in top_pos_class:
    print(class_labels[1], coef, feature)
print("Important words in negative reviews")
for coef, feature in top_neg_class:
    print(class_labels[0], coef, feature)

This gave me results in the following format:

Important words in positive reviews
1 22048.0 active
1 20245.0 after
1 16905.0 adding
1 15594.0 adventures
1 14733.0 actively
1 14272.0 actors
1 6827.0 feeding
1 6527.0 soft
1 6511.0 wide
1 6367.0 adaptive

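I have no way to compare these and confirm that they really are the top ten.

I have tried other approaches but could not get them to work, so I am looking for help implementing the following logic: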

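1. Get the indices of the largest values.
2. We already know which feature blocks were stacked and their dimensions. For each index, check which block's range it falls into and print the corresponding word.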
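Step 1 can be sketched in plain Python as below. This is only an illustration under assumptions: `scores` is a made-up list standing in for one class's row of per-feature values (with a fitted scikit-learn `MultinomialNB` that row might be `nb.feature_log_prob_[1]` or `nb.feature_count_[1]`), and `top_k_indices` is a hypothetical helper name. Ranking column positions instead of zipping values with a name list also sidesteps a possible pitfall: `zip` stops at the shorter input, so a name list that covers only one of the stacked blocks would be paired with the wrong columns.

```python
import heapq

def top_k_indices(scores, k):
    """Return the positions of the k largest scores, largest first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Toy per-feature scores for one class (hypothetical values).
scores = [0.5, 3.0, 1.2, 7.5, 0.1, 4.4]
top = top_k_indices(scores, 3)
print(top)  # [3, 5, 1]
```

The returned positions are column indices of the stacked matrix, which is what step 2 needs.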

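Suppose we have text + essay + category, where text has 120 dimensions, essay has 120, and category has 10. They are stacked with hstack in that order (text + essay + category), giving 120 + 120 + 10 = 250 dimensions. Let the indices of the important words be 10, 196, and 243. Then index 10 should be looked up in text, 196 in essay, and 243 in category.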
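The lookup described in this example could be sketched as follows. It assumes the blocks are listed in exactly the order they were passed to hstack, and the block names and the helper `locate` are illustrative only, not part of any library.

```python
from bisect import bisect_right
from itertools import accumulate

# Blocks in the same order they were passed to hstack (sizes from the example).
blocks = [("text", 120), ("essay", 120), ("category", 10)]

# Starting column of each block in the stacked matrix: [0, 120, 240]
starts = [0] + list(accumulate(size for _, size in blocks))[:-1]
total = sum(size for _, size in blocks)

def locate(global_index):
    """Map a column of the stacked matrix to (block name, index within block)."""
    if not 0 <= global_index < total:
        raise IndexError(f"column {global_index} is outside 0..{total - 1}")
    block_id = bisect_right(starts, global_index) - 1
    return blocks[block_id][0], global_index - starts[block_id]

for idx in (10, 196, 243):
    print(idx, locate(idx))
# 10 ('text', 10)
# 196 ('essay', 76)
# 243 ('category', 3)
```

The index within the block can then be used to look up the word in that block's own list of feature names, for example the names returned by the vectorizer that produced the block.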

Multinomial NaiveBayes top feature selection with HSTACK

0 answers: