Question

I am posting here after searching on this question. I found many links that propose a solution, but none of them explains how to actually do it.

So, this is how I understand using tf-idf with the Naive Bayes formula:

Naive Bayes formula:

P(word|class)=(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_all_classes(basically vocabulary of words in the entire training set))

The tf-idf weights can be used in the formula above as follows:

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class) 

total_unique_words_in_all_classes : as is.

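This question has been posted on Stack Overflow many times, but so far without a substantive answer. I want to know whether the way I am thinking about the problem, i.e. the implementation I described above, is correct. I need to know because I am implementing Naive Bayes myself, without the help of any Python library that has built-in functions for Naive Bayes and tf-idf. What I really want is to improve the accuracy (currently 30%) of a classifier trained with Naive Bayes. So if there are better ways to reach good accuracy, suggestions are welcome.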

Please advise. I am new to this field.

Answer 1

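It would be better if you gave us the exact features and classes you want to use, or at least an example. Since nothing specific was given, I will assume the following is your problem: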

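You have a number of documents, each containing a number of words.
You want to classify the documents into categories.
Your feature vector consists of all possible words across all documents and holds the count of each word in each document.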

Your solution

The tf-idf scheme you gave is as follows:

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)

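Your approach sounds reasonable. The probabilities sum to 1 regardless of the tf-idf function, and the features reflect the tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.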
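The scheme from the question could be sketched roughly as follows in plain Python. This is only an illustration under assumptions: the function names (`tfidf_weights`, `train`, `predict`), the toy documents, and the particular tf-idf variant (raw term count times log(N / document frequency)) are made up for the example and are not taken from the question or from any library.

```python
import math
from collections import Counter, defaultdict

def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per tokenized document.

    Assumed variant: raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return [
        {w: c * math.log(n_docs / doc_freq[w]) for w, c in Counter(doc).items()}
        for doc in docs
    ]

def train(docs, labels):
    """Estimate log priors and smoothed log P(word | class) from tf-idf sums."""
    vocab = sorted({w for doc in docs for w in doc})
    class_weight = defaultdict(Counter)  # class -> word -> summed tf-idf
    for weights, label in zip(tfidf_weights(docs), labels):
        class_weight[label].update(weights)

    class_counts = Counter(labels)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(class_weight[c].values())  # "total_words_in_class"
        denom = total + len(vocab)             # plus vocabulary size
        log_likelihood[c] = {
            w: math.log((class_weight[c][w] + 1) / denom) for w in vocab
        }
    return log_prior, log_likelihood

def predict(model, doc):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy data, only to show the call pattern.
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "watches"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train(docs, labels)
print(predict(model, ["cheap", "watches", "buy"]))
print(predict(model, ["meeting", "notes"]))
```

Working in log space avoids underflow when a document has many words, and summing the smoothed probabilities over the whole vocabulary gives 1 for each class, whatever tf-idf variant is plugged in.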

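Another possible solution

It took me a while to wrap my head around this problem. The main reason was the worry about keeping the probabilities normalized. Using Gaussian Naive Bayes lets you sidestep that issue entirely.

If you want to use this method: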

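Compute the mean and variance of the tf-idf values for each class.
Compute the per-feature likelihood from a Gaussian distribution with that mean and variance.
Proceed as usual (multiply by the class prior) and predict.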
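The three steps above might look like this in code. It is a minimal sketch using only the standard library rather than numpy, and it assumes each document has already been turned into a fixed-length tf-idf vector. The names `fit_gaussian_nb`, `log_gaussian`, `predict_gaussian_nb`, the `var_floor` parameter, and the toy vectors are all hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-3):
    """X: list of equal-length tf-idf vectors, y: class labels.

    var_floor is an assumed small constant added to every variance so that
    a feature that is constant within a class does not give zero variance.
    """
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gaussian_nb(model, x):
    """Add the log prior to the per-feature log likelihoods and take the best class."""
    scores = {
        label: log_prior + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances)
        )
        for label, (log_prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical two-feature tf-idf vectors, only to show the call pattern.
X = [[1.4, 0.0], [1.1, 0.2], [0.0, 1.3], [0.1, 1.0]]
y = ["spam", "spam", "ham", "ham"]
gnb = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(gnb, [1.2, 0.1]))
print(predict_gaussian_nb(gnb, [0.0, 1.2]))
```

Because each feature gets its own continuous density, the tf-idf values do not need to behave like counts, which is why the normalization concern goes away.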

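Hard-coding this should not be too difficult, since numpy inherently has a Gaussian function. I simply prefer this kind of generic solution for problems like these.

Other ways to increase accuracy

Besides the above, you can also use the following techniques to increase accuracy: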

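Preprocessing:
1. Feature reduction (usually NMF, PCA or LDA)
2. Additional features

Algorithm:
Naive Bayes is fast but tends to perform worse than other algorithms. It may be better to perform feature reduction and then switch to a discriminative model such as SVM or logistic regression.

Miscellaneous:
Bootstrapping, boosting, etc. Be careful not to overfit...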

Hope this helps. Leave a comment if anything is unclear.

Answer 2

P(word|class) = (word_count_in_class + 1)/(total_words_in_class + total_unique_words_in_all_classes (basically the vocabulary of words in the entire training set))

Does this sum to 1? If the conditional probability above is used, I assume the sum is

P(word1|class) + P(word2|class) + ... + P(wordn|class) = (total_words_in_class + total_unique_words_in_class)/(total_words_in_class + total_unique_words_in_all_classes)

To correct this, I think P(word|class) should be

(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_classes(vocabulary of words in class))

Please correct me if I am wrong.
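A quick numeric check may help here. Assuming the sum runs over every word in the training vocabulary, including words that never occur in the class, the original formula does sum to 1. The expression (total_words_in_class + total_unique_words_in_class)/(total_words_in_class + total_unique_words_in_all_classes) is what you get when the sum covers only the words seen in that class. The toy data and the helper name `smoothed_probs` below are made up for the illustration.

```python
from collections import Counter

# Hypothetical toy training set: all words of each class, concatenated.
class_words = {
    "spam": ["buy", "cheap", "pills", "buy"],
    "ham": ["meeting", "notes", "buy"],
}
vocab = sorted({w for words in class_words.values() for w in words})

def smoothed_probs(words, vocab):
    """(count + 1) / (total words in class + vocabulary size) for every vocab word."""
    counts = Counter(words)
    denom = len(words) + len(vocab)
    return {w: (counts[w] + 1) / denom for w in vocab}

full_sums = {}  # sum over the whole vocabulary
seen_sums = {}  # sum over words that occur in the class only
for label, words in class_words.items():
    probs = smoothed_probs(words, vocab)
    full_sums[label] = sum(probs.values())
    seen_sums[label] = sum(probs[w] for w in set(words))

print(full_sums)
print(seen_sums)
```

For "spam" the class has 4 words, 3 of them unique, and the vocabulary has 5 words, so the seen-words sum is (4 + 3)/(4 + 5) = 7/9. The two unseen words each contribute 1/9, which brings the total to 1. Replacing the denominator with the class-only vocabulary would leave no probability mass for words that appear only in other classes.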

Answer 3

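I think there are two ways to do it:

Round the tf-idf values down to integers, then use a multinomial distribution for the conditional probabilities. See this paper: https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf.
Use a Dirichlet distribution, which is the continuous version of the multinomial distribution, for the conditional probabilities.

I am not sure whether a Gaussian mixture would be better.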

How do I use tf-idf with Naive Bayes?
