I think I more or less understand Naive Bayes, but I have a few questions about implementing it for a simple binary text classification.
Say that a document D_i is made up of words from a vocabulary x_1, x_2, ... x_n, and that there are two classes c_i that any document can fall into. For some input document D, I want to compute P(c_i|D), which is proportional to P(D|c_i)P(c_i).
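(One step the computation below relies on, stated explicitly: under the naive independence assumption, P(D|c_i) = P(x_1|c_i) * P(x_2|c_i) * ... * P(x_n|c_i), so P(c_i|D) is proportional to P(c_i) times the product of the per-word probabilities.)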
I have three questions:
1. Is P(c_i) #docs in c_i / #total docs, or #words in c_i / #total words?
2. Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
3. If some x_j does not exist in the training set, do I give it a probability of 1 so that it doesn't change the calculations?
For example, let's say that I have a training set of:
training = [("hello world", "good"),
            ("bye world", "bad")]
So the classes would have
good_class = {"hello": 1, "world": 1}
bad_class = {"bye": 1, "world": 1}
all = {"hello": 1, "world": 2, "bye":1}
So now, if I want to compute the probability of a test string
test1 = ["hello", "again"]
p_good = sum(good_class.values())/sum(all.values())
p_hello_good = good_class["hello"]/all["hello"]
p_again_good = 1 # because "again" doesn't exist in our training set
p_test1_good = p_good * p_hello_good * p_again_good
Answer (score: 1):
Since this question is quite broad, I can only answer it in a limited way:
1st: Is P(c_i) #docs in c_i / #total docs, or #words in c_i / #total words?
P(c_i) = #docs in c_i / #total docs
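A minimal sketch of that prior on the toy training set above (names like doc_counts are my own, not from the question):

from collections import Counter

training = [("hello world", "good"),
            ("bye world", "bad")]

# Prior: #docs labeled c_i / #total docs
doc_counts = Counter(label for _, label in training)
priors = {label: count / len(training) for label, count in doc_counts.items()}
print(priors)  # {'good': 0.5, 'bad': 0.5}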
2nd: Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
As @larsmans noted, it is the number of occurrences of the word in documents of that class divided by the total number of words in that class over the whole dataset:
P(x_j|c_i) = #times x_j appears in c_i / #total words in c_i
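Continuing the sketch under the same assumptions, that unsmoothed likelihood would be:

from collections import Counter, defaultdict

training = [("hello world", "good"),
            ("bye world", "bad")]

# Per-class word counts over the whole training set
word_counts = defaultdict(Counter)
for text, label in training:
    word_counts[label].update(text.split())

def likelihood(word, label):
    # P(x_j|c_i) = #times x_j appears in c_i / #total words in c_i
    return word_counts[label][word] / sum(word_counts[label].values())

print(likelihood("hello", "good"))  # 0.5, not good_class["hello"] / all["hello"] == 1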
3rd: If some x_j does not exist in the training set, do I give it a probability of 1 so that it doesn't change the calculations?
No; for that we have the Laplace correction (additive smoothing). It is applied as
P(x_j|c_i) = (#times x_j appears in c_i + 1) / (#total words in c_i + |V|)
where |V| is the vocabulary size. This neutralizes the effect of features that never occur in a class, instead of letting a single zero count wipe out the whole product.
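Putting the pieces together, here is a rough end-to-end sketch that applies the smoothed formula to the question's test1 = ["hello", "again"] (unnormalized scores, proportional to P(c_i|D); helper names are mine):

from collections import Counter, defaultdict

training = [("hello world", "good"),
            ("bye world", "bad")]
test1 = ["hello", "again"]

doc_counts = Counter(label for _, label in training)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in training:
    words = text.split()
    word_counts[label].update(words)
    vocab.update(words)

def smoothed_likelihood(word, label):
    # (#times x_j appears in c_i + 1) / (#total words in c_i + |V|)
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + len(vocab))

def score(doc, label):
    # P(c_i) times the product of smoothed per-word likelihoods
    p = doc_counts[label] / len(training)
    for word in doc:
        p *= smoothed_likelihood(word, label)
    return p

for label in doc_counts:
    print(label, score(test1, label))
# good: 0.5 * (2/5) * (1/5) = 0.04
# bad:  0.5 * (1/5) * (1/5) = 0.02

Note that the unseen word "again" now contributes 1/5 to both classes rather than forcing a zero (or a made-up 1).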