Understanding the algorithm - Multinomial Naive Bayes

Date: 2016-04-25 08:09:27

Tags: java algorithm naivebayes

I have been introduced to the Naive Bayes classification method (Multinomial NB), with reference to how Michael Sipser describes it in his book "The Theory of Computation".

I am studying the algorithms described for training and applying Multinomial NB, which are presented as follows:

[Image: pseudocode of the training and application algorithms]

However, I get lost when interpreting certain aspects of the algorithm. For example, on line 6 of TRAINMULTINOMIALNB(C,D):

  • What exactly does CONCATENATE_TEXT_OF_ALL_DOCS_IN_CLASS(D,C) do?

So far, I understand it as follows. Suppose we have three documents in each of the classes "movies" and "songs":

MOVIES
    DOC1 = "big fish"
    DOC2 = "big lebowski"
    DOC3 = "mystic river"

SONGS
    DOC1 = "purple rain"
    DOC2 = "crying in the rain"
    DOC3 = "anaconda"    

After applying CONCATENATE_TEXT_OF_ALL_DOCS_IN_CLASS(D,C), would you then be left with, say, the strings:

String concatenatedMovies = "big fish big lebowski mystic river";
String concatenatedSongs = "purple rain crying in the rain anaconda";

Is this correct? Any help in understanding this is greatly appreciated.

1 answer:

Answer 0 (score: 1)

In the end, you want to be able to classify some text based on its content, so you want to be able to say whether it is SONGS or MOVIES, etc.
To do that with Bayes (or any other method), you first use your training data to build a model.

First, you compute the priors (docs in class / total docs) on line 5. Then you compute the conditional probabilities (the probability of the word "fish" given the class MOVIES, the probability of the word "rain" given the class SONGS) on lines 7-10. You simply divide the number of occurrences of the term in the class by the total number of terms in the class, with add-one smoothing (+1 on each count, which also adds the vocabulary size to the denominator). That is why you concatenate: to be able to count all occurrences of a term in a class.
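
Here is a minimal Java sketch of that training step on the MOVIES/SONGS example from the question. It assumes plain whitespace tokenization, and the class and method names (MultinomialNbTrain, concatenate, countTokens, vocabulary, condProb) are made up for illustration; they are not taken from the pseudocode.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MultinomialNbTrain {
    // Joins all documents of one class into a single space-separated string.
    static String concatenate(List<String> docs) {
        return String.join(" ", docs);
    }

    // Counts how often each whitespace-separated token occurs in the text
    // (assumption: tokens are split on whitespace only).
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Collects the distinct tokens over all class texts.
    static Set<String> vocabulary(String... classTexts) {
        Set<String> vocab = new TreeSet<>();
        for (String text : classTexts) {
            vocab.addAll(countTokens(text).keySet());
        }
        return vocab;
    }

    // P(term | class) with add-one smoothing:
    // (count of term in class + 1) / (tokens in class + vocabulary size).
    static double condProb(String term, String classText, Set<String> vocab) {
        Map<String, Integer> counts = countTokens(classText);
        int totalTokens = 0;
        for (int n : counts.values()) {
            totalTokens += n;
        }
        return (counts.getOrDefault(term, 0) + 1.0) / (totalTokens + vocab.size());
    }

    public static void main(String[] args) {
        List<String> movies = Arrays.asList("big fish", "big lebowski", "mystic river");
        List<String> songs = Arrays.asList("purple rain", "crying in the rain", "anaconda");

        String movieText = concatenate(movies); // "big fish big lebowski mystic river"
        String songText = concatenate(songs);   // "purple rain crying in the rain anaconda"
        Set<String> vocab = vocabulary(movieText, songText); // 11 distinct terms

        // Prior: docs in class / total docs = 3 / 6.
        double priorMovies = (double) movies.size() / (movies.size() + songs.size());
        System.out.println("P(MOVIES) = " + priorMovies);

        // "rain" occurs twice among the 7 SONGS tokens: (2 + 1) / (7 + 11).
        System.out.println("P(rain | SONGS) = " + condProb("rain", songText, vocab));
        // "big" occurs twice among the 6 MOVIES tokens: (2 + 1) / (6 + 11).
        System.out.println("P(big | MOVIES) = " + condProb("big", movieText, vocab));
    }
}
```

Note that "rain" never appears in MOVIES, yet thanks to the +1 it still gets the nonzero probability 1 / 17 there.
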
In the end, you plug these values into Bayes' formula and can categorize an unknown document as MOVIES, SONGS, ...
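
The classification step could look roughly like the sketch below. It is only an illustration under a few assumptions: whitespace tokenization, add-one smoothing, scores summed in log space to avoid underflow, and tokens that never occurred in training simply skipped. The names MultinomialNbApply, countTokens and classify are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class MultinomialNbApply {
    // Counts whitespace-separated tokens in one class's concatenated text.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the class with the highest score
    // log P(class) + sum over the document's tokens of log P(token | class).
    static String classify(String doc,
                           Map<String, String> textByClass,
                           Map<String, Double> priorByClass) {
        // Count terms per class and build the shared vocabulary.
        Map<String, Map<String, Integer>> countsByClass = new HashMap<>();
        Set<String> vocab = new HashSet<>();
        for (Map.Entry<String, String> entry : textByClass.entrySet()) {
            Map<String, Integer> counts = countTokens(entry.getValue());
            countsByClass.put(entry.getKey(), counts);
            vocab.addAll(counts.keySet());
        }

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cls : textByClass.keySet()) {
            Map<String, Integer> counts = countsByClass.get(cls);
            int totalTokens = 0;
            for (int n : counts.values()) {
                totalTokens += n;
            }
            double score = Math.log(priorByClass.get(cls));
            for (String token : doc.trim().split("\\s+")) {
                if (!vocab.contains(token)) {
                    continue; // assumption: ignore tokens unseen in training
                }
                double p = (counts.getOrDefault(token, 0) + 1.0)
                        / (totalTokens + vocab.size());
                score += Math.log(p);
            }
            if (score > bestScore) {
                bestScore = score;
                best = cls;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> textByClass = new HashMap<>();
        textByClass.put("MOVIES", "big fish big lebowski mystic river");
        textByClass.put("SONGS", "purple rain crying in the rain anaconda");

        Map<String, Double> priorByClass = new HashMap<>();
        priorByClass.put("MOVIES", 0.5); // 3 of 6 training docs
        priorByClass.put("SONGS", 0.5);  // 3 of 6 training docs

        System.out.println(classify("rain in the river", textByClass, priorByClass));
    }
}
```

With the toy data from the question, "rain in the river" comes out as SONGS: the three song-like words outweigh the single occurrence of "river" in MOVIES.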