Implementing Naive Bayes for Twitter data

Time: 2015-04-17 23:20:49

Tags: python scikit-learn

I have a set of tweets that contain keywords related to vaccine perception. These include words such as:

[jab, shot, measles, MMR, vaccine, autism,...]

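I would like to classify new tweets as pro-vaccine, anti-vaccine, or neither. I understand that Naive Bayes is one way to do this.

I would prefer to use the scikit-learn library for the classification algorithm, because its implementations are more robust than anything I could write myself.

How do I implement Naive Bayes? From the scikit-learn website, my options appear to be multinomial and Gaussian, but I am not sure which one to use.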

1 answer:

Answer 0: (score: 1)

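Below is a simple implementation of a classifier that distinguishes between 5 diseases.

It uses two files:

  1. Training file (train.txt)

  2. Test file (test.txt)

Applied to your question, the training file would contain your tweets, one per line, and the test file would contain the tweets you want to classify.

    [Note: you could also load the dataset from a CSV or JSON file; I used text files to keep things simple.]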
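As a rough illustration of the CSV option mentioned in the note, the sketch below reads texts and labels from the same rows, so the label list does not have to be kept in sync with the line order of a separate file. The column names `text` and `label` and the sample rows are assumptions made up for this example; an in-memory string stands in for a real file.

```python
import csv
import io

# Hypothetical CSV layout: one tweet per row, with "text" and "label" columns.
# With a real file this would be open("tweets.csv", newline="", encoding="utf8").
csv_data = io.StringIO(
    "text,label\n"
    "\"got my MMR shot today\",pro\n"
    "\"the jab causes autism\",anti\n"
)

texts, labels = [], []
for row in csv.DictReader(csv_data):
    texts.append(row["text"])    # document to vectorize
    labels.append(row["label"])  # class for that document

print(texts)
print(labels)
```

The resulting `texts` and `labels` lists can be passed to `CountVectorizer.fit_transform` and `MultinomialNB.fit` in place of the file handle and the hard-coded tag list.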

    Contents of the training file: [train.txt]

    A highly contagious virus spread by coughing, sneezing or direct contact with skin lesions.
    A contagious liver disease often caused by consuming contaminated food or water. It is the most common vaccine-preventable travel disease.
    A serious liver disease spread through contact with blood or body fluids. The hepatitis B virus can cause liver cancer and possible death.
    A group of over 100 viruses that spreads through sexual contact. HPV strains may cause genital warts and lead to cervical cancer.
    A potentially fatal bacterial infection that strikes an average of 1,500 Americans annually.
    

    Contents of the test file: [test.txt]

    died due to liver cancer.
    

    Classification code: [classifier.py]

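    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    trainfile = 'train.txt'
    testfile = 'test.txt'

    # Read one document per line; skip blank lines so documents and tags stay aligned.
    with open(trainfile, encoding='utf8') as f:
        train_docs = [line.strip() for line in f if line.strip()]
    with open(testfile, encoding='utf8') as f:
        test_docs = [line.strip() for line in f if line.strip()]

    # Turn each document into a vector of word counts.
    word_vectorizer = CountVectorizer(analyzer='word')
    trainset = word_vectorizer.fit_transform(train_docs)

    # One label per training line, in the same order as train.txt.
    tags = ['CHICKEN POX', 'HEPATITIS A', 'HEPATITIS B', 'Human papillomavirus', 'MENINGITIS']

    mnb = MultinomialNB()
    mnb.fit(trainset, tags)

    # Reuse the fitted vocabulary for the test documents.
    testset = word_vectorizer.transform(test_docs)
    results = mnb.predict(testset)
    print(results)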
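
On the multinomial versus Gaussian question: `MultinomialNB` is designed for discrete features such as word counts, which is what `CountVectorizer` produces, while `GaussianNB` models each feature as a continuous, normally distributed value. For bag-of-words tweet features the multinomial variant is therefore the more natural starting point. The sketch below adapts the same approach to the three classes from the question. The tweets and the label names `pro`, `anti`, and `neither` are invented for illustration only; a usable model would need far more labelled data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-labelled tweets (made up for this sketch).
train_tweets = [
    "got my MMR shot today, vaccines save lives",
    "measles is back because people skip the vaccine",
    "the MMR jab causes autism, do not vaccinate",
    "vaccines are poison, say no to the shot",
    "took a shot at goal and missed",
    "jab and cross combo at boxing class",
]
train_labels = ["pro", "pro", "anti", "anti", "neither", "neither"]

# Chain vectorizer and classifier so new text goes through the same vocabulary.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_tweets, train_labels)

new_tweets = ["vaccines save lives, get the MMR shot"]
predictions = model.predict(new_tweets)
probabilities = model.predict_proba(new_tweets)

# Class order of the probability columns comes from the fitted classifier.
classes = model.named_steps["multinomialnb"].classes_
print(predictions)
print(dict(zip(classes, probabilities[0].round(3))))
```

Looking at `predict_proba` alongside `predict` shows how confident the model is in each class, which can help when deciding whether a tweet should be treated as "neither".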