NLTK, Naive Bayes: why are some features None?

Asked: 2016-04-22 13:12:07

Tags: python nltk naivebayes

I am trying to implement Naive Bayes classification with NLTK.

When I print out the most informative features, some of them have the value None. Why is that?

I am using the bag-of-words model: when I print the feature sets, every feature is assigned True.

Where does the None come from?

I read the following:

The feature value 'None' is reserved for unseen feature values;

here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html

What does this mean?

import collections
import itertools

import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bag_of_words(words):
    # Presence-only encoding: every word maps to the feature value True.
    return dict((word, True) for word in words)

def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

def bag_of_words_without_stopwords(words):
    # Remove German stop words before building the feature dict.
    badwords = stopwords.words("german")
    return bag_of_words_not_in_set(words, badwords)

def label_feats_from_corpus(corp, feature_detector=bag_of_words_without_stopwords):
    # Build {label: [featureset, ...]} with one featureset per file.
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats

def split_label_feats(lfeats, split=0.75):
    # Split each label's featuresets into 75% training and 25% test data.
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats


reader = CategorizedPlaintextCorpusReader('D:/corpus/', r'.*\.txt', cat_pattern=r'(\w+)/*')

all_words = nltk.FreqDist(w.lower() for w in reader.words())  # note: never used below

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # Add the n highest-scoring bigrams (chi-squared) to the unigram features.
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict((ngram, True) for ngram in itertools.chain(words, bigrams))

bigrams = bigram_word_feats(reader.words())  # note: never used below

lfeats = label_feats_from_corpus(reader)

train_feats, test_feats = split_label_feats(lfeats, split=0.75)
print(len(train_feats))
nb_classifier = NaiveBayesClassifier.train(train_feats)


print("------------------------")
acc = accuracy(nb_classifier, test_feats)
print(acc)
print("------------------------")
feats = nb_classifier.most_informative_features(n=25)
for feat in feats:
    print(feat)  # some feature values are None

print("------------------------")
nb_classifier.show_most_informative_features(n=25)  # some feature values are None
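
As a quick check, the (name, value) pairs returned by most_informative_features can be filtered to count how many of the top entries carry the value None; a small sketch against the nb_classifier built above:

# Count how many of the top 200 informative features have the value None.
top = nb_classifier.most_informative_features(200)
none_feats = [name for name, value in top if value is None]
print(len(none_feats), "of", len(top), "top features have the value None")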

1 Answer:

Answer 0 (score: 2):

I think the full docstring of the NaiveBayesClassifier class explains it:

If the classifier encounters an input with a feature that has never been seen with any label, then rather than assigning a probability of 0 to all labels, it will ignore that feature.

The feature value 'None' is reserved for unseen feature values; you generally should not use 'None' as a feature value for one of your own features.

If your data contains a feature that was never associated with a label, that feature's value will be None. Suppose you train a classifier with features W and X, and then classify something with features W, X, and Z. The value None will be used for feature Z, because that feature was never seen in training.
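
A minimal sketch of that behavior (the toy documents and words below are made up for illustration): during training, NLTK's NaiveBayesClassifier fills in the value None for any feature that is missing from some of a label's documents, so the absence of a word can itself show up as an informative feature.

from nltk.classify import NaiveBayesClassifier

# Toy data: 'great' never occurs in a 'neg' document.
train = [
    ({'great': True, 'film': True}, 'pos'),
    ({'film': True}, 'pos'),
    ({'boring': True, 'film': True}, 'neg'),
    ({'film': True}, 'neg'),
]
clf = NaiveBayesClassifier.train(train)
clf.show_most_informative_features(5)
# The output should include a line along the lines of
#     great = None    neg : pos = 1.7 : 1.0
# i.e. the value None ('great' is absent from the document) favors 'neg'.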

Further explanation:

Seeing None does not surprise me, because language data is sparse. In a corpus of movie reviews, there will be words that occur in only 1 or 2 documents. For example, an actor's name, or a word from the title, might appear in only 1 review.

It is common to remove frequent (stop) words and infrequent words from a corpus before analysis. For their topic model of Science, Blei and Lafferty (2007) write: "The total vocabulary size in this collection is 375,144 terms. We trim the 356,195 terms that occurred fewer than 70 times as well as 296 stop words."
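
That kind of pruning can be sketched in NLTK with a frequency distribution; the min_count=5 threshold below is an arbitrary value for illustration, not a recommendation:

from nltk import FreqDist
from nltk.corpus import stopwords

def prune_vocabulary(words, min_count=5, language="german"):
    # Keep words that occur at least min_count times and are not stop words.
    counts = FreqDist(w.lower() for w in words)
    stops = set(stopwords.words(language))
    return {w for w, c in counts.items() if c >= min_count and w not in stops}

# Usage against the reader defined in the question:
# vocab = prune_vocabulary(reader.words())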