NLTK, Naive Bayes: why are some features None?

Asked: 2016-04-22 13:12:07

Tags: python nltk naivebayes

I am trying to implement Naive Bayes classification with NLTK.

When I print out the most informative features, some of them have the value None. Why is that?

I am using the bag-of-words model: when I print the feature sets, every feature is assigned True.

Where does the None come from?

I read the following:

The feature value 'None' is reserved for unseen feature values;

here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html

What does this mean?

import collections
import itertools

import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bag_of_words(words):
    # Presence-only encoding: every word maps to the feature value True.
    return dict((word, True) for word in words)

def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

def bag_of_words_without_stopwords(words):
    # Remove German stop words before building the feature dict.
    badwords = stopwords.words("german")
    return bag_of_words_not_in_set(words, badwords)

def label_feats_from_corpus(corp, feature_detector=bag_of_words_without_stopwords):
    # Build {label: [featureset, ...]} with one featureset per file.
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats

def split_label_feats(lfeats, split=0.75):
    # Split each label's featuresets into 75% training and 25% test data.
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats


reader = CategorizedPlaintextCorpusReader('D:/corpus/', r'.*\.txt', cat_pattern=r'(\w+)/*')

all_words = nltk.FreqDist(w.lower() for w in reader.words())  # note: never used below

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # Add the n highest-scoring bigrams (chi-squared) to the unigram features.
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict((ngram, True) for ngram in itertools.chain(words, bigrams))

bigrams = bigram_word_feats(reader.words())  # note: never used below

lfeats = label_feats_from_corpus(reader)

train_feats, test_feats = split_label_feats(lfeats, split=0.75)
print(len(train_feats))
nb_classifier = NaiveBayesClassifier.train(train_feats)


print("------------------------")
acc = accuracy(nb_classifier, test_feats)
print(acc)
print("------------------------")
feats = nb_classifier.most_informative_features(n=25)
for feat in feats:
    print(feat)  # some feature values are None

print("------------------------")
nb_classifier.show_most_informative_features(n=25)  # some feature values are None
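
As a quick check, the (name, value) pairs returned by most_informative_features can be filtered to count how many of the top entries carry the value None; a small sketch against the nb_classifier built above:

# Count how many of the top 200 informative features have the value None.
top = nb_classifier.most_informative_features(200)
none_feats = [name for name, value in top if value is None]
print(len(none_feats), "of", len(top), "top features have the value None")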

1 Answer:

Answer 0 (score: 2):

I think the full docstring of the NaiveBayesClassifier class explains it:

If the classifier encounters an input with a feature that has never been seen with any label, then rather than assigning a probability of 0 to all labels, it will ignore that feature.

The feature value 'None' is reserved for unseen feature values; you generally should not use 'None' as a feature value for one of your own features.

If your data contains a feature that was never associated with a label, that feature's value will be None. Suppose you train a classifier with features W and X, and then classify something with features W, X, and Z. The value None will be used for feature Z, because that feature was never seen in training.
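
A minimal sketch of that behavior (the toy documents and words below are made up for illustration): during training, NLTK's NaiveBayesClassifier fills in the value None for any feature that is missing from some of a label's documents, so the absence of a word can itself show up as an informative feature.

from nltk.classify import NaiveBayesClassifier

# Toy data: 'great' never occurs in a 'neg' document.
train = [
    ({'great': True, 'film': True}, 'pos'),
    ({'film': True}, 'pos'),
    ({'boring': True, 'film': True}, 'neg'),
    ({'film': True}, 'neg'),
]
clf = NaiveBayesClassifier.train(train)
clf.show_most_informative_features(5)
# The output should include a line along the lines of
#     great = None    neg : pos = 1.7 : 1.0
# i.e. the value None ('great' is absent from the document) favors 'neg'.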

Further explanation:

Seeing None does not surprise me, because language data is sparse. In a corpus of movie reviews, there will be words that occur in only 1 or 2 documents. For example, an actor's name, or a word from the title, might appear in only 1 review.

It is common to remove frequent (stop) words and infrequent words from a corpus before analysis. For their topic model of Science, Blei and Lafferty (2007) write: "The total vocabulary size in this collection is 375,144 terms. We trim the 356,195 terms that occurred fewer than 70 times as well as 296 stop words."
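
That kind of pruning can be sketched in NLTK with a frequency distribution; the min_count=5 threshold below is an arbitrary value for illustration, not a recommendation:

from nltk import FreqDist
from nltk.corpus import stopwords

def prune_vocabulary(words, min_count=5, language="german"):
    # Keep words that occur at least min_count times and are not stop words.
    counts = FreqDist(w.lower() for w in words)
    stops = set(stopwords.words(language))
    return {w for w, c in counts.items() if c >= min_count and w not in stops}

# Usage against the reader defined in the question:
# vocab = prune_vocabulary(reader.words())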