Start here: http://www.nltk.org/book/ch06.html (section 1.3).
The author quotes an accuracy of .81.
Sure, random.shuffle introduces some randomness, but no matter how many times I run it, I cannot get above .73.
(Another odd thing: the author claims that word_features below contains the 2000 most frequent words, but it does not; compare it with list(all_words.most_common(2000)). A quick check follows the code below.)
import nltk
import random
from nltk.corpus import movie_reviews

# Pair each review's word list with its category (pos/neg) label.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Frequency distribution over all lower-cased words in the corpus.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document, words_to_use=word_features):
    # Binary bag-of-words features: does the document contain each word?
    document_words = set(document)
    features = {}
    for word in words_to_use:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
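
For reference, the mismatch noted above can be checked directly. A minimal sketch, reusing all_words from the code above (the comparison uses sets, since only membership matters, not ordering):

top_by_slice = set(list(all_words)[:2000])
top_by_count = set(w for w, _ in all_words.most_common(2000))
# If the slice really were the 2000 most frequent words, the overlap would be 2000.
print(len(top_by_slice & top_by_count))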
Answer (score: 0):
Compared with [word for word, _ in all_words.most_common(2000)], list(all_words)[:2000] is most certainly not the top 2000 most frequent words.
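
To see what the frequency-ranked list does to the score, the feature sets can be rebuilt with it. A sketch reusing the names from the question's code; note that words_to_use must be passed explicitly, because the function's default was bound to the old word_features at definition time:

word_features = [word for word, _ in all_words.most_common(2000)]
featuresets = [(document_features(d, words_to_use=word_features), c)
               for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))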
And no, the accuracy is not worse because most of the 2000 most frequent words ("the", "a", "he", ...) carry little information about the category: sure, those stopwords are not very useful (they are less informative), but there are only a handful of them (~30), and the remaining ~1900 are still good, relevant features.
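
To get a feel for how many of the frequency-ranked top 2000 are stopwords, a rough sketch using NLTK's English stopword list (requires nltk.download('stopwords'); the exact count depends on the corpus and tokenization):

from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))
top_2000 = [w for w, _ in all_words.most_common(2000)]
# Count how many of the top-2000 words appear on the stopword list.
print(sum(1 for w in top_2000 if w in english_stopwords))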
What I would suggest in this case: control random.shuffle(documents) (for example, by seeding the generator, or removing the shuffle) so that the document order is guaranteed, and check which versions you are running, since FreqDist's iteration order has changed between NLTK releases:

python --version
python -c "import nltk; print(nltk.__version__)"
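
If you keep the shuffle, one way to pin it down is to seed the random number generator first. A minimal sketch (the seed value 42 is arbitrary):

import random

random.seed(42)  # any fixed seed makes the shuffle reproducible
random.shuffle(documents)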