I have the following code to estimate the probability that a string of text belongs to a particular class (positive or negative).
import pickle
from nltk.util import ngrams

# Load the pickled classifier
classifier0 = open("C:/Users/ned/Desktop/gherkin.pickle", "rb")
classifier = pickle.load(classifier0)

words = ['boring', 'and', 'stupid', 'movie']

# Bag-of-words features: each token maps to True
feats = dict([(word, True) for word in words])

classifier.classify(feats)                # most likely label
probs = classifier.prob_classify(feats)   # probability distribution over labels
for sample in ('neg', 'pos'):
    print('%s probability: %s' % (sample, probs.prob(sample)))
It produces the following output:
neg probability: 0.944
pos probability: 0.055
[Finished in 24.7s]
The pickled classifier I am loading was already trained using n-grams.
My question is: how can I edit this code so that n-grams are incorporated into the probability estimate?
Answer 0 (score: 2)
Add the n-grams to your feature dictionary...
import pickle
from nltk.util import ngrams

# Load the pickled classifier
fin = open("C:/Users/ned/Desktop/gherkin.pickle", "rb")
classifier = pickle.load(fin)

words = ['boring', 'and', 'stupid', 'movie']

# Feature keys are the unigram strings plus bigram and trigram tuples
ngram_list = words + list(ngrams(words, 2)) + list(ngrams(words, 3))
feats = dict([(word, True) for word in ngram_list])

dist = classifier.prob_classify(feats)
for sample in dist.samples():
    print("%s probability: %f" % (sample, dist.prob(sample)))
Example output...
$ python movie-classifer-example.py
neg probability: 0.999138
pos probability: 0.000862
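If you also control the training step, it can help to build the unigram-plus-n-gram feature dict with a single helper and reuse it for both training and classification, so the features match what the pickled model saw. The sketch below is illustrative only; the helper name ngram_feats and its max_n parameter are not part of the original answer.

from nltk.util import ngrams

def ngram_feats(words, max_n=3):
    # Unigrams: each token maps to True
    feats = {word: True for word in words}
    # Add 2-grams up to max_n-grams as tuple keys
    for n in range(2, max_n + 1):
        for gram in ngrams(words, n):
            feats[gram] = True
    return feats

feats = ngram_feats(['boring', 'and', 'stupid', 'movie'])
dist = classifier.prob_classify(feats)   # `classifier` is the model unpickled above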
Answer 1 (score: 0)
Depending on the n-gram classifier (the value of n used for training), you can generate the corresponding n-grams and classify them with the classifier to obtain these probabilities.
To generate the new instances, use the following example (bigrams and trigrams only):
import nltk

text = "boring and stupid movie"         # example text; replace with your own
words = nltk.word_tokenize(text)         # or supply your own list of tokens
bigrams = list(nltk.bigrams(words))      # list of 2-tuples
trigrams = list(nltk.trigrams(words))    # list of 3-tuples
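From there, a minimal sketch of feeding these n-grams to the classifier, following the same feature-dict pattern as the answer above (it assumes `classifier` is the unpickled model from the question):

# Combine tokens, bigrams and trigrams into one feature dict
feats = {w: True for w in words}
feats.update({bg: True for bg in bigrams})
feats.update({tg: True for tg in trigrams})

dist = classifier.prob_classify(feats)
for label in dist.samples():
    print("%s probability: %f" % (label, dist.prob(label)))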