I'm trying this Naive Bayes classifier in Python:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print "Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set)*100)
classifier.show_most_informative_features(5)
I get the following output:
It clearly shows which words appear in the "important" and the "spam" category, but I can't work with these values. I actually want a list that looks like this:
[[pass,important],[respective,spam],[investment,spam],[internet,spam],[understands,spam]]
I'm new to Python and having trouble figuring all of this out. Can anyone help? I'd be very grateful.
Answer 0 (score: 2)
You can slightly modify the source code of show_most_informative_features to suit your purpose.
The first element of each sublist corresponds to the most informative feature name, while the second element corresponds to its label (more specifically, the label associated with the numerator term of the ratio).
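To see where those labels come from, you can inspect the per-label probabilities of a single feature yourself (an illustrative snippet that uses the same private attributes as the helper below; 'outstanding'/True is just an example feature/value pair, not part of the original answer):
cpdist = classifier._feature_probdist  # conditional distribution: P(feature value | label)
for label in classifier._labels:
    # the label with the highest probability here is the one the helper reports
    print(label, cpdist[label, 'outstanding'].prob(True))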
Helper function:
def show_most_informative_features_in_list(classifier, n=10):
    """
    Return a nested list of the "most informative" features
    used by the classifier, along with its predominant labels.
    """
    cpdist = classifier._feature_probdist  # probability distribution for feature values given labels
    feature_list = []
    for (fname, fval) in classifier.most_informative_features(n):
        def labelprob(l):
            return cpdist[l, fname].prob(fval)
        labels = sorted([l for l in classifier._labels if fval in cpdist[l, fname].samples()],
                        key=labelprob)
        feature_list.append([fname, labels[-1]])
    return feature_list
Testing it on a classifier trained over nltk's positive/negative movie review corpus:
show_most_informative_features_in_list(classifier, 10)
produces:
[['outstanding', 'pos'],
['ludicrous', 'neg'],
['avoids', 'pos'],
['astounding', 'pos'],
['idiotic', 'neg'],
['atrocious', 'neg'],
['offbeat', 'pos'],
['fascination', 'pos'],
['symbol', 'pos'],
['animators', 'pos']]
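Since the result is just a list of [feature, label] pairs, you can feed it straight into dict() for lookups (a small usage sketch, not part of the original helper):
# map each informative word to its predominant label
label_for_word = dict(show_most_informative_features_in_list(classifier, 10))
# e.g. label_for_word['outstanding'] == 'pos'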
Answer 1 (score: 1)
Simply use most_informative_features().
Using the example from Classification using movie review corpus in NLTK/Python:
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = list(word_features.keys())[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
Then, simply:
print classifier.most_informative_features()
[OUT]:
[('turturro', True),
('inhabiting', True),
('taboo', True),
('conflicted', True),
('overacts', True),
('rescued', True),
('stepdaughter', True),
('apologizing', True),
('pup', True),
('inform', True)]
And to list all features:
classifier.most_informative_features(n=len(word_features))
[OUT]:
[('turturro', True),
('inhabiting', True),
('taboo', True),
('conflicted', True),
('overacts', True),
('rescued', True),
('stepdaughter', True),
('apologizing', True),
('pup', True),
('inform', True),
('commercially', True),
('utilize', True),
('gratuitous', True),
('visible', True),
('internet', True),
('disillusioned', True),
('boost', True),
('preventing', True),
('built', True),
('repairs', True),
('overplaying', True),
('election', True),
('caterer', True),
('decks', True),
('retiring', True),
('pivot', True),
('outwitting', True),
('solace', True),
('benches', True),
('terrorizes', True),
('billboard', True),
('catalogue', True),
('clean', True),
('skits', True),
('nice', True),
('feature', True),
('must', True),
('withdrawn', True),
('indulgence', True),
('tribal', True),
('freeman', True),
('must', False),
('nice', False),
('feature', False),
('gratuitous', False),
('turturro', False),
('built', False),
('internet', False),
('rescued', False),
('clean', False),
('overacts', False),
('gregor', False),
('conflicted', False),
('taboo', False),
('inhabiting', False),
('utilize', False),
('churns', False),
('boost', False),
('stepdaughter', False),
('complementary', False),
('gleiberman', False),
('skylar', False),
('kirkpatrick', False),
('hardship', False),
('election', False),
('inform', False),
('disillusioned', False),
('visible', False),
('commercially', False),
('frosted', False),
('pup', False),
('apologizing', False),
('freeman', False),
('preventing', False),
('nutsy', False),
('intrinsics', False),
('somalia', False),
('coordinators', False),
('strengthening', False),
('impatience', False),
('subtely', False),
('426', False),
('schreber', False),
('brimley', False),
('motherload', False),
('creepily', False),
('perturbed', False),
('accountants', False),
('beringer', False),
('scrubs', False),
('1830s', False),
('analogue', False),
('espouses', False),
('xv', False),
('skits', False),
('solace', False),
('reduncancy', False),
('parenthood', False),
('insulators', False),
('mccoll', False)]
To clarify:
>>> type(classifier.most_informative_features(n=len(word_features)))
list
>>> type(classifier.most_informative_features(10)[0][1])
bool
To further clarify: if the feature values used in the feature set are strings, most_informative_features() will return them as strings, e.g.:
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = list(word_features.keys())[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
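With this feature set the values are the strings 'positive'/'negative' rather than booleans, so the same type check from above should now give str:
>>> type(classifier.most_informative_features(10)[0][1])
str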
Answer 2 (score: 0)
For Naive Bayes, the most informative features (the most discriminative or differentiating tokens) will be the ones where the difference between p(word | class) for the two classes is largest.
You'll first have to do some text manipulation and tokenization so that you end up with two lists: one list of all tokens present in all strings labeled class A, and another of all tokens present in all strings labeled class B. These two lists should contain repeated tokens, which we can count to create frequency distributions.
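A minimal sketch of that preprocessing step, here assuming the movie review corpus from the answers above (classAWords and classBWords are the names the snippets below expect; any tokenizer will do):
from nltk.corpus import movie_reviews as mr

# keep duplicate tokens: the per-class frequency counts below depend on them
classAWords = [w.lower() for f in mr.fileids('pos') for w in mr.words(f)]
classBWords = [w.lower() for f in mr.fileids('neg') for w in mr.words(f)]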
Then run the following code:
classA_freq_distribution = nltk.FreqDist(classAWords)
classB_freq_distribution = nltk.FreqDist(classBWords)
classA_word_features = list(classA_freq_distribution.keys())[:3000]
classB_word_features = list(classB_freq_distribution.keys())[:3000]
This will grab the first 3000 features from each list, but you could pick a number other than 3000 (note that with NLTK 3, FreqDist.keys() is not ordered by frequency, so you may want most_common(3000) to get the true top features). Now that you have the frequency distributions, you can compute p(word | class) and look at the differences between the two classes:
import pandas as pd

diff = []
features = []
for feature in classA_word_features:
    features.append(feature)
    # unsmoothed estimate of p(word|classB) - p(word|classA)
    diff.append(classB_freq_distribution[feature] / len(classBWords)
                - classA_freq_distribution[feature] / len(classAWords))

all_features = pd.DataFrame({
    'Feature': features,
    'Diff': diff
})
Then you can sort it and look at the highest- and lowest-valued words:
sorted_features = all_features.sort_values(by=['Diff'], ascending=False)  # renamed to avoid shadowing the sorted() builtin
print(sorted_features)
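Words at the top of that frame (largest positive Diff) occur relatively more often in class B, and words at the bottom relatively more often in class A. For instance (a small usage sketch on the DataFrame above):
print(sorted_features.head(10))  # strongest class B indicators
print(sorted_features.tail(10))  # strongest class A indicators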