Storing the most informative features of an NLTK NaiveBayesClassifier in a list

Date: 2017-03-23 08:23:58

Tags: python scikit-learn classification nltk naivebayes

I am trying out this Naive Bayes classifier in Python:

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set) * 100))
classifier.show_most_informative_features(5)

and I get the following output:

[Console output screenshot]

I can clearly see which words appear in the "important" and "spam" categories, but I can't make use of those values. What I actually want is a list that looks like this:

[[pass,important],[respective,spam],[investment,spam],[internet,spam],[understands,spam]]

I'm new to Python and am having a hard time figuring all of this out. Can anyone help? I'd be very grateful.

3 Answers:

Answer 0 (score: 2):

You can slightly modify the source code of show_most_informative_features to suit your purpose.

The first element of each sub-list corresponds to the name of the most informative feature, while the second element corresponds to its label (more specifically, the label associated with the numerator term of the probability ratio).
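For illustration, here is a minimal sketch (assuming classifier is an already-trained nltk.NaiveBayesClassifier) of inspecting the per-label probabilities behind that ratio for a single feature:

# Peek at P(feature value | label) for the single most informative feature.
# The label with the larger probability is the "numerator" label that the
# helper function below reports.
cpdist = classifier._feature_probdist
fname, fval = classifier.most_informative_features(1)[0]
for label in classifier._labels:
    print(label, cpdist[label, fname].prob(fval))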

The helper function:

def show_most_informative_features_in_list(classifier, n=10):
    """
    Return a nested list of the "most informative" features
    used by the classifier, along with their predominant labels.
    """
    cpdist = classifier._feature_probdist       # probability distribution for feature values given labels
    feature_list = []
    for (fname, fval) in classifier.most_informative_features(n):
        def labelprob(l):
            return cpdist[l, fname].prob(fval)
        labels = sorted([l for l in classifier._labels if fval in cpdist[l, fname].samples()], 
                        key=labelprob)
        feature_list.append([fname, labels[-1]])
    return feature_list

Testing it on a classifier trained over NLTK's positive/negative movie review corpus:

show_most_informative_features_in_list(classifier, 10)

produces:

[['outstanding', 'pos'],
 ['ludicrous', 'neg'],
 ['avoids', 'pos'],
 ['astounding', 'pos'],
 ['idiotic', 'neg'],
 ['atrocious', 'neg'],
 ['offbeat', 'pos'],
 ['fascination', 'pos'],
 ['symbol', 'pos'],
 ['animators', 'pos']]

Answer 1 (score: 1):

Simply use most_informative_features().

Using the example from Classification using movie review corpus in NLTK/Python:



import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
# take the 100 most frequent words as features (in Python 3, FreqDist.keys()
# is not sorted by frequency, so use most_common)
word_features = [w for w, _ in word_features.most_common(100)]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)

Then, simply:

print(classifier.most_informative_features())

[OUT]:


[('turturro', True),
 ('inhabiting', True),
 ('taboo', True),
 ('conflicted', True),
 ('overacts', True),
 ('rescued', True),
 ('stepdaughter', True),
 ('apologizing', True),
 ('pup', True),
 ('inform', True)]

And to list all of the features:

classifier.most_informative_features(n=len(word_features))

[OUT]:

[('turturro', True),
 ('inhabiting', True),
 ('taboo', True),
 ('conflicted', True),
 ('overacts', True),
 ('rescued', True),
 ('stepdaughter', True),
 ('apologizing', True),
 ('pup', True),
 ('inform', True),
 ('commercially', True),
 ('utilize', True),
 ('gratuitous', True),
 ('visible', True),
 ('internet', True),
 ('disillusioned', True),
 ('boost', True),
 ('preventing', True),
 ('built', True),
 ('repairs', True),
 ('overplaying', True),
 ('election', True),
 ('caterer', True),
 ('decks', True),
 ('retiring', True),
 ('pivot', True),
 ('outwitting', True),
 ('solace', True),
 ('benches', True),
 ('terrorizes', True),
 ('billboard', True),
 ('catalogue', True),
 ('clean', True),
 ('skits', True),
 ('nice', True),
 ('feature', True),
 ('must', True),
 ('withdrawn', True),
 ('indulgence', True),
 ('tribal', True),
 ('freeman', True),
 ('must', False),
 ('nice', False),
 ('feature', False),
 ('gratuitous', False),
 ('turturro', False),
 ('built', False),
 ('internet', False),
 ('rescued', False),
 ('clean', False),
 ('overacts', False),
 ('gregor', False),
 ('conflicted', False),
 ('taboo', False),
 ('inhabiting', False),
 ('utilize', False),
 ('churns', False),
 ('boost', False),
 ('stepdaughter', False),
 ('complementary', False),
 ('gleiberman', False),
 ('skylar', False),
 ('kirkpatrick', False),
 ('hardship', False),
 ('election', False),
 ('inform', False),
 ('disillusioned', False),
 ('visible', False),
 ('commercially', False),
 ('frosted', False),
 ('pup', False),
 ('apologizing', False),
 ('freeman', False),
 ('preventing', False),
 ('nutsy', False),
 ('intrinsics', False),
 ('somalia', False),
 ('coordinators', False),
 ('strengthening', False),
 ('impatience', False),
 ('subtely', False),
 ('426', False),
 ('schreber', False),
 ('brimley', False),
 ('motherload', False),
 ('creepily', False),
 ('perturbed', False),
 ('accountants', False),
 ('beringer', False),
 ('scrubs', False),
 ('1830s', False),
 ('analogue', False),
 ('espouses', False),
 ('xv', False),
 ('skits', False),
 ('solace', False),
 ('reduncancy', False),
 ('parenthood', False),
 ('insulators', False),
 ('mccoll', False)]

To clarify:

>>> type(classifier.most_informative_features(n=len(word_features)))
list
>>> type(classifier.most_informative_features(10)[0][1])
bool

To further clarify: if the labels used in the feature set are strings, most_informative_features() will return a string instead, e.g.

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
# take the 100 most frequent words as features (in Python 3, FreqDist.keys()
# is not sorted by frequency, so use most_common)
word_features = [w for w, _ in word_features.most_common(100)]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:'positive' if (i in tokens) else 'negative'  for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)
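
A quick check of the return type under this string-valued feature set (a sketch, assuming the training above completed):

# with string feature values, the second element of each returned tuple
# is now a str such as 'positive' rather than a bool
print(classifier.most_informative_features(5))
print(type(classifier.most_informative_features(5)[0][1]))  # <class 'str'>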

Answer 2 (score: 0):

For Naive Bayes, the most useful features (the most discriminative or differentiating tokens) will be the ones whose p(word | class) differs most between the two classes.

You first have to do some text manipulation and tokenization so that you end up with two lists: one list of all tokens present in every string labeled class A, and another of all tokens present in every string labeled class B. These two lists should contain repeated tokens, which we can count to build frequency distributions; a sketch of that preprocessing step is shown below.
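A minimal sketch of that preprocessing, assuming the raw labeled strings live in two hypothetical lists classA_texts and classB_texts (names introduced here for illustration, not from the original answer):

import nltk

# Hypothetical toy inputs; in practice these are your labeled documents.
# (nltk.word_tokenize needs the 'punkt' models: nltk.download('punkt'))
classA_texts = ["this is an important message", "please pass this along"]
classB_texts = ["free investment offer", "cheap internet deals"]

# Tokenize everything, keeping duplicates so the frequency
# distributions in the next snippet can count them.
classAWords = [tok.lower() for text in classA_texts
               for tok in nltk.word_tokenize(text)]
classBWords = [tok.lower() for text in classB_texts
               for tok in nltk.word_tokenize(text)]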

Then run the following code:

classA_freq_distribution = nltk.FreqDist(classAWords)
classB_freq_distribution = nltk.FreqDist(classBWords)
# take the 3000 most frequent tokens from each class (in Python 3,
# FreqDist.keys() is not sorted by frequency, so use most_common)
classA_word_features = [w for w, _ in classA_freq_distribution.most_common(3000)]
classB_word_features = [w for w, _ in classB_freq_distribution.most_common(3000)]

This grabs the top 3000 features from each list, though you can choose a number other than 3000. Now that you have frequency distributions, you can compute p(word | class) and then look at the differences between the two classes:

import pandas as pd  # used to tabulate the results below

diff = []
features = []
for feature in classA_word_features:
    features.append(feature)
    # p(word | classB) - p(word | classA), estimated from relative frequencies
    diff.append(classB_freq_distribution[feature] / len(classBWords)
                - classA_freq_distribution[feature] / len(classAWords))

all_features = pd.DataFrame({
    'Feature': features,
    'Diff': diff
})

You can then sort and look at the words with the highest and lowest values:

sorted_features = all_features.sort_values(by=['Diff'], ascending=False)  # avoid shadowing the built-in sorted()
print(sorted_features)
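
To tie this back to the format asked for in the question, a hedged sketch (using the DataFrame above; 'classA'/'classB' are placeholder label names) of emitting [[word, label], ...] pairs:

# Positive Diff means the word is relatively more frequent in class B,
# negative Diff means class A.
pairs = [[row.Feature, 'classB' if row.Diff > 0 else 'classA']
         for row in sorted_features.head(5).itertuples()]
print(pairs)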