How to interpret the "most informative features" in the nltk package

Date: 2019-06-02 14:52:31

Tags: python nlp

I'm new to NLP and am struggling to interpret the results when I look at the most informative features in a simple NLP classification example. Specifically, in the common example shown below, I don't understand why the word "this" is informative when it appears in 3/5 of the negative-sentiment sentences and 3/5 of the positive ones?

train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg'),
]

from nltk.tokenize import word_tokenize # or use some other tokenizer
# build the vocabulary from lowercased tokens
all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
# one boolean "is this word present?" feature per vocabulary word
# (note that the membership test does not lowercase the tokens of x[0])
t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

import nltk
classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features()

Here are the results:

Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                     not = False             pos : neg    =      1.2 : 1.0
                      do = False             pos : neg    =      1.2 : 1.0
                    very = False             neg : pos    =      1.2 : 1.0
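
For reference, the per-label probabilities behind these ratios can be inspected directly; a minimal sketch, with the caveat that _feature_probdist is a private nltk attribute and may change between versions:

# P(this = True | label) for each label, from the classifier's
# smoothed per-label feature distributions
for label in classifier.labels():
    print(label, classifier._feature_probdist[label, 'this'].prob(True))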

Any ideas? I would appreciate an explanation of the formula used to compute a word's probability/informativeness.

I also tried this super-simple example:

train = [
    ('love', 'pos'),
    ('love', 'pos'),
    ('love', 'pos'),
    ('bad', 'pos'),
    ('bad', 'pos'),
    ('bad', 'neg'),
    ('bad', 'neg'),
    ('bad', 'neg'),
    ('bad', 'neg'),
    ('love', 'neg'),
]

and got the following:


Most Informative Features
                     bad = False             pos : neg    =      2.3 : 1.0
                    love = True              pos : neg    =      2.3 : 1.0
                    love = False             neg : pos    =      1.8 : 1.0
                     bad = True              neg : pos    =      1.8 : 1.0

The direction seems right, but the ratios don't match any likelihood ratio I can compute by hand.

1 Answer:

Answer 0 (score: 0)

From the source of show_most_informative_features() in nltk's documentation:

    The informativeness of a feature (fname, fval) is equal to the highest value of P(fname = fval | label), for any label, divided by the lowest value of P(fname = fval | label), for any label.
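
Read literally, that definition is the max/min ratio of the per-label conditional probabilities. A minimal sketch of it, again via the private _feature_probdist attribute, so treat it as illustrative rather than nltk's actual implementation:

def informativeness(classifier, fname, fval):
    # P(fname = fval | label), for every label the classifier knows
    probs = [classifier._feature_probdist[label, fname].prob(fval)
             for label in classifier.labels()]
    return max(probs) / min(probs)

# e.g. informativeness(classifier, 'this', True) comes out around 2.33,
# which show_most_informative_features() prints as "neg : pos = 2.3 : 1.0"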

However, in your case there simply aren't enough data points to estimate these probabilities reliably; the probability distributions come out roughly flat, as you can see from the features' raw weight values. That is probably why seemingly irrelevant features get labelled as most informative. If you experiment with just 3-4 additional sentences, you will notice the rankings change.
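
For what it's worth, the exact ratios above can be reproduced by hand under two assumptions. First, nltk.NaiveBayesClassifier.train() defaults to the ELE estimator, i.e. Lidstone smoothing with gamma = 0.5 (unless you pass a different estimator). Second, in the first example all_words is lowercased but the membership test word in word_tokenize(x[0]) is not, so the feature 'this' is True only for the sentences containing a lowercase 'this': 1 positive and 3 negative. A minimal sketch under those assumptions:

# Reproducing "this = True    neg : pos = 2.3 : 1.0" by hand, assuming
# the default ELE estimator (Lidstone smoothing, gamma = 0.5) with
# bins = 2 (the observed feature values True/False)
gamma, bins = 0.5, 2
p_neg = (3 + gamma) / (5 + bins * gamma)  # 'this' True in 3 of 5 neg -> 3.5/6
p_pos = (1 + gamma) / (5 + bins * gamma)  # 'this' True in 1 of 5 pos -> 1.5/6
print(p_neg / p_pos)  # 2.333..., printed as "neg : pos = 2.3 : 1.0"

The same arithmetic with counts 3-vs-1 and 4-vs-2 reproduces the 2.3 and 1.8 ratios in the word-level example, and it explains why the hand-computed raw likelihood ratios (3.0 and 2.0) did not match: with only five examples per label, the +0.5 smoothing shifts the ratios noticeably, which is the "too few data points" effect described above.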