Question

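This is my first time building a sentiment analysis machine learning model with nltk's NaiveBayesClassifier in Python. I know the model is too simple, but it is only a first step for me, and I will try tokenized sentences next time.

The real issue with my current model is this: I have clearly labeled the word 'bad' as negative in the training data set (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each (lower-cased) sentence in the list ['awesome movie', 'i like it', 'it is so bad'], the classifier mistakenly labeled 'it is so bad' as positive.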

INPUT:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')

def word_feat(word):
    return dict([(word,True)])
# NOTE: the function 'word_feat(word)' defined here is different from the 'word_feats(words)' function defined earlier. It is applied to each of the three elements of the list ['awesome movie', ' i like it', ' it is so bad'].

for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
    print()

OUTPUT:

awesome movie is pos

i like it is pos

it is so bad is pos

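To make sure that the function 'word_feat(word)' is applied to each sentence rather than to each word or letter, I wrote some diagnostic code to see what 'word_feat(word)' returns for each element: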

for word in words:
    print(word_feat(word))

It printed:

{'awesome movie': True}
{' i like it': True}
{' it is so bad': True}

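So it looks like the function 'word_feat(word)' is correct?

Does anyone know why the classifier classified 'it is so bad' as positive? As mentioned above, I have clearly labeled the word 'bad' as negative in my training data.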

Answer 1

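This particular failure happens because your word_feats() function expects a list of words (a tokenized sentence), but you pass it each word separately... so word_feats() iterates over its letters. You have built a classifier that classifies strings as positive or negative on the basis of the letters they contain.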
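As a minimal sketch of the pitfall described here: the helper below behaves like the question's word_feats (rewritten as a dict comprehension), and shows that Python iterates over a plain string character by character, so a string argument yields one feature per letter.

```python
def word_feats(words):
    # Same behavior as the question's helper: one {item: True} entry per item.
    return {word: True for word in words}

# A list of tokens yields one feature per word.
from_list = word_feats(["it", "is", "so", "bad"])
print(from_list)    # {'it': True, 'is': True, 'so': True, 'bad': True}

# A plain string is iterated character by character, so the features are letters.
from_string = word_feats("bad")
print(from_string)  # {'b': True, 'a': True, 'd': True}
```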

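You are probably in this predicament because you pay no attention to how you name your variables. In your main loop, none of the variables sentence, words, or word contains what its name claims. To understand and improve your program, start by naming things properly.

Bugs aside, this is not how you build a sentiment classifier. The training data should be a list of tokenized sentences (each labeled with its sentiment), not a list of individual words. Likewise, what you classify should be tokenized sentences.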
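A rough sketch of that setup, assuming nltk is installed. The labeled sentences are made up for illustration and are not from the question; whitespace splitting stands in for a real tokenizer.

```python
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    """Map a list of tokens to the {token: True} feature dict used in the question."""
    return {word: True for word in words}

# Hypothetical hand-labeled training sentences (an assumption of this sketch).
labeled_sentences = [
    ("awesome movie", "pos"),
    ("the actors did a great job", "pos"),
    ("it is so bad", "neg"),
    ("i hate this terrible sound", "neg"),
]

# One training instance per tokenized sentence, labeled with its sentiment.
train_set = [(word_feats(text.split()), label) for text, label in labeled_sentences]
classifier = NaiveBayesClassifier.train(train_set)

# Classification uses the same tokenization and the same feature function.
label = classifier.classify(word_feats("what a great movie".split()))
print(label)
```

With only four sentences the prediction is not meaningful; the point is the shape of the data, where each training instance is a whole tokenized sentence rather than a vocabulary list.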

Answer 2

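Let me show a rewrite of your code. The only change near the top is adding import re, because tokenizing is easier with a regular expression. Everything else up to the definition of classifier is the same as in your code.

I added one more test case (something really, really negative), but more importantly I used proper variable names, which makes it much harder to get confused about what is going on: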

test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."
sentences = test_data.lower().split('.')

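So sentences now contains four sentence strings (plus a trailing empty string after the final period, which the loop below skips). I kept your word_feat() function unchanged.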
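A quick check of what split('.') actually returns for this test string may help. This is a sketch only; the stripping step at the end is an optional cleanup and is not part of the answer's code.

```python
test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."

raw_parts = test_data.lower().split('.')
print(len(raw_parts))        # 5: four sentences plus '' after the final period
print(repr(raw_parts[1]))    # ' i like it' (note the leading space)
print(repr(raw_parts[-1]))   # ''

# Optional cleanup: strip whitespace and drop empty pieces.
sentences = [part.strip() for part in raw_parts if part.strip()]
print(sentences)
```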

For the part that uses the classifier, I did a fairly large rewrite:

for sentence in sentences:
    if len(sentence) == 0:
        continue  # skip the empty string left after the final period
    neg = 0
    pos = 0
    for word in re.findall(r"[\w']+", sentence):
        classResult = classifier.classify(word_feat(word))
        print(word, classResult)
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
    print("\n%s: %d vs -%d\n"%(sentence,pos,neg))

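The outer loop is again descriptive, so sentence contains one sentence.

Then there is an inner loop in which we classify each word of the sentence; I am using a regular expression to split the sentence on whitespace and punctuation: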

 for word in re.findall(r"[\w']+", sentence):
     classResult = classifier.classify(word_feat(word))
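To illustrate what that pattern keeps and drops, here is a small standalone sketch. The sample strings other than " it is so bad" are invented for the example.

```python
import re

TOKEN_PATTERN = r"[\w']+"   # runs of word characters or apostrophes

plain = re.findall(TOKEN_PATTERN, " it is so bad")
print(plain)        # ['it', 'is', 'so', 'bad']

# Apostrophes stay inside a token; commas and exclamation marks are dropped.
contraction = re.findall(TOKEN_PATTERN, "don't stop, really!")
print(contraction)  # ["don't", 'stop', 'really']

# Emoticons such as ':)' consist only of punctuation, so they never become tokens.
emoticon = re.findall(TOKEN_PATTERN, "nice :)")
print(emoticon)     # ['nice']
```

One consequence is that the ':)' and ':(' entries in the training vocabulary can never match a token produced this way.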

The rest is just basic adding up and reporting. I get this output:

awesome pos
movie neu

awesome movie: 1 vs -0

i pos
like pos
it pos

 i like it: 3 vs -0

it pos
is neu
so pos
bad neg

 it is so bad: 2 vs -1

i pos
hate neg
this pos
terrible neg
useless neg
movie neu

 i hate this terrible useless movie: 2 vs -3

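I still get the same result as you: "it is so bad" is considered positive. With the extra debug lines we can see why: "it" and "so" are considered positive words, and "bad" is the only negative word, so overall the sentence comes out positive.

I suspect this is because the classifier has not seen those words in its training data.

...yes, if I add "it" and "so" to the list of neutral words, I get "it is so bad: 0 vs -1".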
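The arithmetic behind those two results can be sketched without the classifier. The first list copies the per-word labels from the output above; the second assumes that "it" and "so" come back as 'neu' once they are in the neutral list.

```python
from collections import Counter

def tally(word_labels):
    """Count 'pos' and 'neg' votes; any other label (such as 'neu') is ignored."""
    counts = Counter(label for _, label in word_labels)
    return counts["pos"], counts["neg"]

# Per-word labels as reported in the output above.
before = [("it", "pos"), ("is", "neu"), ("so", "pos"), ("bad", "neg")]
# Labels assumed once 'it' and 'so' are added to the neutral list.
after = [("it", "neu"), ("is", "neu"), ("so", "neu"), ("bad", "neg")]

print(tally(before))  # (2, 1)
print(tally(after))   # (0, 1)
```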

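As next steps, I would suggest:

Try more training data; toy examples like this carry the risk that the noise will swamp the signal.
Look into removing stop words.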
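A minimal sketch of the stop-word suggestion. The STOP_WORDS set here is a tiny hand-written list assumed for illustration; in practice a fuller list such as the NLTK stopwords corpus would likely be used instead.

```python
# Tiny illustrative stop list; an assumption of this sketch, not NLTK's list.
STOP_WORDS = {"i", "it", "is", "so", "this", "the", "was"}

def content_words(sentence):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(content_words("It is so bad"))
# ['bad']
print(content_words("I hate this terrible useless movie"))
# ['hate', 'terrible', 'useless', 'movie']
```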

Answer 3

Here is the modified code:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')   # this is actually a list of sentences

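# Load the stop-word list once; this needs the NLTK 'stopwords' corpus (nltk.download('stopwords')).
stop_words = set(stopwords.words('english'))

for sent in sentences:
    if sent != "":
        # split() with no argument avoids the empty tokens that leading spaces would produce
        words = [word for word in sent.split() if word not in stop_words]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
        print()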

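I changed the places where you were treating a 'list of words' as the input to the classifier. What you actually need to do is pass the input sentence by sentence, which means you need a 'list of sentences'.

Also, for each sentence you need to pass its words as features, which means you need to split the sentence on whitespace.

In addition, if you want the classifier to work properly for sentiment analysis, you need to give less weight to stop words such as "it", "they", "is", and so on, because these words are not enough to decide whether a sentence is positive, negative, or neutral.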

The code above gives the following output:

awesome movie --> pos

 i like it --> pos

 it is so bad --> neg

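So for any classifier, the input format should be the same for training and for prediction. Since you provide a list of words during training, convert your test set in the same way.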
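One way to keep the two formats aligned is to route both training and test text through a single feature function. The sketch below assumes a hypothetical helper named extract_features and two made-up labeled examples.

```python
def extract_features(text):
    """Turn raw text into the {token: True} dict used for both training and prediction."""
    return {token: True for token in text.lower().split()}

# Hypothetical labeled examples, for illustration only.
labeled = [("Awesome movie", "pos"), ("It is so bad", "neg")]
train_set = [(extract_features(text), label) for text, label in labeled]

# The test input goes through exactly the same function.
test_features = extract_features("So bad")

# Because both sides share one format, test features can match training features.
shared = set(test_features) & set(train_set[1][0])
print(sorted(shared))  # ['bad', 'so']
```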

Answer 4

You can try this code:


The result is: pos: 0.7142857142857143, neg: 0.14285714285714285

Why did the NLTK NaiveBayes classifier misclassify one record?
