Sentiment analysis of Dutch tweets using the NLTK corpus conll2002

Time: 2017-02-26 21:33:59

Tags: python twitter nltk sentiment-analysis corpus

I need to run sentiment analysis on a list of tweets written in Dutch, and I am using conll2002 for it. Here is the code I am using:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import conll2002
import time

t=time.time()

def word_feats(words):
    return dict([(word, True) for word in words])

#negids = conll2002.fileids('neg')
def train():
    #negids = conll2002.fileids('neg')
    #posids = conll2002.fileids('pos')
    negids = conll2002.fileids()
    posids = conll2002.fileids()

    negfeats = [(word_feats(conll2002.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(word_feats(conll2002.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
    print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

    classifier = NaiveBayesClassifier.train(trainfeats)
    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    classifier.show_most_informative_features()
x=train()
print x
print time.time()-t

The code above runs, but it produces the following output:

train on 8 instances, test on 4 instances
accuracy: 0.5
Most Informative Features
                poderlas = True              pos : neg    =      1.0 : 1.0
                   voert = True              pos : neg    =      1.0 : 1.0
            contundencia = True              pos : neg    =      1.0 : 1.0
          encuestocracia = None              pos : neg    =      1.0 : 1.0
                 alivien = None              pos : neg    =      1.0 : 1.0
                  Bogotá = True              pos : neg    =      1.0 : 1.0
          Especialidades = True              pos : neg    =      1.0 : 1.0
         hoofdredacteurs = True              pos : neg    =      1.0 : 1.0
               quisieron = True              pos : neg    =      1.0 : 1.0
               asciendan = None              pos : neg    =      1.0 : 1.0
None
9.21083234

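The pos : neg ratio is 1.0 : 1.0 for every feature. How can I fix this? I think the problem may lie in the following statements, which are currently commented out in my code: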

negids = conll2002.fileids('neg')
posids = conll2002.fileids('pos')

If I uncomment these two statements, I get the following error:

Traceback (most recent call last):
  File "naive1.py", line 31, in <module>
    x=train()
  File "naive1.py", line 13, in train
    negids = conll2002.fileids('neg')
TypeError: fileids() takes exactly 1 argument (2 given)

I tried passing self to work around this, but it still does not work. Can someone point me in the right direction? Thanks in advance.

1 answer:

Answer 0 (score: 0)

The fileids() method accepts a categories argument, but only for categorized corpora. For example:

>>> from nltk.corpus import brown
>>> brown.fileids("mystery")
['cl01', 'cl02', 'cl03', 'cl04', 'cl05', 'cl06', 'cl07', 'cl08', 'cl09', 
'cl10', 'cl11', 'cl12', 'cl13', 'cl14', 'cl15', 'cl16', 'cl17', 'cl18', 
'cl19', 'cl20', 'cl21', 'cl22', 'cl23', 'cl24']
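
If you are not sure whether a corpus is categorized, one option is to check for the categories attribute before passing a category to fileids(). A small sketch of that guard; the helper name fileids_for and the two stand-in reader classes are hypothetical and exist only so the example runs without any corpus data installed:

```python
def fileids_for(reader, category=None):
    """Return file ids, filtering by category only if the reader supports it."""
    if category is None:
        return reader.fileids()
    if not hasattr(reader, "categories"):
        raise ValueError(
            "this corpus has no categories, cannot filter by %r" % category
        )
    return reader.fileids(category)


# Hypothetical stand-ins for an uncategorized and a categorized corpus reader.
class PlainReader:
    def fileids(self):
        return ["ned.train", "ned.testa"]


class CategorizedReader:
    def categories(self):
        return ["mystery"]

    def fileids(self, categories=None):
        all_ids = ["ca01", "cl01", "cl02"]
        return ["cl01", "cl02"] if categories == "mystery" else all_ids


print(fileids_for(CategorizedReader(), "mystery"))
print(fileids_for(PlainReader()))
```

With a real NLTK reader in place of the stand-ins, the same check separates corpora such as brown from corpora such as conll2002.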

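Your call fails because the CoNLL corpora have no categories. That is because they are not annotated for sentiment: CoNLL 2000 and CoNLL 2002 are both chunking corpora (NP/PP chunks and named entities, respectively).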

>>> conll2002.categories()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ConllChunkCorpusReader' object has no attribute 'categories'
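
The missing categories also account for the 1.0 : 1.0 ratios in the question. With the category arguments removed, negids and posids are the same list of files, so each document enters the training set once as 'neg' and once as 'pos', and no word can be more typical of one label than the other. A minimal pure-Python sketch of that effect, using made-up token lists in place of the corpus files and raw counts in place of NLTK's smoothed probability estimates:

```python
from collections import Counter


def word_feats(words):
    # Same bag-of-words features as in the question.
    return {word: True for word in words}


# Made-up token lists standing in for the conll2002 files (an assumption).
docs = [["goed", "film"], ["slecht", "weer"], ["mooi", "weer"]]

# The mistake from the question: every document is labelled both ways.
labelled = [(word_feats(d), "neg") for d in docs] + \
           [(word_feats(d), "pos") for d in docs]

# Count how often each word occurs under each label.
counts = {"pos": Counter(), "neg": Counter()}
for feats, label in labelled:
    counts[label].update(feats.keys())

# Ratio of pos count to neg count for every word.
ratios = {w: counts["pos"][w] / counts["neg"][w] for w in counts["neg"]}
print(ratios)
```

Every ratio comes out as 1.0, which is the pattern in the question's output.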

So the short answer to your question is that you cannot train a sentiment analyzer on the conll2002 corpus.
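
What you would need instead is data that actually carries sentiment labels. A rough sketch of the same NaiveBayesClassifier setup trained on labelled tweets; the four Dutch examples and their labels below are invented placeholders, not a real dataset, and a usable classifier would need far more:

```python
from nltk.classify import NaiveBayesClassifier


def word_feats(words):
    # Same bag-of-words features as in the question.
    return {word: True for word in words}


# Hypothetical hand-labelled tweets (placeholders, not real data).
labelled_tweets = [
    ("wat een geweldige film", "pos"),
    ("mooi weer vandaag", "pos"),
    ("wat een slechte film", "neg"),
    ("vreselijk weer vandaag", "neg"),
]

# Each training instance is a (featureset, label) pair.
train_set = [(word_feats(text.split()), label)
             for text, label in labelled_tweets]
classifier = NaiveBayesClassifier.train(train_set)

# Classify a new, unlabelled tweet.
prediction = classifier.classify(word_feats("geweldige dag".split()))
print(prediction)
```

The important difference from the question's code is that the label attached to each featureset comes from the data rather than being assigned to the same files twice.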