Question

I am trying to classify named entities using the NLTK Naive Bayes classifier. I have my own corpus, which looks like this:

Sementara itu Pengamat Pasar Modal <ENAMEX TYPE="PERSON">Dandossi Matram</ENAMEX> mengatakan, sulit bagi sebuah <ENAMEX TYPE="ORGANIZATION">kantor akuntan publik</ENAMEX> (<ENAMEX TYPE="ORGANIZATION">KAP</ENAMEX>) untuk dapat menyelesaikan audit perusahaan sebesar <ENAMEX TYPE="ORGANIZATION">Telkom</ENAMEX> dalam waktu 3 bulan.   1

I have already extracted the entity types and their values, so the data becomes this:

[[('PERSON', 'Dandossi Matram'), ('ORGANIZATION', 'kantor akuntan publik'), ('ORGANIZATION', 'KAP'), ('ORGANIZATION', 'Telkom')], [('ORGANIZATION', 'Telkom')],...]
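For readers following along, the extraction step can be sketched as below. It applies the same regular expression as the code later in this question to a shortened version of the sample sentence; the shortened `line` is made up for illustration and is not taken verbatim from the corpus.

```python
import re

# Shortened, illustrative version of the sample sentence from the corpus.
line = ('Pengamat Pasar Modal <ENAMEX TYPE="PERSON">Dandossi Matram</ENAMEX> '
        'mengatakan, sulit bagi sebuah <ENAMEX TYPE="ORGANIZATION">KAP</ENAMEX>.')

# Two non-greedy groups: the TYPE attribute, then the tagged text.
pattern = re.compile(r'<ENAMEX\s+TYPE="(.+?)">(.+?)</ENAMEX>')

# findall returns one (type, text) tuple per tagged entity on the line.
entities = pattern.findall(line)
print(entities)
# [('PERSON', 'Dandossi Matram'), ('ORGANIZATION', 'KAP')]
```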

Then I flattened the list and swapped the two values in each tuple, so the order becomes this:

[('Dandossi Matram', 'PERSON'), ('kantor akuntan publik', 'ORGANIZATION'), ('KAP', 'ORGANIZATION'), ('Telkom', 'ORGANIZATION'), ('Telkom', 'ORGANIZATION'),...]
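The flatten-and-swap step can also be written as a single comprehension. This is only a sketch on a small hand-written `nested` list (an assumed stand-in for the per-line results), not a replacement for the original code.

```python
# One inner list per corpus line, each holding (type, text) tuples.
nested = [
    [('PERSON', 'Dandossi Matram'), ('ORGANIZATION', 'KAP'), ('ORGANIZATION', 'Telkom')],
    [('ORGANIZATION', 'Telkom')],
]

# Flatten the outer list and swap each tuple to (text, type) in one pass.
pairs = [(text, label) for sentence in nested for label, text in sentence]
print(pairs)
# [('Dandossi Matram', 'PERSON'), ('KAP', 'ORGANIZATION'),
#  ('Telkom', 'ORGANIZATION'), ('Telkom', 'ORGANIZATION')]
```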

Because the NLTK Naive Bayes classifier requires a dictionary, I converted the list of tuples into a dictionary.

{'Agus': 'PERSON', 'Jawa Barat': 'LOCATION', 'Disney': 'ORGANIZATION', 'City': 'ORGANIZATION', 'manchestereveningnews': 'ORGANIZATION', 'Roma': 'LOCATION', 'PSG': 'ORGANIZATION', 'LPPNPI': 'ORGANIZATION', 'Telkomsel': 'ORGANIZATION', 'Federer': 'PERSON', 'Garuda Indonesia': 'ORGANIZATION',...}
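A side note on this conversion, offered as an observation rather than a diagnosis: `dict()` keeps one entry per key, so repeated mentions collapse into a single entry, and a string tagged with two different types would keep only the last one. Also, in the code further down the dictionary is only printed, while `train` and `test` are sliced from the list `ner_list`. The small `pairs` list below is made up for illustration.

```python
# Illustrative pairs; 'Telkom' appears twice, as it does in the corpus.
pairs = [
    ('Telkom', 'ORGANIZATION'),
    ('KAP', 'ORGANIZATION'),
    ('Telkom', 'ORGANIZATION'),
]

# dict() keeps a single entry per key, so the duplicate disappears.
corpus = dict(pairs)

print(len(pairs), len(corpus))  # 3 2
print(corpus)  # {'Telkom': 'ORGANIZATION', 'KAP': 'ORGANIZATION'}
```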

When I train the classifier, this error pops up:

    for fname, fval in featureset.items():
AttributeError: 'str' object has no attribute 'items'
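The traceback line suggests that the trainer unpacks every training item as a `(featureset, label)` pair and then calls `.items()` on the featureset. The sketch below reproduces that situation in plain Python. `count_features` is a hypothetical, simplified stand-in written for this illustration, not NLTK's actual implementation.

```python
# Training items shaped like the ones in the question: (string, label).
train = [('Dandossi Matram', 'PERSON'), ('Telkom', 'ORGANIZATION')]

def count_features(labeled_featuresets):
    """Hypothetical stand-in for a trainer's counting loop (not NLTK code)."""
    counts = {}
    for featureset, label in labeled_featuresets:
        # Works when featureset is a dict; a plain string has no .items().
        for fname, fval in featureset.items():
            key = (label, fname, fval)
            counts[key] = counts.get(key, 0) + 1
    return counts

try:
    count_features(train)
except AttributeError as err:
    message = str(err)
    print(message)
```

Under this reading, the first element of each pair has to be a dictionary of features; wrapping the whole data set in one dictionary does not change what each individual item looks like.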

This is strange, because my data is already in dictionary format. Please help. As for my code, here it is:

import re, nltk, random
ner_list = []
with open("ner_corpus.txt", "r") as f:
    for line in f:
        #print(line)
        tags = re.findall(r'<ENAMEX\s+TYPE=\"(.+?)\">(.+?)</ENAMEX>', line)
        #tags = re.findall(r'<(?:ENAMEX\s+TYPE)=.+?>(.+?)</(?:ENAMEX)>', line)
        ner_list.append(tags)
ner_list = [item for sublist in ner_list for item in sublist]
ner_list = list(map(lambda x: (x[1], x[0]),ner_list))
random.shuffle(ner_list)
ner_corpus = dict(ner_list)
print(ner_corpus)
splitratio = 0.8
train = ner_list[:int(len(ner_list)*splitratio)]
test = ner_list[int(len(ner_list)*splitratio):]
print(len(train))
print(len(test))
classifier = nltk.NaiveBayesClassifier.train(train)
print(nltk.classify.accuracy(classifier, test))
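One possible way forward, as a minimal sketch: `nltk.NaiveBayesClassifier.train` takes a list of `(featureset, label)` pairs in which each featureset is a dictionary, so each entity string could be mapped to a small feature dictionary before training. The names `entity_features`, `to_featuresets`, and `train_and_score`, as well as the particular features chosen, are assumptions made for this example and are not part of NLTK.

```python
def entity_features(text):
    """Hypothetical feature extractor: one entity string -> feature dict."""
    tokens = text.split() or [text]  # fall back if the text is only whitespace
    return {
        'first_word': tokens[0].lower(),
        'last_word': tokens[-1].lower(),
        'num_tokens': len(tokens),
        'is_all_caps': text.isupper(),
        'is_title_case': text.istitle(),
    }

def to_featuresets(pairs):
    """Turn (text, label) pairs into the (dict, label) pairs used for training."""
    return [(entity_features(text), label) for text, label in pairs]

def train_and_score(train_pairs, test_pairs):
    """Train a Naive Bayes classifier and return its accuracy on test_pairs."""
    import nltk  # imported here so the helpers above work without NLTK installed
    classifier = nltk.NaiveBayesClassifier.train(to_featuresets(train_pairs))
    return nltk.classify.accuracy(classifier, to_featuresets(test_pairs))

# Quick look at what a training item looks like after conversion.
sample = to_featuresets([('Dandossi Matram', 'PERSON'), ('KAP', 'ORGANIZATION')])
print(sample[0])
```

With the variables from the code above, this would be called as `train_and_score(train, test)`. How well it classifies depends heavily on the chosen features, which are only placeholders here.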

nltk naive-bayes training-data format

0 answers: