I am trying to use the NLTK Naive Bayes classifier to classify named entities. I have my own corpus, which looks like this:
Sementara itu Pengamat Pasar Modal <ENAMEX TYPE="PERSON">Dandossi Matram</ENAMEX> mengatakan, sulit bagi sebuah <ENAMEX TYPE="ORGANIZATION">kantor akuntan publik</ENAMEX> (<ENAMEX TYPE="ORGANIZATION">KAP</ENAMEX>) untuk dapat menyelesaikan audit perusahaan sebesar <ENAMEX TYPE="ORGANIZATION">Telkom</ENAMEX> dalam waktu 3 bulan. 1
I have already extracted the entity types and their values, so it becomes this:
[[('PERSON', 'Dandossi Matram'), ('ORGANIZATION', 'kantor akuntan publik'), ('ORGANIZATION', 'KAP'), ('ORGANIZATION', 'Telkom')], [('ORGANIZATION', 'Telkom')],...]
Then I flatten the list and swap the values in each tuple, so the order becomes this:
[('Dandossi Matram', 'PERSON'), ('kantor akuntan publik', 'ORGANIZATION'), ('KAP', 'ORGANIZATION'), ('Telkom', 'ORGANIZATION'), ('Telkom', 'ORGANIZATION'),...]
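The extract, flatten, and swap steps described above can be sketched on a single line of input. The sentence here is a shortened version of the sample, used for illustration only, and the variable names are placeholders.

```python
import re

# Shortened version of the sample sentence, for illustration only.
line = (
    'Pengamat Pasar Modal <ENAMEX TYPE="PERSON">Dandossi Matram</ENAMEX> '
    'mengatakan, sulit menyelesaikan audit '
    '<ENAMEX TYPE="ORGANIZATION">Telkom</ENAMEX>.'
)

ENAMEX = re.compile(r'<ENAMEX\s+TYPE="(.+?)">(.+?)</ENAMEX>')

# Step 1: one list of (type, text) tuples per corpus line.
per_line = [ENAMEX.findall(line)]

# Step 2: flatten the per-line lists into a single list.
flat = [pair for tags in per_line for pair in tags]

# Step 3: swap each tuple so the text comes first and the label second.
swapped = [(text, label) for label, text in flat]

print(swapped)
```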
Because NLTK Naive Bayes needs dictionaries, I converted the list of tuples into a dictionary.
{'Agus': 'PERSON', 'Jawa Barat': 'LOCATION', 'Disney': 'ORGANIZATION', 'City': 'ORGANIZATION', 'manchestereveningnews': 'ORGANIZATION', 'Roma': 'LOCATION', 'PSG': 'ORGANIZATION', 'LPPNPI': 'ORGANIZATION', 'Telkomsel': 'ORGANIZATION', 'Federer': 'PERSON', 'Garuda Indonesia': 'ORGANIZATION',...}
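Note that converting to a dictionary keeps only one label per distinct entity string: repeated mentions collapse, and if the same string was tagged with different types, the last one wins. A small sketch with made-up pairs (the conflicting `Roma` label is hypothetical, not taken from the corpus):

```python
# Made-up pairs: "Telkom" repeats, and "Roma" appears with two labels
# (the second label is a hypothetical conflict for illustration).
pairs = [
    ("Telkom", "ORGANIZATION"),
    ("Roma", "LOCATION"),
    ("Telkom", "ORGANIZATION"),
    ("Roma", "ORGANIZATION"),
]

# dict() keeps one entry per key; later pairs overwrite earlier ones.
ner_corpus = dict(pairs)

print(len(pairs), len(ner_corpus))
print(ner_corpus)
```

The result maps an entity string to its label, which is a lookup table and not a set of features describing one example.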
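When I train the classifier, it raises an error that says
for fname, fval in featureset.items(): AttributeError: 'str' object has no attribute 'items'
This is strange, because my data is in dictionary format. Please help me. As for my code, here it is: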
import re, nltk, random
ner_list = []
with open("ner_corpus.txt", "r") as f:
    for line in f:
        #print(line)
        tags = re.findall(r'<ENAMEX\s+TYPE=\"(.+?)\">(.+?)</ENAMEX>', line)
        #tags = re.findall(r'<(?:ENAMEX\s+TYPE)=.+?>(.+?)</(?:ENAMEX)>', line)
        ner_list.append(tags)
ner_list = [item for sublist in ner_list for item in sublist]
ner_list = list(map(lambda x: (x[1], x[0]),ner_list))
random.shuffle(ner_list)
ner_corpus = dict(ner_list)
print(ner_corpus)
splitratio = 0.8
train = ner_list[:int(len(ner_list)*splitratio)]
test = ner_list[int(len(ner_list)*splitratio):]
print(len(train))
print(len(test))
classifier = nltk.NaiveBayesClassifier.train(train)
print(nltk.classify.accuracy(classifier, test))
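For what it's worth, the traceback line suggests the trainer calls `.items()` on the first element of every training pair, and `train` above is sliced from `ner_list`, whose first elements are plain strings, not from `ner_corpus`. A possible fix, sketched under assumptions, is to wrap each entity string in a feature dictionary before training. The `entity_features` helper and the features it picks are hypothetical choices for illustration, and the `nltk` import is guarded so the snippet still runs where NLTK is not installed.

```python
def entity_features(text):
    """Hypothetical feature extractor: maps an entity string to a featureset dict."""
    tokens = text.split() or [text]
    return {
        "first_word": tokens[0].lower(),
        "last_word": tokens[-1].lower(),
        "num_tokens": len(tokens),
        "is_upper": text.isupper(),
        "is_title": text.istitle(),
    }


# A few (text, label) pairs in the same shape as ner_list after the swap.
pairs = [
    ("Dandossi Matram", "PERSON"),
    ("kantor akuntan publik", "ORGANIZATION"),
    ("KAP", "ORGANIZATION"),
    ("Telkom", "ORGANIZATION"),
]

# Each training item is (featureset dict, label), so featureset.items() works.
labeled_featuresets = [(entity_features(text), label) for text, label in pairs]

try:
    import nltk
except ImportError:  # NLTK not installed: only the data shaping above runs
    nltk = None

if nltk is not None:
    classifier = nltk.NaiveBayesClassifier.train(labeled_featuresets)
    # Classify an unseen entity using the same feature extractor.
    print(classifier.classify(entity_features("Telkomsel")))
```

With the real corpus, the same list comprehension could be applied to `ner_list` after shuffling, and the train/test split taken from the resulting list so that `nltk.classify.accuracy` also receives feature dictionaries.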