NLTK Naive Bayes training data format

Time: 2018-04-26 10:12:20

Tags: python classification nltk training-data naivebayes

I am trying to use the NLTK Naive Bayes classifier to classify named entities (named entity recognition). I have my own corpus, which looks like this:

Sementara itu Pengamat Pasar Modal <ENAMEX TYPE="PERSON">Dandossi Matram</ENAMEX> mengatakan, sulit bagi sebuah <ENAMEX TYPE="ORGANIZATION">kantor akuntan publik</ENAMEX> (<ENAMEX TYPE="ORGANIZATION">KAP</ENAMEX>) untuk dapat menyelesaikan audit perusahaan sebesar <ENAMEX TYPE="ORGANIZATION">Telkom</ENAMEX> dalam waktu 3 bulan.   1

I have extracted the entity types and their values, so it becomes this:

[[('PERSON', 'Dandossi Matram'), ('ORGANIZATION', 'kantor akuntan publik'), ('ORGANIZATION', 'KAP'), ('ORGANIZATION', 'Telkom')], [('ORGANIZATION', 'Telkom')],...]

Then I flatten the list and swap the values in each tuple, so the order becomes:

[('Dandossi Matram', 'PERSON'), ('kantor akuntan publik', 'ORGANIZATION'), ('KAP', 'ORGANIZATION'), ('Telkom', 'ORGANIZATION'), ('Telkom', 'ORGANIZATION'),...]

Because NLTK Naive Bayes requires a dictionary, I convert the list of tuples into a dictionary:

{'Agus': 'PERSON', 'Jawa Barat': 'LOCATION', 'Disney': 'ORGANIZATION', 'City': 'ORGANIZATION', 'manchestereveningnews': 'ORGANIZATION', 'Roma': 'LOCATION', 'PSG': 'ORGANIZATION', 'LPPNPI': 'ORGANIZATION', 'Telkomsel': 'ORGANIZATION', 'Federer': 'PERSON', 'Garuda Indonesia': 'ORGANIZATION',...}

When I train the classifier, this error pops up:

  

    for fname, fval in featureset.items():
AttributeError: 'str' object has no attribute 'items'

This is strange, because my data is in dictionary format. Please help. Here is my code:

import re, nltk, random

# Collect the (TYPE, text) pairs found in each line of the corpus
ner_list = []
with open("ner_corpus.txt", "r") as f:
    for line in f:
        #print(line)
        tags = re.findall(r'<ENAMEX\s+TYPE=\"(.+?)\">(.+?)</ENAMEX>', line)
        #tags = re.findall(r'<(?:ENAMEX\s+TYPE)=.+?>(.+?)</(?:ENAMEX)>', line)
        ner_list.append(tags)
# Flatten the per-line lists, then swap each pair to (text, TYPE)
ner_list = [item for sublist in ner_list for item in sublist]
ner_list = list(map(lambda x: (x[1], x[0]),ner_list))
random.shuffle(ner_list)
ner_corpus = dict(ner_list)
print(ner_corpus)
# 80/20 split; train and test are slices of the (text, TYPE) tuple list
splitratio = 0.8
train = ner_list[:int(len(ner_list)*splitratio)]
test = ner_list[int(len(ner_list)*splitratio):]
print(len(train))
print(len(test))
# This call raises the AttributeError quoted above
classifier = nltk.NaiveBayesClassifier.train(train)
print(nltk.classify.accuracy(classifier, test))
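
The traceback points at `featureset.items()`, which suggests what `nltk.NaiveBayesClassifier.train` expects: a list of `(featureset, label)` pairs in which each `featureset` is itself a dict of feature names to values, not a single dict mapping every entity to its label. In the code above, `train` is sliced from `ner_list`, so each item is a `(str, str)` tuple and the string has no `.items()` method. Below is a minimal sketch of one possible fix. The helper names `extract_entities`, `entity_features`, `to_labeled_featuresets` and `train_and_score`, and the particular features chosen, are illustrative assumptions and not part of the original code or of NLTK.

```python
import random
import re

ENAMEX_RE = re.compile(r'<ENAMEX\s+TYPE="(.+?)">(.+?)</ENAMEX>')


def extract_entities(line):
    """Return (text, label) pairs for every ENAMEX tag in one corpus line."""
    return [(text, label) for label, text in ENAMEX_RE.findall(line)]


def entity_features(text):
    """Hypothetical feature extractor: builds one small dict per entity string."""
    tokens = text.split() or [text]  # fall back if the text is only whitespace
    return {
        "first_token": tokens[0].lower(),
        "last_token": tokens[-1].lower(),
        "num_tokens": len(tokens),
        "is_all_caps": text.isupper(),
        "is_title_case": text.istitle(),
    }


def to_labeled_featuresets(pairs):
    """Turn (text, label) pairs into the (dict, label) pairs NLTK trains on."""
    return [(entity_features(text), label) for text, label in pairs]


def train_and_score(pairs, split_ratio=0.8, seed=0):
    """Shuffle, split, train a Naive Bayes classifier and return it with its accuracy."""
    import nltk  # imported here so the helpers above can be used without NLTK

    labeled = to_labeled_featuresets(pairs)
    random.Random(seed).shuffle(labeled)
    cut = int(len(labeled) * split_ratio)
    train, test = labeled[:cut], labeled[cut:]
    classifier = nltk.NaiveBayesClassifier.train(train)
    return classifier, nltk.classify.accuracy(classifier, test)
```

With the corpus read line by line as before, `pairs` would be the concatenation of `extract_entities(line)` over all lines, followed by `classifier, acc = train_and_score(pairs)`. To label a new string, pass `entity_features(new_text)` to `classifier.classify`, so that the classifier sees the same kind of feature dict it was trained on.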

0 answers:

No answers