Question

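I have been playing around with some Latin texts that come with English summaries of their content. I am trying to perform NER on both texts to extract dates, locations and persons. I started with the English part, thinking it should be easier to work with, and used chunking. Dates are not recognized, and not all entities are captured. Is there a way to customize the output to make it more accurate? Here is a sample of my code: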

import nltk

text = 'Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields called Ta Xellula and Gnien Hagem in the district of Dejr is-Safsaf for ten years to Nicolaus Delia and his son Lemus for the price of eight salme of wheat each harvest-time. The tenants also bind themselves to give Nifusi each year ten salme of brushwood and two salme of straw. On his part the Jew promised to build a surrounding wall for the fields at his own expense.'

# Assumed helper (not shown in the original post): joins the tokens of every 'NE' chunk.
def extract_entity_names(tree):
    if not hasattr(tree, 'label'):
        return []  # a leaf, i.e. a (token, tag) tuple
    if tree.label() == 'NE':
        return [' '.join(token for token, tag in tree)]
    return [name for child in tree for name in extract_entity_names(child)]

sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))
print(set(entity_names))

This is the output I get:

set(['Nicolaus Delia', 'Gnien Hagem', 'Dejr', 'Nifusi', 'Jew'])

I was expecting at least the date, the Jew, Azar Nifusi, Ta Xellula, Gnien Hagem, Dejr is-Safsaf, Nicolaus Delia and Lemus to be extracted. Any help?

Answer 1

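Using this one line of code gives me things like the date. The result is in tree format, but I assume you can extract what you need into a cleaner format yourself afterwards.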

from nltk import ne_chunk, pos_tag, word_tokenize
ne_chunk(pos_tag(word_tokenize(text)))

Output:

Tree('S', [('Thursday', 'NNP'), (',', ','), ('3', 'CD'), ('September', 'NNP'), ('1467', 'CD'), ('.', '.'), ('The', 'DT'), Tree('ORGANIZATION', [('Jew', 'NNP'), ('Azar', 'NNP'), ('Nifusi', 'NNP')]), ('leases', 'VBZ'), ('his', 'PRP$'), ('fields', 'NNS'), ('called', 'VBD'), Tree('PERSON', [('Ta', 'NNP'), ('Xellula', 'NNP')]), ('and', 'CC'), Tree('PERSON', [('Gnien', 'NNP'), ('Hagem', 'NNP')]), ('in', 'IN'), ('the', 'DT'), ('district', 'NN'), ('of', 'IN'), Tree('GPE', [('Dejr', 'NNP')]), ('is-Safsaf', 'NN'), ('for', 'IN'), ('ten', 'JJ'), ('years', 'NNS'), ('to', 'TO'), Tree('PERSON', [('Nicolaus', 'NNP'), ('Delia', 'NNP')]), ('and', 'CC'), ('his', 'PRP$'), ('son', 'NN'), Tree('PERSON', [('Lemus', 'NNP')]), ('for', 'IN'), ('the', 'DT'), ('price', 'NN'), ('of', 'IN'), ('eight', 'CD'), ('salme', 'NNS'), ('of', 'IN'), ('wheat', 'NN'), ('each', 'DT'), ('harvest-time', 'NN'), ('.', '.'), ('The', 'DT'), ('tenants', 'NNS'), ('also', 'RB'), ('bind', 'VBP'), ('themselves', 'PRP'), ('to', 'TO'), ('give', 'VB'), Tree('PERSON', [('Nifusi', 'NNP')]), ('each', 'DT'), ('year', 'NN'), ('ten', 'RB'), ('salme', 'NN'), ('of', 'IN'), ('brushwood', 'NN'), ('and', 'CC'), ('two', 'CD'), ('salme', 'NN'), ('of', 'IN'), ('straw', 'NN'), ('.', '.'), ('On', 'IN'), ('his', 'PRP$'), ('part', 'NN'), ('the', 'DT'), Tree('ORGANIZATION', [('Jew', 'NNP')]), ('promised', 'VBD'), ('to', 'TO'), ('build', 'VB'), ('a', 'DT'), ('surrounding', 'VBG'), ('wall', 'NN'), ('for', 'IN'), ('the', 'DT'), ('fields', 'NNS'), ('at', 'IN'), ('his', 'PRP$'), ('own', 'JJ'), ('expense', 'NN'), ('.', '.')])
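
To get from that tree to a cleaner format, something like the following might work. It is only a sketch: `tree_to_entities`, `entities_from_text` and `extract_dates` are hypothetical names, not part of NLTK. The walker assumes the flat structure shown above, where labelled subtrees sit directly under the root and leaves are (token, tag) tuples. Because the date tokens in the tree above appear as plain tagged tokens rather than inside a labelled chunk, the sketch also includes a simple regular expression for dates written like "Thursday, 3 September 1467"; other date formats would need their own patterns.

```python
import re


def tree_to_entities(tree):
    """Collect (label, text) pairs from a flat NLTK-style chunk tree."""
    entities = []
    for node in tree:
        # Labelled subtrees (PERSON, GPE, ...) expose .label();
        # plain leaves are (token, tag) tuples and are skipped.
        if hasattr(node, "label"):
            text = " ".join(token for token, _tag in node)
            entities.append((node.label(), text))
    return entities


def entities_from_text(text):
    """Run the one-liner from the answer and flatten its result."""
    # Imported lazily; this call needs NLTK and its tokenizer, tagger
    # and chunker models to be installed.
    from nltk import ne_chunk, pos_tag, word_tokenize
    return tree_to_entities(ne_chunk(pos_tag(word_tokenize(text))))


# Assumed date shape: optional weekday, day, English month name, year.
WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RE = re.compile(
    r"\b(?:(?:%s),\s+)?\d{1,2}\s+(?:%s)\s+\d{3,4}\b" % (WEEKDAYS, MONTHS)
)


def extract_dates(text):
    """Return every substring of text that matches the assumed date shape."""
    return DATE_RE.findall(text)
```

Applied to the tree printed above, `tree_to_entities` would yield pairs starting with ('ORGANIZATION', 'Jew Azar Nifusi') and ('PERSON', 'Ta Xellula'), so the labels still need checking by hand: the chunker's categories are not reliable for names like these.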

Extracting dates, persons and locations from Latin and English texts
