I have been playing with the NLTK toolkit. I run into this problem a lot and have searched online for a solution, but I couldn't find one anywhere, so I am asking my question here.
Often, NER does not tag consecutive NNPs as one NE. I think that editing the NER to also use a RegexpTagger could improve it.
Example:
Input:
Barack Obama is a great person.
Output:
Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])
Whereas
Input:
Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.
Output:
Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])
Here, Vice/NNP, President/NNP and (Dick/NNP, Cheney/NNP) are extracted correctly.
So I think that if nltk.ne_chunk is used first, and then, whenever two consecutive trees are NNPs, there is a good chance that both refer to a single entity.
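For example, a minimal sketch of this idea (just a regular-expression grammar over the POS tags with nltk.RegexpParser, not a full NER) would chunk any run of consecutive NNPs into one node:

from nltk import pos_tag, word_tokenize, RegexpParser

# Grammar: any run of one or more NNP tokens becomes a single NE chunk.
chunker = RegexpParser("NE: {<NNP>+}")
tagged = pos_tag(word_tokenize("Barack Obama is a great person."))
print(chunker.parse(tagged))
# Expected shape: Tree('S', [Tree('NE', [('Barack', 'NNP'), ('Obama', 'NNP')]), ('is', 'VBZ'), ...])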
Any suggestions would be much appreciated. I am looking for flaws in my approach.
Thanks.
Answer 0 (Score: 16)
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if type(i) == Tree:
            # Collect the surface string of this NE subtree's leaves.
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-NE token ends the current run of adjacent NE subtrees.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue

    if continuous_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person."
print get_continuous_chunks(txt)
[OUT]:
['Barack Obama']
Note, however, that if two consecutive chunks should not form a single NE, you will merge several NEs into one. I cannot think of such an example off the top of my head, but I am sure one exists. If the chunks are not contiguous, though, the script above works fine:
>>> txt = "Barack Obama is the husband of Michelle Obama."
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
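If you want to see when that merge crosses entity types (e.g. the PERSON + ORGANIZATION split of "Barack Obama" in the question), a hedged variation of the same loop can carry the subtree labels along with the merged string. This is only a sketch, not part of the answer above:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks_with_labels(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []  # (surface string, label) pairs for adjacent NE subtrees
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append((" ".join(token for token, pos in i.leaves()), i.label()))
        elif current_chunk:
            entity = " ".join(surface for surface, label in current_chunk)
            continuous_chunk.append((entity, [label for surface, label in current_chunk]))
            current_chunk = []
    if current_chunk:
        entity = " ".join(surface for surface, label in current_chunk)
        continuous_chunk.append((entity, [label for surface, label in current_chunk]))
    return continuous_chunk

# Something like ('Barack Obama', ['PERSON', 'ORGANIZATION']) makes the merge visible,
# so you can decide whether joining across different labels is really what you want.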
Answer 1 (Score: 5)
There is a bug in @alvas's answer: a fencepost error. Make sure to run that elif check outside the loop as well, so that an NE at the end of the sentence is not dropped. That is:
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue

    # The fencepost fix: flush whatever remains in current_chunk after the loop,
    # so an entity at the very end of the sentence is not lost.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
        current_chunk = []

    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama."
print get_continuous_chunks(txt)
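A quick way to see why the post-loop check matters is a sentence that ends on the entity itself, so the loop finishes before any elif branch can flush current_chunk (a hypothetical test, assuming the default chunker still tags the name when there is no trailing period):

txt = "Last night I saw Barack Obama"   # no token follows the final NE subtree
print(get_continuous_chunks(txt))       # only the check after the loop captures the trailing entity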
Answer 2 (Score: 0)
Great answer from @alvas, it really helped. I have tried to capture the solution in a more functional way. There is still room for improvement, though.
def conditions(tree_node):
    # Keep only subtrees of height 2, i.e. NE nodes that directly dominate (token, pos) leaves.
    return tree_node.height() == 2

def coninuous_entities(input_text, file_handle):
    from nltk import ne_chunk, pos_tag, word_tokenize
    from nltk.tree import Tree

    # Note: Currently, the chunker categorizes only 2 'NNP' together.
    docs = input_text.split('\n')
    for input_text in docs:
        chunked_data = ne_chunk(pos_tag(word_tokenize(input_text)))
        child_data = [subtree for subtree in chunked_data.subtrees(filter=conditions)]
        named_entities = []
        for child in child_data:
            if type(child) == Tree:
                named_entities.append(" ".join([token for token, pos in child.leaves()]))
        # Dump all entities to file for now, we will see how to go about that
        if file_handle is not None:
            file_handle.write('\n'.join(named_entities) + '\n')
    return named_entities
Using the conditions function, you can add as many conditions as you like for filtering.
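For example, a hypothetical extra condition (my illustration, not part of the answer above) could also restrict the chunks to the entity labels you care about:

def conditions(tree_node):
    # Height 2 means the node directly dominates (token, pos) leaves;
    # additionally keep only a few of the standard ne_chunk labels.
    return tree_node.height() == 2 and tree_node.label() in ('PERSON', 'ORGANIZATION', 'GPE')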