Named Entity Recognition with Regular Expressions: NLTK

Posted: 2014-06-25 00:45:05

Tags: regex nlp nltk named-entity-recognition

I have been playing around with the NLTK toolkit. I run into this problem a lot and have searched online for a solution, but I couldn't find one anywhere. So I am asking my question here.

Often, the NER does not tag consecutive NNPs as one NE. I think editing the NER to also use a RegexpTagger could improve it.

Example:

Input:

  Barack Obama is a great person.

Output:

  Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

Whereas:

Input:

  Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.

Output:

  Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])

Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is extracted correctly.

So I think that if nltk.ne_chunk is used first, and then, if two consecutive trees are NNPs, there is a high chance that both refer to one entity.
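For example, something along these lines with nltk.RegexpParser (just a sketch of the idea, using a chunker in place of the RegexpTagger mentioned above; the grammar is illustrative):

from nltk import pos_tag, word_tokenize, RegexpParser

# Treat any run of consecutive NNPs as a single NE chunk.
grammar = "NE: {<NNP>+}"
chunker = RegexpParser(grammar)

tagged = pos_tag(word_tokenize("Barack Obama is a great person."))
print(chunker.parse(tagged))
# Expected shape (assumption): Tree('S', [Tree('NE', [('Barack', 'NNP'), ('Obama', 'NNP')]), ...])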

Any suggestions would be greatly appreciated. I am looking for flaws in my approach.

Thanks.

3 Answers:

Answer 0 (score: 16):

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if type(i) == Tree:
            # An NE subtree: collect its tokens into the current run.
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-NE token ends the current run; flush it as one entity.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    # Final flush after the loop.
    if continuous_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person." 
print(get_continuous_chunks(txt))

[OUT]:

['Barack Obama']

Note, though, that if the consecutive chunks should not actually be a single NE, this will combine multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure one exists. If they are not consecutive, however, the script above works fine:

>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
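If over-merging is a concern, one possible variant (a sketch, not from the original answer; it assumes NLTK 3's Tree.label()) is to keep each chunk's NE label and only merge adjacent chunks that share the same label:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_labeled_chunks(text):
    # Sketch: merge adjacent NE subtrees only when they share the same label.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    entities = []          # list of (label, entity_string) pairs
    prev_label = None
    for i in chunked:
        if isinstance(i, Tree):
            tokens = " ".join(token for token, pos in i.leaves())
            if prev_label is not None and i.label() == prev_label:
                # Same label as the previous chunk: extend the last entity.
                label, so_far = entities[-1]
                entities[-1] = (label, so_far + " " + tokens)
            else:
                entities.append((i.label(), tokens))
            prev_label = i.label()
        else:
            # Any non-entity token breaks the run.
            prev_label = None
    return entities

Note the trade-off: in the question's first example ne_chunk labels Barack as PERSON and Obama as ORGANIZATION, so this stricter rule would keep them as two separate entities.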

Answer 1 (score: 5):

There is a bug in @alvas's answer: a fencepost error. Make sure that elif check also runs outside the loop, so you don't leave behind an NE that occurs at the end of the sentence. So:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue
    # Fencepost fix: flush any entity still pending when the loop ends.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
            current_chunk = []
    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama." 
print(get_continuous_chunks(txt))
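To see why the final check matters, here is an illustrative case (the exact chunking is an assumption about the tagger's behaviour): a sentence that ends with the entity, leaving no token after it to trigger the elif branch.

# Illustrative end-of-sentence check (assumes the tagger chunks "Barack Obama"
# as an NE): the entity is the last chunk, which the earlier version would
# drop because its final flush tested continuous_chunk instead of current_chunk.
txt = "My name is Barack Obama"
print(get_continuous_chunks(txt))  # expected (assumption): ['Barack Obama']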

Answer 2 (score: 0):

@alvas, great answer. It was really helpful. I tried to capture your solution in a more functional way. It still needs further improvement, though.

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def conditions(tree_node):
    return tree_node.height() == 2

def continuous_entities(input_text, file_handle=None):
    # Note: Currently, the chunker categorizes only 2 'NNP' together.
    docs = input_text.split('\n')
    for input_text in docs:
        chunked_data = ne_chunk(pos_tag(word_tokenize(input_text)))
        child_data = [subtree for subtree in chunked_data.subtrees(filter=conditions)]

        named_entities = []
        for child in child_data:
            if type(child) == Tree:
                named_entities.append(" ".join([token for token, pos in child.leaves()]))

        # Dump all entities to file for now, we will see how to go about that
        if file_handle is not None:
            file_handle.write('\n'.join(named_entities) + '\n')
    return named_entities

Using the conditions function, you can add as many conditions as you like to filter on.
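A hypothetical invocation (the output file name is illustrative, and file_handle can be omitted if you only want the return value):

text = "Barack Obama is a great person.\nDick Cheney was compared to Darth Vader."
with open("entities.txt", "w") as fh:
    print(continuous_entities(text, fh))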