Named Entity Recognition with Regular Expressions: NLTK

Posted: 2014-06-25 00:45:05

Tags: regex nlp nltk named-entity-recognition

I have been playing around with the NLTK toolkit. I run into this problem a lot and have searched online for a solution, but I couldn't find one anywhere. So I am asking my question here.

Often, the NER does not tag consecutive NNPs as one NE. I think editing the NER to also use a RegexpTagger could improve it.

Example:

Input:

  Barack Obama is a great person.

Output:

  Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

Whereas:

Input:

  Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.

Output:

  Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])

Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is extracted correctly.

So I think that if nltk.ne_chunk is used first, and then, if two consecutive trees are NNPs, there is a high chance that both refer to one entity.
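For example, something along these lines with nltk.RegexpParser (just a sketch of the idea, using a chunker in place of the RegexpTagger mentioned above; the grammar is illustrative):

from nltk import pos_tag, word_tokenize, RegexpParser

# Treat any run of consecutive NNPs as a single NE chunk.
grammar = "NE: {<NNP>+}"
chunker = RegexpParser(grammar)

tagged = pos_tag(word_tokenize("Barack Obama is a great person."))
print(chunker.parse(tagged))
# Expected shape (assumption): Tree('S', [Tree('NE', [('Barack', 'NNP'), ('Obama', 'NNP')]), ...])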

Any suggestions would be greatly appreciated. I am looking for flaws in my approach.

Thanks.

3 Answers:

Answer 0 (score: 16):

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if type(i) == Tree:
            # An NE subtree: collect its tokens into the current run.
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-NE token ends the current run; flush it as one entity.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    # Final flush after the loop.
    if continuous_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person." 
print(get_continuous_chunks(txt))

[OUT]:

['Barack Obama']

Note, though, that if the consecutive chunks should not actually be a single NE, this will combine multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure one exists. If they are not consecutive, however, the script above works fine:

>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
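If over-merging is a concern, one possible variant (a sketch, not from the original answer; it assumes NLTK 3's Tree.label()) is to keep each chunk's NE label and only merge adjacent chunks that share the same label:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_labeled_chunks(text):
    # Sketch: merge adjacent NE subtrees only when they share the same label.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    entities = []          # list of (label, entity_string) pairs
    prev_label = None
    for i in chunked:
        if isinstance(i, Tree):
            tokens = " ".join(token for token, pos in i.leaves())
            if prev_label is not None and i.label() == prev_label:
                # Same label as the previous chunk: extend the last entity.
                label, so_far = entities[-1]
                entities[-1] = (label, so_far + " " + tokens)
            else:
                entities.append((i.label(), tokens))
            prev_label = i.label()
        else:
            # Any non-entity token breaks the run.
            prev_label = None
    return entities

Note the trade-off: in the question's first example ne_chunk labels Barack as PERSON and Obama as ORGANIZATION, so this stricter rule would keep them as two separate entities.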

Answer 1 (score: 5):

There is a bug in @alvas's answer: a fencepost error. Make sure that elif check also runs outside the loop, so you don't leave behind an NE that occurs at the end of the sentence. So:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue
    # Fencepost fix: flush any entity still pending when the loop ends.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
            current_chunk = []
    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama." 
print(get_continuous_chunks(txt))
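To see why the final check matters, here is an illustrative case (the exact chunking is an assumption about the tagger's behaviour): a sentence that ends with the entity, leaving no token after it to trigger the elif branch.

# Illustrative end-of-sentence check (assumes the tagger chunks "Barack Obama"
# as an NE): the entity is the last chunk, which the earlier version would
# drop because its final flush tested continuous_chunk instead of current_chunk.
txt = "My name is Barack Obama"
print(get_continuous_chunks(txt))  # expected (assumption): ['Barack Obama']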

Answer 2 (score: 0):

@alvas, great answer. It was really helpful. I tried to capture your solution in a more functional way. It still needs further improvement, though.

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def conditions(tree_node):
    return tree_node.height() == 2

def continuous_entities(input_text, file_handle=None):
    # Note: Currently, the chunker categorizes only 2 'NNP' together.
    docs = input_text.split('\n')
    for input_text in docs:
        chunked_data = ne_chunk(pos_tag(word_tokenize(input_text)))
        child_data = [subtree for subtree in chunked_data.subtrees(filter=conditions)]

        named_entities = []
        for child in child_data:
            if type(child) == Tree:
                named_entities.append(" ".join([token for token, pos in child.leaves()]))

        # Dump all entities to file for now, we will see how to go about that
        if file_handle is not None:
            file_handle.write('\n'.join(named_entities) + '\n')
    return named_entities

Using the conditions function, you can add as many conditions as you like to filter on.
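A hypothetical invocation (the output file name is illustrative, and file_handle can be omitted if you only want the return value):

text = "Barack Obama is a great person.\nDick Cheney was compared to Darth Vader."
with open("entities.txt", "w") as fh:
    print(continuous_entities(text, fh))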