Find successively connected nouns or pronouns in a string

Date: 2019-09-17 14:59:22

Tags: python spacy pos-tagger

I want to find nouns in a text that stand alone or are successively connected. I put the code below together, but it is neither efficient nor Pythonic. Does anyone know a more Pythonic way to find these nouns with spaCy?

The code below builds a dictionary for every token and then iterates over them, collecting standalone or connected PROPN and NOUN tokens until the lookahead runs past the end of the list. It returns a list of the collected items.

def extract_unnamed_ents(doc):
  """Takes a string and returns a list of all succesively connected nouns or pronouns""" 
  nlp_doc = nlp(doc)
  token_list = []
  for token in nlp_doc:
    token_dict = {}
    token_dict['lemma'] = token.lemma_
    token_dict['pos'] = token.pos_
    token_dict['tag'] = token.tag_
    token_list.append(token_dict)
  ents = []
  k = 0
  for i in range(len(token_list)):
    try:
      if token_list[k]['pos'] == 'PROPN' or token_list[k]['pos'] == 'NOUN':
        ent = token_list[k]['lemma']

        if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
          ent = ent + ' ' + token_list[k+1]['lemma']
          k += 1
          if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
            ent = ent + ' ' + token_list[k+1]['lemma']
            k += 1
            if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
              ent = ent + ' ' + token_list[k+1]['lemma']
              k += 1
              if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
                ent = ent + ' ' + token_list[k+1]['lemma']
                k += 1
        if ent not in ents:
          ents.append(ent)
    except IndexError:
      # Reached the end of the token list while looking ahead.
      pass
    k += 1
  return ents

Test:

extract_unnamed_ents('Chancellor Angela Merkel and some of her ministers will discuss at a cabinet '
                     "retreat next week ways to avert driving bans in major cities after Germany's "
                     'top administrative court in February allowed local authorities to bar '
                     'heavily polluting diesel cars.')

Out:

['Chancellor Angela Merkel',
 'minister',
 'cabinet retreat',
 'week way',
 'ban',
 'city',
 'Germany',
 'court',
 'February',
 'authority',
 'diesel car']

1 answer:

Answer 0 (score: 0)

spaCy can do this, though I'm not sure it gives you exactly what you're after:

import spacy

text = """Chancellor Angela Merkel and some of her ministers will discuss
at a cabinet retreat next week ways to avert driving bans in
major cities after Germany's top administrative court
in February allowed local authorities to bar heavily
polluting diesel cars.
""".replace('\n', ' ')

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([i.text for i in doc.noun_chunks])

Gives

['Chancellor Angela Merkel', 'her ministers', 'a cabinet retreat', 'ways', 'driving bans', 'major cities', "Germany's top administrative court", 'February', 'local authorities', 'heavily polluting diesel cars']

Here, however, using i.lemma_ doesn't really give you what you want (I think this may be fixed by this recent PR).
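A common workaround is to join the lemma of each token inside a chunk. A minimal sketch (using `SimpleNamespace` objects as stand-ins for spaCy tokens, so it runs without a model; with a real `Doc` you would iterate the tokens of each item in `doc.noun_chunks`):

```python
from types import SimpleNamespace

# Stand-ins for the tokens inside one noun chunk; real spaCy tokens
# expose the same .lemma_ attribute.
chunk = [SimpleNamespace(lemma_="heavily"),
         SimpleNamespace(lemma_="pollute"),
         SimpleNamespace(lemma_="diesel"),
         SimpleNamespace(lemma_="car")]

# Joining per-token lemmas approximates a lemmatized noun chunk.
lemmatized_chunk = ' '.join(tok.lemma_ for tok in chunk)
print(lemmatized_chunk)  # heavily pollute diesel car
```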

Since your use case is slightly different, you can use itertools.groupby like this:

import itertools

out = []
for pos, group in itertools.groupby(doc, key=lambda token: token.pos_):
    if pos not in ("PROPN", "NOUN"):
        continue
    out.append(' '.join(token.lemma_ for token in group))
print(out)

Gives

['Chancellor Angela Merkel', 'minister', 'cabinet retreat', 'week way', 'ban', 'city', 'Germany', 'court', 'February', 'authority', 'diesel car']
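The grouping step is easy to see without spaCy: `itertools.groupby` bundles consecutive items that share a key, so an unbroken run of PROPN/NOUN tokens comes out as one group. A minimal sketch with `(lemma, pos)` tuples standing in for spaCy tokens:

```python
import itertools

# (lemma, pos) pairs standing in for spaCy tokens.
tokens = [("Chancellor", "PROPN"), ("Angela", "PROPN"), ("Merkel", "PROPN"),
          ("and", "CCONJ"), ("some", "DET"), ("of", "ADP"),
          ("her", "PRON"), ("minister", "NOUN")]

out = []
for pos, group in itertools.groupby(tokens, key=lambda t: t[1]):
    if pos in ("PROPN", "NOUN"):
        # Consecutive tokens with the same POS arrive as one group.
        out.append(' '.join(lemma for lemma, _ in group))
print(out)  # ['Chancellor Angela Merkel', 'minister']
```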

This should give you exactly the same output as your function (the output here differs slightly, but I believe that is due to different spaCy versions).

If you're feeling really adventurous, you can use a list comprehension:

out = [' '.join(token.lemma_ for token in group)
       for pos, group in itertools.groupby(doc, key=lambda token: token.pos_)
       if pos in ("PROPN", "NOUN")]

Note that I see slightly different results with different spaCy versions. The output above is from spacy-2.1.8.