我想在文本中找到独立的或连续连接的名词。我把下面的代码放在一起,但这既不高效也不是pythonic。有人用spaCy用更Python的方式查找这些名词吗?
下面的代码用所有标记构建一个字典,然后遍历它们以查找独立或连接的PROPN
或NOUN
,直到for循环超出范围。它返回收集到的物品的列表。
def extract_unnamed_ents(doc):
"""Takes a string and returns a list of all succesively connected nouns or pronouns"""
nlp_doc = nlp(doc)
token_list = []
for token in nlp_doc:
token_dict = {}
token_dict['lemma'] = token.lemma_
token_dict['pos'] = token.pos_
token_dict['tag'] = token.tag_
token_list.append(token_dict)
ents = []
k = 0
for i in range(len(token_list)):
try:
if token_list[k]['pos'] == 'PROPN' or token_list[k]['pos'] == 'NOUN':
ent = token_list[k]['lemma']
if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
ent = ent + ' ' + token_list[k+1]['lemma']
k += 1
if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
ent = ent + ' ' + token_list[k+1]['lemma']
k += 1
if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
ent = ent + ' ' + token_list[k+1]['lemma']
k += 1
if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
ent = ent + ' ' + token_list[k+1]['lemma']
k += 1
if ent not in ents:
ents.append(ent)
except:
pass
k += 1
return ents
测试:
extract_unnamed_ents('Chancellor Angela Merkel and some of her ministers will discuss at a cabinet '
"retreat next week ways to avert driving bans in major cities after Germany's "
'top administrative court in February allowed local authorities to bar '
'heavily polluting diesel cars.')
出局:
['Chancellor Angela Merkel',
'minister',
'cabinet retreat',
'week way',
'ban',
'city',
'Germany',
'court',
'February',
'authority',
'diesel car']
答案 0 :(得分:0)
spacy
可以做到这一点,但我不确定它是否能为您提供确切的帮助
import spacy
text = """Chancellor Angela Merkel and some of her ministers will discuss
at a cabinet retreat next week ways to avert driving bans in
major cities after Germany's top administrative court
in February allowed local authorities to bar heavily
polluting diesel cars.
""".replace('\n', ' ')
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([i.text for i in doc.noun_chunks])
给予
['Chancellor Angela Merkel', 'her ministers', 'a cabinet retreat', 'ways', 'driving bans', 'major cities', "Germany's top administrative court", 'February', 'local authorities', 'heavily polluting diesel cars']
在这里,但是i.lemma_
行并没有真正提供您想要的内容(我认为这可能由this recent PR来解决)。
由于您可以像这样使用itertools.groupby
之后就完全不一样了
import itertools
out = []
for i, j in itertools.groupby(doc, key=lambda i: i.pos_):
if i not in ("PROPN", "NOUN"):
continue
out.append(' '.join(k.lemma_ for k in j))
print(out)
给予
['Chancellor Angela Merkel', 'minister', 'cabinet retreat', 'week way', 'ban', 'city', 'Germany', 'court', 'February', 'authority', 'diesel car']
这应该为您提供与函数完全相同的输出(此处的输出略有不同,但我相信这是由于spacy
版本不同所致。)
如果您真的很冒险,可以使用列表理解
out = [' '.join(k.lemma_ for k in j)
for i, j in itertools.groupby(doc, key=lambda i: i.pos_)
if i in ("PROPN", "NOUN")]
请注意,使用不同的spacy
版本时,我看到的结果略有不同。上面的输出来自spacy-2.1.8