Extract only names

Asked: 2018-08-22 03:59:17

Tags: regex python-3.x nlp spacy data-extraction

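I have data like the sample below and need to extract only the names. How can I do this? I am using spaCy and filtering entities with label_ == "PERSON", but this approach fails when a line contains only a single name, for example:

Ordered by: Potter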

The data looks like this:

Data="""

Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar    
"""        

Expected output:

Jacob Green
Potter
Morgan Olivia
Ali Zafar

2 answers:

Answer 0 (score: 2)

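I think your best bet is to assume that a person is whatever appears at the end of a line as one or more names (separated by spaces), each starting with an uppercase letter and otherwise containing only lowercase letters: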

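import re

# Split into lines and drop the blank ones so the results line up with the data
lines = [line for line in Data.split('\n') if line.strip()]

# You may want to play with the definition of a "name"
is_name = re.compile(r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s*$')

[is_name.findall(line) for line in lines]
#[['Jacob Green'],
# [],
# [],
# ['Potter'],
# [],
# ['Doctor'],
# ['Morgan Olivia'],
# [],
# ['Ali Zafar']]
# Note: the capitalized 'Doctor' also matches and has to be filtered out separately.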
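
As one possible way to tighten the definition of a "name", the sketch below anchors the pattern on the "Ordered by" prefix, skips an optional title, and drops matches found in a small stop-list. The names `NAME_RE`, `ROLE_WORDS` and `extract_names`, as well as the particular titles and role words, are illustrative assumptions and not part of the original answer.

```python
import re

DATA = """
Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar
"""

# Hypothetical stop-list: role words that look like names when capitalized.
ROLE_WORDS = {"doctor", "nurse", "physician", "md"}

# "Ordered by", an optional colon, an optional title (assumed list),
# then one or more capitalized words running to the end of the line.
NAME_RE = re.compile(
    r"^Ordered by:?\s+(?:(?:Dr|Mr|Mrs|Ms)\.?\s+)?"
    r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s*$"
)


def extract_names(text):
    """Return the names found on 'Ordered by' lines, skipping role words."""
    names = []
    for line in text.splitlines():
        match = NAME_RE.match(line.strip())
        if match and match.group(1).lower() not in ROLE_WORDS:
            names.append(match.group(1))
    return names


print(extract_names(DATA))
# ['Jacob Green', 'Potter', 'Morgan Olivia', 'Ali Zafar']
```

The stop-list only rejects a match that consists entirely of a role word, so a line such as "Ordered by: Doctor Smith" would still be returned whole; real data will likely need a longer list of titles and roles.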

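Answer 1 (score: 0)

Although NER fails in this case, POS tagging still labels the token as PROPN (proper noun). Perhaps you could use that as a second check on any line that yields no named entity?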

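import spacy

# Assumes the small English model is installed; any English pipeline with a tagger should work
nlp = spacy.load("en_core_web_sm")

Data="""
Ordered by: Potter
"""

doc = nlp(Data)

# Print each token with its coarse part-of-speech tag
for token in doc:
    print(token.text + ' ' + token.pos_)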

 SPACE
Ordered VERB
by ADP
: PUNCT
Potter PROPN
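
A minimal sketch of that double check might look like the following. It prefers PERSON entities and only falls back to runs of consecutive PROPN tokens when no entity is found. The helper name `names_from_doc` is hypothetical; the function relies only on the `ents`, `label_`, `text` and `pos_` attributes that spaCy documents and tokens expose.

```python
def names_from_doc(doc):
    """Return candidate names from a spaCy Doc (or any object shaped like one).

    PERSON entities are used when present; otherwise consecutive
    PROPN tokens are joined into one candidate name each.
    """
    names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    if names:
        return names

    # Fallback: group adjacent proper-noun tokens into a single name.
    current = []
    for token in doc:
        if token.pos_ == "PROPN":
            current.append(token.text)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names
```

It would be applied one line at a time, for example `names_from_doc(nlp(line))` for each non-empty line of `Data`. Words such as "Doctor" or "MD" may also be tagged PROPN, so the fallback will probably still need a stop-list of titles and roles.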