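I have data like the sample below, and I need to extract only the names from it. How can I do that? I am using spaCy and filtering entities on label_ == "PERSON", but this approach fails when a line contains only a single name, for example: Ordered by: Potter
The data looks like this: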
Data="""
Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar
"""
Expected output:
Jacob Green
Potter
Morgan Olivia
Ali Zafar
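Answer 0 (score: 2)
I think your best bet is to assume that a person's name appears at the end of a line and consists of one or more space-separated words, each starting with an uppercase letter followed only by lowercase letters: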
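import re

# strip() drops the blank first and last lines of the triple-quoted string
lines = Data.strip().split('\n')
# You may want to play with the definition of a "name"
is_name = re.compile(r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s*$')
[is_name.findall(line) for line in lines]
#[['Jacob Green'],
# [],
# [],
# ['Potter'],
# [],
# ['Doctor'],   <- capitalized role words match too and need filtering
# ['Morgan Olivia'],
# [],
# ['Ali Zafar']]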
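Note that the pattern above also matches a capitalized role word such as "Doctor", which is not in the expected output. A minimal sketch of one way to handle that, assuming a small hand-maintained stoplist is acceptable: `ROLE_WORDS`, `NAME_AT_END`, and `extract_names` are hypothetical names introduced here for illustration, not part of the original answer.

```python
import re

# Hypothetical stoplist of capitalized role words; extend it for your data.
ROLE_WORDS = {"Doctor", "Nurse", "Physician"}

# Same pattern as in the answer: capitalized words at the end of a line.
NAME_AT_END = re.compile(r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s*$')

def extract_names(text):
    """Return the trailing capitalized name of each line, skipping role words."""
    names = []
    for line in text.strip().splitlines():
        match = NAME_AT_END.search(line)
        if match and match.group(1) not in ROLE_WORDS:
            names.append(match.group(1))
    return names

data = """
Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar
"""

print(extract_names(data))
# ['Jacob Green', 'Potter', 'Morgan Olivia', 'Ali Zafar']
```

A stoplist only covers the role words you anticipate, so this is a heuristic rather than a general solution.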
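Answer 1 (score: 0)
Although NER fails in this case, the POS tagger still labels the token as PROPN (proper noun). Perhaps you could use that as a fallback check for any line that yields no named entity?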
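import spacy

# Assumption: the small English model; the original does not say which model was loaded
nlp = spacy.load("en_core_web_sm")

Data="""
Ordered by: Potter
"""
doc = nlp(Data)
for token in doc:
    print(token.text + ' ' + token.pos_)
Output:
SPACE
Ordered VERB
by ADP
: PUNCT
Potter PROPN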
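The fallback idea could be sketched roughly as follows. To keep the example runnable without a model installed, the selection logic works on plain (text, pos, ent_type) triples; `pick_name`, `triples`, and `ROLE_WORDS` are hypothetical names introduced for illustration. The helper `triples` assumes `nlp` is an already loaded spaCy pipeline and uses only the token attributes `text`, `pos_`, and `ent_type_`. Actual tags depend on the model, so treat this as a starting point.

```python
# Hypothetical stoplist: role words a tagger might also label PROPN.
ROLE_WORDS = {"doctor", "dr", "dr.", "md", "nurse", "physician"}

def pick_name(tokens):
    """tokens: list of (text, pos, ent_type) triples for one line.

    Prefer tokens the NER marked as PERSON; otherwise fall back to
    proper nouns (PROPN) that are not known role words.
    """
    person = [text for text, _pos, ent in tokens if ent == "PERSON"]
    if person:
        return " ".join(person)
    propn = [text for text, pos, _ent in tokens
             if pos == "PROPN" and text.lower() not in ROLE_WORDS]
    return " ".join(propn) or None

def triples(line, nlp):
    """Build the triples from a loaded spaCy pipeline passed in as nlp."""
    return [(t.text, t.pos_, t.ent_type_) for t in nlp(line)]

# Hand-written triples mirroring the tagger output shown above,
# where NER produced no entity for "Potter".
potter = [("Ordered", "VERB", ""), ("by", "ADP", ""),
          (":", "PUNCT", ""), ("Potter", "PROPN", "")]
print(pick_name(potter))  # Potter
```

With a real pipeline you would call `pick_name(triples(line, nlp))` for each line of the data.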