I want to extract person names from a text file. I am using nltk, but it returns names along with words that are not names:
import nltk
from nltk import ne_chunk, pos_tag

def extract_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = pos_tag(tokens)
    sentt = ne_chunk(pos, binary=False)
    person_list = []
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        # Leaves are (word, tag) pairs; keep only the words of this chunk
        person = [leaf[0] for leaf in subtree.leaves()]
        if len(person) > 1:  # avoid grabbing lone surnames
            name = ''
            for part in person:
                name += part + ' '
            # remove_useless_name is a helper defined elsewhere (not shown)
            name = remove_useless_name(name)
            # name[:-1] drops the trailing space added in the loop above
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
    return person_list
I want to remove the words that are not names. Which method should I use to remove them? Example input:
"Sunder Pichai"
"View Profile"
"Risk Management"
Example output:
"Sunder Pichai"
Answer 0 (score: 0)
Perhaps use a dictionary, and check whether all parts of the name are real words and/or whether the surname is a known name.
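The dictionary check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: `COMMON_WORDS` and `KNOWN_NAMES` are tiny hypothetical sets made up for the example, and `looks_like_name` and `filter_names` are hypothetical helper names, not part of nltk. In practice the sets would probably be loaded from a real word list and a names list, for instance the `nltk.corpus.words` and `nltk.corpus.names` corpora, the latter of which covers first names only.

```python
# Hypothetical, hand-made word lists for illustration only.
# A real setup would load a full dictionary and a list of known names.
COMMON_WORDS = {"view", "profile", "risk", "management"}
KNOWN_NAMES = {"sunder", "pichai"}


def looks_like_name(candidate, common_words=COMMON_WORDS, known_names=KNOWN_NAMES):
    """Heuristic check: is this candidate string likely a person name?"""
    parts = candidate.lower().split()
    if not parts:
        return False
    # Accept if any part is a known name.
    if any(part in known_names for part in parts):
        return True
    # Otherwise reject only if every part is an ordinary dictionary word.
    return not all(part in common_words for part in parts)


def filter_names(candidates):
    """Keep only the candidates that pass the dictionary heuristic."""
    return [c for c in candidates if looks_like_name(c)]


print(filter_names(["Sunder Pichai", "View Profile", "Risk Management"]))
```

With these sets, only "Sunder Pichai" survives the filter. The heuristic will still misjudge names that are also ordinary words (such as "Grace" or "Hope"), so it is probably best applied as a filter on the output of `extract_names` and not as a replacement for it.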