Question

import re
import spacy
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.corpus import wordnet

inputfile = open('file.txt', 'r')
String= inputfile.read()
nlp = spacy.load('en_core_web_sm')

def candidate_name_extractor(input_string, nlp):
    input_string = str(input_string)

    doc = nlp(input_string)
    print(doc)
    # Extract entities
    doc_entities = doc.ents
    #print(doc_entities)
    # Subset to person type entities
    doc_persons = filter(lambda x: x.label_ == 'PERSON', doc_entities)
    doc_persons = filter(lambda x: len(x.text.strip().split()) >= 2, doc_persons)
    doc_persons = list(map(lambda x: x.text.strip(), doc_persons))
    candidate_name = doc_persons[0]
    return candidate_name

if __name__ == '__main__':
    names = candidate_name_extractor(String, nlp)

print(names)

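When I run this program, printing nlp(input_string) works fine. But when it executes the line "doc_entities = doc.ents", it drops the top three lines, which contain the name, mobile number and email, and those are the lines I want to extract the name from.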

Where is the problem? Is it that "doc_entities = doc.ents" is not working properly, or is it something else?

Answer 1

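I rewrote your example slightly to make it more straightforward, and it works fine for me. I think many of the problems you ran into may be related to the nested filtering and string manipulation - you can easily remove those, because spaCy already does all of that for you.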

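Here, I'm just iterating over the entities found in the text. If the entity label is PERSON and the entity consists of two or more tokens, we assume it's a person's name and print the entity text. Each entry in doc.ents is a Span object, and its length is the number of tokens it contains. So there's no need to split the string - you can just use len(ent).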

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"John Smith")  # or whatever else

for ent in doc.ents:
    if ent.label_ == 'PERSON' and len(ent) >= 2:
        print('The candidate name is: ', ent.text)  # do something with the entity
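
One detail worth guarding against: the function in the question returns doc_persons[0], which raises an IndexError whenever no entity passes both filters. Below is a minimal sketch of a safer variant. The names first_person_name and min_tokens are hypothetical (they are not part of spaCy), and the function only assumes that each entity exposes label_, text and a token length, as a spaCy Span does.

```python
def first_person_name(ents, min_tokens=2):
    """Return the text of the first PERSON entity that has at least
    `min_tokens` tokens, or None if no entity qualifies."""
    for ent in ents:
        # For a spaCy Span, len(ent) is the number of tokens it covers.
        if ent.label_ == "PERSON" and len(ent) >= min_tokens:
            return ent.text.strip()
    # Nothing matched: report that explicitly instead of indexing an empty list.
    return None
```

With a loaded pipeline this would be called as first_person_name(nlp(text).ents). A None result means the model did not find a multi-token PERSON entity in that text.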

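Of course, this only returns a result if spaCy's model actually recognizes the name as an entity. The default models are trained on general-purpose news and web text, so if you're working with résumés, you may not get perfect results without fine-tuning the model for your use case.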

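It's not entirely clear from your question what you want to do with the phone number or the email address - but all of that information is still available in your doc. For example, if you also wanted to extract the email address, the simplest way is to use the like_email attribute on the tokens: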

doc = nlp(u"John Smith, john@smith.com")
email_addresses = [token.text for token in doc if token.like_email]
# ['john@smith.com']
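
To tie the two pieces together, here is a rough sketch that gathers names and email addresses from one parsed document. collect_contacts is a hypothetical helper name, not a spaCy function. It relies only on the attributes already used in this answer: doc.ents, label_, len(ent), text, and like_email on tokens.

```python
def collect_contacts(doc):
    """Collect multi-token PERSON entities and email-like tokens from a doc."""
    # Same rule as above: a PERSON entity with two or more tokens counts as a name.
    names = [ent.text for ent in doc.ents
             if ent.label_ == "PERSON" and len(ent) >= 2]
    # Iterating over a doc yields its tokens; like_email flags email-shaped ones.
    emails = [token.text for token in doc if token.like_email]
    return {"names": names, "emails": emails}
```

Called as collect_contacts(nlp(text)), it returns a dict with two lists, either of which may be empty if the model or the tokenizer finds nothing.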
