spaCy NLP - position of an entity in a string, extracting nearby words

Date: 2018-05-17 08:07:07

Tags: spacy ner

Let's say I have a string and want to tag some entities in it, such as organizations.

string = I was working as a marketing executive for Bank of India, a 4 months..

string_tagged = I was working as a marketing executive for [Bank of India], a 4 months..

I want to identify the words next to the tagged entity. How can I find the position of the tagged entity and extract the words next to it?

My code:

import spacy    
nlp = spacy.load('en')
doc = nlp(string)
company = doc.text
for ent in doc.ents:
    if ent.label_ == 'ORG':
        company = company[:ent.start_char] + company[:ent.start_char -1] +company[:ent.end_char +1] 
print company 

2 answers:

Answer 0: (score: 1)

As I understand your question, you want the words next to the tokens tagged as ORG:

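import spacy
nlp = spacy.load('en')
# string = "blah blah"
doc = nlp(string)
company = ""
for i in range(1, len(doc) - 1):
    # ent_type_ is the entity label of a single token ('' if it is not part of an entity)
    if doc[i].ent_type_ == 'ORG':
        # previous word, tagged word and next one (tokens cannot be added, so join their text)
        company = ' '.join([doc[i - 1].text, doc[i].text, doc[i + 1].text])
print(company)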

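Note the bounds check for the first and last tokens: the loop starts at the second token and stops before the last one, so that doc[i-1] and doc[i+1] always exist.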
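
Since the question also asks for the position of the entity, here is a minimal sketch of the same idea done per entity span rather than per token. It assumes only attributes that spaCy spans expose: ent.start and ent.end (token indices, end exclusive) and ent.start_char and ent.end_char (character offsets). The helper names words_around and print_org_context, the window parameter, and the model name en_core_web_sm are illustrative assumptions, not part of the original answer.

```python
def words_around(doc, ent, window=1):
    """Return (words_before, words_after) for an entity span.

    Uses only ent.start / ent.end (token indices, end exclusive),
    len(doc), slicing of doc, and token.text.
    """
    lo = max(ent.start - window, 0)        # clamp at the start of the document
    hi = min(ent.end + window, len(doc))   # clamp at the end of the document
    before = [tok.text for tok in doc[lo:ent.start]]
    after = [tok.text for tok in doc[ent.end:hi]]
    return before, after


def print_org_context(text, model="en_core_web_sm"):
    # Assumption: spaCy and the named model are installed; the model name
    # is a placeholder for whichever pipeline you actually load.
    import spacy  # imported here so words_around() stays usable on its own

    nlp = spacy.load(model)
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "ORG":
            before, after = words_around(doc, ent)
            # start_char / end_char are character offsets into doc.text
            print(ent.text, (ent.start_char, ent.end_char), before, after)
```

Punctuation counts as a token, so for the example sentence the "word" after the entity may well be the comma. Whether "Bank of India" is tagged ORG at all depends on the model.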

Answer 1: (score: 0)

The following code works for me:

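import spacy
nlp = spacy.load('en')
doc = nlp(str_to_be_tokenized)
company = []
for ent in doc.ents:
    # keep only ORG entities, and skip names that were already collected
    if ent.label_ == 'ORG' and ent.text not in company:
        company.append(ent.text)
print(company)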

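The second condition in the if is there to extract only unique company names from my block of text. If you remove it, every occurrence of an 'ORG' entity will be added to your company list. Hope this works for you too.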
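
The uniqueness check can also be written without the repeated list lookup. This is only a sketch: the function name unique_org_names is made up for illustration, and it assumes each entity exposes label_ and text, as the entities in doc.ents do.

```python
def unique_org_names(ents):
    """Return the text of ORG entities in first-seen order, without duplicates."""
    names = (ent.text for ent in ents if ent.label_ == "ORG")
    # dict keys keep insertion order (Python 3.7+), so this de-duplicates in order
    return list(dict.fromkeys(names))
```

It would be called as unique_org_names(doc.ents) on a processed document.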