Let's say I have a string and want to tag some entities in it, such as organizations.
string = I was working as a marketing executive for Bank of India, a 4 months..
string_tagged = I was working as a marketing executive for [Bank of India], a 4 months..
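I want to identify the words next to the tagged entity. How can I find the position of the tagged entity and extract the words next to it?
My code: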
import spacy
nlp = spacy.load('en')
doc = nlp(string)
company = doc.text
for ent in doc.ents:
    if ent.label_ == 'ORG':
        # my attempt at using the entity's character offsets; it does not give the result I want
        company = company[:ent.start_char] + company[:ent.start_char -1] +company[:ent.end_char +1]
        print(company)
Answer 0 (score: 1)
As I understand your question, you want the words next to the tokens tagged as ORG:
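import spacy
nlp = spacy.load('en')
#string = "blah blah"
doc = nlp(string)
company = ""
# start at 1 and stop at len(doc) - 1 so that doc[i-1] and doc[i+1] always exist
for i in range(1, len(doc) - 1):
    # a Token exposes its entity label as ent_type_ (there is no token.ent attribute)
    if doc[i].ent_type_ == 'ORG':
        # previous word, tagged word and next one; Token objects cannot be added, so join their text
        company = doc[i-1].text + " " + doc[i].text + " " + doc[i+1].text
        print(company)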
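Note the checks on the first and last tokens: the loop skips them so that doc[i-1] and doc[i+1] never go out of range. An entity token at the very start or end of the text is therefore not reported unless you handle those two positions separately.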
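For an entity that spans several tokens, such as Bank of India, the loop above reports neighbours for each token of the entity in turn. A minimal sketch of an alternative is to work from the entity spans in doc.ents and use their token indices (start, and the exclusive end). The helper name neighbor_words is hypothetical and not part of spaCy; it only assumes a doc produced by nlp(string) as in the code above.

```python
def neighbor_words(doc, label="ORG"):
    """Return (previous word, entity text, next word) for each entity with `label`.

    `doc` is assumed to be a spaCy Doc. The helper name is made up for this sketch.
    """
    results = []
    for ent in doc.ents:
        if ent.label_ != label:
            continue
        # ent.start and ent.end are token indices; ent.end is exclusive
        prev_word = doc[ent.start - 1].text if ent.start > 0 else None
        next_word = doc[ent.end].text if ent.end < len(doc) else None
        results.append((prev_word, ent.text, next_word))
    return results
```

Calling neighbor_words(doc) after doc = nlp(string) gives one tuple per ORG entity, with None where the entity sits at the very start or end of the text. Punctuation counts as a token, so the "next word" may be a comma; filtering on token.is_punct would be one way to skip it.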
Answer 1 (score: 0)
The following code works for me:
doc = nlp(str_to_be_tokenized)
company = []
for ent in doc.ents:
    if ent.label_ == 'ORG' and ent.text not in company:
        company.append(ent.text)
print(company)
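The second condition in the if is there so that only unique company names are extracted from my block of text. If you remove it, every occurrence of an 'ORG' entity will be added to your company list. Hope this works for you too.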
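To get the bracketed string_tagged from the question, the entities' character offsets can be used instead of their text. This is a rough sketch, not tested against any particular spaCy version: tag_spans is a hypothetical helper, and the spans are assumed to be non-overlapping (start_char, end_char) pairs, for example [(ent.start_char, ent.end_char) for ent in doc.ents if ent.label_ == 'ORG'].

```python
def tag_spans(text, spans, open_mark="[", close_mark="]"):
    """Wrap each (start, end) character span of `text` in brackets.

    `spans` holds non-overlapping (start_char, end_char) pairs, end exclusive.
    The helper name and parameters are made up for this sketch.
    """
    # Insert from the last span backwards so earlier offsets stay valid
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + open_mark + text[start:end] + close_mark + text[end:]
    return text
```

Working from the end of the string backwards avoids having to adjust the offsets after each insertion.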