I am trying to remove stop words from a string, with one condition: named entities in the string must not be removed.
import spacy
nlp = spacy.load('en_core_web_sm')
text = "The Bank of Australia has an agreement according to the Letter Of Offer which states that the deduction should be made at the last date of each month"
doc = nlp(text)
If I check the named entities in the text, I get the following:
print(doc.ents)
(The Bank of Australia, the Letter Of Offer, the last date of each month)
The usual way to remove stop words looks like this:
[token.text for token in doc if not token.is_stop]
['Bank',
'Australia',
'agreement',
'according',
'Letter',
'Offer',
'states',
'deduction',
'date',
'month']
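The usual approach strips out exactly the meaning my task depends on. I want to keep the named entities.
I tried appending the named entities to the same list: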
list1 = [token.text for token in doc if not token.is_stop]
list2 = [str(a) for a in doc.ents]
list1 + list2
['Bank',
'Australia',
'agreement',
'according',
'Letter',
'Offer',
'states',
'deduction',
'date',
'month',
'The Bank of Australia',
'the Letter Of Offer',
'the last date of each month']
Is there another way to do this?
Answer 0 (score: 1)
You can use token.ent_iob_ or token.ent_type_ (see the API documentation) to check, at the token level, whether a token is part of an entity. So you probably want something like this:
print([token.text for token in doc if token.ent_type_ or not token.is_stop])
which returns
['The', 'Bank', 'of', 'Australia', 'agreement', 'according', 'the', 'Letter', 'Of', 'Offer', 'states', 'deduction', 'the', 'last', 'date', 'of', 'each', 'month']
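If you would rather keep each entity as a single string instead of separate tokens, one option is to walk through the doc and emit the whole entity whenever a token starts one. The following is only a minimal sketch of that idea, not part of the original answer. To keep it runnable without a trained model it uses spacy.blank("en") and sets the entity spans by hand to mirror the doc.ents shown in the question. The helper names span_for and drop_stop_words_keep_entities are hypothetical, and the labels for the second and third spans are assumptions. With en_core_web_sm loaded, you would skip the manual span setup and pass the doc straight to the function.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer + stop-word flags only),
# so the example does not depend on a downloaded model.
nlp = spacy.blank("en")
text = ("The Bank of Australia has an agreement according to the Letter Of Offer "
        "which states that the deduction should be made at the last date of each month")
doc = nlp(text)


def span_for(phrase, label):
    """Hypothetical helper: build an entity span for the first occurrence of phrase."""
    start = text.index(phrase)
    return doc.char_span(start, start + len(phrase), label=label)


# Entities set by hand to mirror the question's doc.ents.
# The labels for the last two spans are assumptions for illustration.
doc.ents = [
    span_for("The Bank of Australia", "ORG"),
    span_for("the Letter Of Offer", "WORK_OF_ART"),
    span_for("the last date of each month", "DATE"),
]


def drop_stop_words_keep_entities(doc):
    """Return entity texts as whole strings and non-stop-word tokens elsewhere."""
    ent_by_start = {ent.start: ent for ent in doc.ents}
    kept = []
    i = 0
    while i < len(doc):
        if i in ent_by_start:
            ent = ent_by_start[i]
            kept.append(ent.text)  # keep the entity intact, stop words included
            i = ent.end            # jump past the rest of the entity
        else:
            if not doc[i].is_stop:
                kept.append(doc[i].text)
            i += 1
    return kept


kept = drop_stop_words_keep_entities(doc)
print(kept)
# ['The Bank of Australia', 'agreement', 'according', 'the Letter Of Offer',
#  'states', 'deduction', 'the last date of each month']
```

Unlike list1 + list2 in the question, this keeps the original word order and does not repeat the entity words as separate tokens.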