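I have a text: "Alphabet is a company. Also it is behind Google. But these are not the same"
When I apply the spaCy matcher, it returns the label, start, and end of each match.
Given the start and end, is it possible to expand the span to the whole sentence (all the words up to the period)?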
{
matches = self.matcher(doc)
spans = []
for label, start, end in matches:
    span = Span(doc, start, end, label=label)
}
So I expect the output to change from the first line below to the second:
Entities[('Alphabet','myORG'),('Google','myORG')]
Entities[('Alphabet is a company','myORG'),('Also it is behind Google','myORG')]
The code I am using:
{
from __future__ import unicode_literals, print_function

import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token


def main(text="Alphabet is a company. Also it is behind Google. But these are not the same", *companies):
    nlp = English()
    if not companies:
        companies = ['Alphabet', 'Google', 'Netflix', 'Apple']
    component = myFindingsMatcher(nlp, companies)
    nlp.add_pipe(component, last=True)
    doc = nlp(text)
    print('Entities', [(e.text, e.label_) for e in doc.ents])  # all orgs are entities


class myFindingsMatcher(object):
    name = 'myFindings_matcher'

    def __init__(self, nlp, companies=tuple(), label='myORG'):
        patterns = [nlp(finding_type) for finding_type in companies]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc


if __name__ == '__main__':
    plac.call(main)
}
Thanks.
Answer 0 (score: 0)
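The attribute e.sent refers to the sentence that contains the entity e.
A minimal working example with the pretrained model en_core_web_sm and its built-in named entity recognizer: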
import spacy

my_text = "Alphabet is a company. Also it is behind Google. But these are not the same"
nlp = spacy.load('en_core_web_sm')
doc = nlp(my_text)
print('Entities', [(e.text, e.label_, e.sent) for e in doc.ents])
This yields
Entities [('Alphabet', 'ORG', Alphabet is a company.), ('Google', 'ORG', Also it is behind Google.)]
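If you want to implement your own matcher with nlp = English(), you have to add a component that detects sentence boundaries (the call below uses the spaCy 2.x API; in spaCy 3.x the equivalent is nlp.add_pipe('sentencizer')):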
nlp.add_pipe(nlp.create_pipe('sentencizer'))
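and when you define the entity spans, you have to make sure that e.sent is set correctly. Note that by looking at the sentence offsets (counted in tokens), you can easily infer the correct span: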
print('Sentences', [(s.start, s.end, s.text) for s in doc.sents])
print('Entities', [(e.start, e.end, e.text, e.label_) for e in doc.ents])
which prints
Sentences [(0, 5, 'Alphabet is a company.'), (5, 11, 'Also it is behind Google.'), (11, 17, 'But these are not the same')]
Entities [(0, 1, 'Alphabet', 'ORG'), (9, 10, 'Google', 'ORG')]
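The offset-based inference can be sketched in plain Python, without spaCy, using the token offsets printed above. This is only an illustration: the helper name expand_to_sentence and the tuple layout are assumptions made for this sketch, not part of spaCy's API.

```python
# Token offsets copied from the printed output above.
sentences = [(0, 5), (5, 11), (11, 17)]      # (start, end) of each sentence
entities = [(0, 1, "ORG"), (9, 10, "ORG")]   # (start, end, label) of each match


def expand_to_sentence(start, end, sentences):
    """Return the (start, end) of the sentence that contains the token range.

    Hypothetical helper: it assumes half-open token ranges, as in the
    printed output, and that a match never crosses a sentence boundary.
    """
    for s_start, s_end in sentences:
        if s_start <= start and end <= s_end:
            return s_start, s_end
    raise ValueError("no sentence contains tokens %d..%d" % (start, end))


# Replace each match's offsets with the offsets of its enclosing sentence.
expanded = [expand_to_sentence(start, end, sentences) + (label,)
            for start, end, label in entities]
print(expanded)  # [(0, 5, 'ORG'), (5, 11, 'ORG')]
```

In the custom component, the expanded start and end could then be passed to Span(doc, start, end, label=label) in place of the match offsets. Two matches in the same sentence would produce the same span twice, so duplicates would need to be dropped before assigning doc.ents, which rejects overlapping spans.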