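AssertionError when trying to add a new entity in spaCy using the Matcher

Time: 2017-11-29 20:49:15

Tags: named-entity-recognition spacy

I am trying to match all email-like text in a set of documents and add it to a custom NER label called 'EMAIL'. Here is the code for a test case.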

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

EMAIL = nlp.vocab.strings['EMAIL']

def add_email_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((EMAIL, start, end),)

matcher.add('EmailPII', add_email_ent, [{'LIKE_EMAIL': True}])

text = u"Hi, this is John. My email is john@ymail.com and an alternate is john@gmail.com"
doc = nlp(text)

matches = matcher(doc)
for i,[match_id, start, end] in enumerate(matches):
    print (i+1, doc[start:end])

for ent in doc.ents:
    print (ent.text, ent.label_)

This is what I get when I run this code.

Traceback (most recent call last):
  File "C:/Python27/emailpii.py", line 26, in <module>
    matches = matcher(doc)
  File "matcher.pyx", line 407, in spacy.matcher.Matcher.__call__
  File "C:/Python27/emailpii.py", line 19, in add_event_ent
    doc.ents += ((EMAIL, start, end),)
  File "doc.pyx", line 415, in spacy.tokens.doc.Doc.ents.__get__
  File "span.pyx", line 61, in spacy.tokens.span.Span.__cinit__
AssertionError: 17587345535198158200

However, when I run a similar example:

import spacy


print "*****************"
print(spacy.__version__)
print "*****************"


from spacy.matcher import Matcher
#from spacy import displacy

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

EVENT = nlp.vocab.strings['EVENT']

def add_event_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((EVENT, start, end),)

matcher.add('GoogleIO', add_event_ent,
            [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
            [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}, {'IS_DIGIT': True}])

text = u"Google I/O was great this year. See you all again in Google I/O 2018"
doc = nlp(text)

matches = matcher(doc)
for i,[match_id, start, end] in enumerate(matches):
    print (i, doc[start:end])

for ent in doc.ents:
    print (ent.text, ent.label_)

#displacy.serve(doc, style = 'ent')

I get the desired output:

2.0.1
(0, Google I/O)
(1, Google I/O)
(2, Google I/O 2018)
(u'Google I/O', u'EVENT')
(u'this year', u'DATE')
(u'Google I/O 2018', u'EVENT')

Am I missing something here?

1 answer:

Answer 0 (score: 0)

I think your first code fails because you have not added an entity label for 'EMAIL'. The second code works because EVENT is a pre-existing entity type.
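To make the difference between hashing a label and registering it concrete, here is a small pure-Python toy model. It is only a sketch under stated assumptions: `ToyStringStore` and `make_entity` are hypothetical names, not spaCy classes or functions, and the idea that the failing assertion checks whether the label's hash is known to the vocab is an inference from the traceback (which prints a bare hash), not something confirmed here.

```python
import hashlib


class ToyStringStore(object):
    """Hypothetical stand-in for a vocab string store (not spaCy's real class)."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(string):
        # Any stable 64-bit hash is good enough for the illustration.
        digest = hashlib.sha256(string.encode("utf8")).hexdigest()
        return int(digest[:16], 16)

    def __getitem__(self, string):
        # Assumption: a plain lookup only computes the hash and stores nothing.
        return self._hash(string)

    def add(self, string):
        # Registering stores the string so its hash can be resolved later.
        key = self._hash(string)
        self._by_hash[key] = string
        return key

    def __contains__(self, key):
        return key in self._by_hash

    def resolve(self, key):
        return self._by_hash[key]


def make_entity(store, label, start, end):
    # Assumed shape of the check that fails with "AssertionError: <hash>".
    assert label in store, label
    return (store.resolve(label), start, end)


store = ToyStringStore()
email = store["EMAIL"]  # hash computed, but 'EMAIL' is not registered

try:
    make_entity(store, email, 9, 10)
    failed = False
except AssertionError:
    failed = True  # unknown label: the assertion reports the raw hash

store.add("EMAIL")  # register the label
entity = make_entity(store, email, 9, 10)
print(failed, entity)
```

In this toy, the same hash is rejected before `add()` and accepted after it, which matches the pattern in the question: a label that is already known (EVENT) works, while one that was only hashed (EMAIL) does not.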

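The documentation is not clear about what the first argument of matcher.add() actually is, but it appears to add an entity label for you. Here are two alternatives that should solve the problem:

Alternative 1: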

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

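# EMAIL = nlp.vocab.strings['EMAIL']  # Not needed

def add_email_ent(matcher, doc, i, matches):
    # match_id corresponds to the key passed to matcher.add() ('EMAIL'),
    # so it is reused directly as the entity label.
    match_id, start, end = matches[i]
    doc.ents += ((match_id, start, end),)

matcher.add('EMAIL', add_email_ent, [{'LIKE_EMAIL': True}])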

text = u"Hi, this is John. My email is john@ymail.com and an alternate is john@gmail.com"
doc = nlp(text)

matches = matcher(doc)
for i,[match_id, start, end] in enumerate(matches):
    print (i+1, doc[start:end])

for ent in doc.ents:
    print (ent.text, ent.label_)

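Alternative 2 (I am not sure why you would want to do it this way, since you end up with two entity labels that serve essentially the same purpose, but it is provided for illustration):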

import spacy
from spacy.matcher import Matcher
from spacy.pipeline import EntityRecognizer

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
ner = EntityRecognizer(nlp.vocab)

ner.add_label('EMAIL')

EMAIL = nlp.vocab.strings['EMAIL']

def add_email_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((EMAIL, start, end),)

matcher.add('EmailPII', add_email_ent, [{'LIKE_EMAIL': True}])

text = u"Hi, this is John. My email is john@ymail.com and an alternate is john@gmail.com"
doc = nlp(text)

matches = matcher(doc)
for i,[match_id, start, end] in enumerate(matches):
    print (i+1, doc[start:end])

for ent in doc.ents:
    print (ent.text, ent.label_)