Question

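I am trying to add entities defined by regular expressions to spaCy's NER pipeline. Ideally, I should be able to use any regular expression loaded from a JSON file that defines the entity types. As an example, I am trying to run the code below.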

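The code shows what I am trying to do; it follows the example in the spaCy documentation on using regular expressions with custom attributes. I have tried calling the "set_extension" method in various ways (on Doc, Span, Token), to no avail. I am not even sure what I should be setting them to.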

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_lg")
    matcher = Matcher(nlp.vocab)
    pattern = [{"_": {"country": {"REGEX": r"^[Uu](\.?|nited) ?[Ss](\.|tates)$"}}}]
    # spaCy v2 signature; in spaCy v3 this is matcher.add("US", [pattern])
    matcher.add("US", None, pattern)
    doc = nlp("I'm from the United States.")
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(match_id, string_id, start, end, span.text)

I expected match_id, string_id 3 4 United States to be printed.

Instead, I get AttributeError: [E046] Can't retrieve unregistered extension attribute 'country'. Did you forget to call the 'set_extension' method?

Answer 1

The documentation on extension attributes is here: https://spacy.io/usage/processing-pipelines#custom-components-attributes

Basically, you have to register country as an extension attribute, like this:

    from spacy.tokens import Token

    Token.set_extension("country", default="")

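However, in the code you quoted, you never actually set the _.country attribute on any token (or span), so they all keep the default value and the matcher will never find a match on them. The line you quoted:

    pattern = [{"_": {"country": {"REGEX": r"^[Uu](\.?|nited) ?[Ss](\.?|tates)$"}}}]

tries to match the US regex against the value of the custom attribute, not against the document text as (I think) you were expecting.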
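To make that concrete, here is a minimal sketch of how a "_" pattern behaves once the attribute is registered. It assumes spaCy v3 (where matcher.add takes a list of patterns) and uses spacy.blank("en") so that no model download is needed. Tagging token 5 by hand is purely hypothetical and only serves to show what the regex is compared against.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# Assumption: a blank English pipeline (tokenizer only) is enough for this demo.
nlp = spacy.blank("en")

# Register the attribute; force=True avoids an error if it is already registered.
Token.set_extension("country", default="", force=True)

matcher = Matcher(nlp.vocab)
pattern = [{"_": {"country": {"REGEX": r"^[Uu](\.?|nited) ?[Ss](\.?|tates)$"}}}]
matcher.add("US", [pattern])  # spaCy v3 signature (assumed here)

text = "I'm from the United States."

# Every token still carries the default "", so the regex has nothing to match.
doc_plain = nlp(text)
no_matches = matcher(doc_plain)

# Hypothetical: set the attribute by hand on the token "States" (index 5).
doc_tagged = nlp(text)
doc_tagged[5]._.country = "United States"
matches = matcher(doc_tagged)

print(no_matches)
print([(nlp.vocab.strings[m_id], start, end) for m_id, start, end in matches])
```

The pattern only fires on the token whose attribute value satisfies the regex, regardless of what the token text is.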

One solution is to run the regexes directly on the token text:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_lg")
    matcher = Matcher(nlp.vocab)
    # One dict per token: "United" (or "U"/"U.") followed by "States" (or "S"/"S.")
    pattern = [{"TEXT": {"REGEX": r"^[Uu](\.?|nited)$"}},
               {"TEXT": {"REGEX": r"^[Ss](\.?|tates)$"}}]
    # spaCy v2 signature; in spaCy v3 this is matcher.add("US", [pattern])
    matcher.add("US", None, pattern)
    doc = nlp("I'm from the United States.")
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(match_id, string_id, start, end, span.text)

which outputs

    15397641858402276818 US 4 6 United States

You can then use these matches to set custom attributes on spans or tokens (spans in this case, since your match may cover more than one token).
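A minimal sketch of that last step, under a few assumptions: spaCy v3 (matcher.add takes a list of patterns), a blank pipeline instead of en_core_web_lg so nothing has to be downloaded, and a Span extension named "country" borrowed from the question. Writing the matched spans to doc.ents is one possible way to surface them as entities, not necessarily what the asker needs.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: tokenizer-only pipeline, so doc.ents starts out empty.
nlp = spacy.blank("en")

# Span-level attribute, because a match may cover several tokens.
Span.set_extension("country", default="", force=True)

matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": r"^[Uu](\.?|nited)$"}},
           {"TEXT": {"REGEX": r"^[Ss](\.?|tates)$"}}]
matcher.add("US", [pattern])  # spaCy v3 signature (assumed here)

doc = nlp("I'm from the United States.")

spans = []
for match_id, start, end in matcher(doc):
    # Use the rule name ("US") as the entity label of the matched span.
    span = Span(doc, start, end, label=match_id)
    span._.country = nlp.vocab.strings[match_id]
    spans.append(span)

# doc.ents must not contain overlapping spans; keep the longest ones.
spans = filter_spans(spans)
doc.ents = spans

for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)
```

With a trained pipeline such as en_core_web_lg, doc.ents already holds the model's predictions, so the new spans would have to be merged with the existing entities (again resolving overlaps) rather than simply assigned.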
