Using custom token extensions in spaCy's Matcher

Date: 2020-07-26 10:13:35

Tags: python methods nlp spacy matcher

I just added the following extension to spaCy's Token. I want to check whether a token has a certain dependency label among its children, so I did the following:

from spacy.tokens import Token
has_dep = lambda token,name: name in [child.dep_ for child in token.children]
Token.set_extension('HAS_DEP', method=has_dep)

Running

doc = nlp(u'We are walking around.')
walking = doc[2]
walking._.HAS_DEP('nsubj')

outputs True, because 'walking' has a child whose dependency label is 'nsubj' (the word 'We').

However, I don't know how to use this extension in spaCy's Matcher. Below is what I wrote; I expected it to output True, but it doesn't seem to work:

walking
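The asker's Matcher code was lost from the page, but the likely cause of the failure can be sketched: Matcher patterns on the `_` key compare the extension's stored *value*, and for a `method=` extension that value is a bound callable, never `True`. This is an illustrative sketch (not the asker's original code), using a blank pipeline so no model download is needed:

```python
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")  # blank pipeline, no statistical model required
Token.set_extension(
    "HAS_DEP",
    method=lambda token, name: name in [child.dep_ for child in token.children],
    force=True,
)

token = nlp("walking")[0]
# The raw attribute value is a bound method, so a pattern like
# {"_": {"HAS_DEP": True}} compares a callable to True and never matches.
print(callable(token._.HAS_DEP))
```

This is why the accepted answer below switches to a `getter=` extension, whose value is a plain boolean the Matcher can compare.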

2 Answers:

Answer 0 (score: 1):

I think you can achieve what you want with a getter:

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token
has_dep = lambda token: 'nsubj' in [child.dep_ for child in token.children]
Token.set_extension('HAS_DEP_NSUBJ', getter=has_dep, force=True)

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)
matcher.add("depnsubj", None, [{"_": {"HAS_DEP_NSUBJ": True}}])

doc = nlp("We're walking around the house.")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc[start:end]
    print(span)

walking
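For readers without en_core_web_md installed, the same getter-based pattern can be verified on a hand-annotated parse. This sketch assumes spaCy v3, where the Doc constructor accepts heads and deps and Matcher.add takes a list of patterns (the answer above uses the older v2 add signature):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Token

nlp = spacy.blank("en")  # no model download needed
Token.set_extension(
    "HAS_DEP_NSUBJ",
    getter=lambda t: "nsubj" in [child.dep_ for child in t.children],
    force=True,
)

# Hand-annotated parse of "We are walking around .":
words = ["We", "are", "walking", "around", "."]
heads = [2, 2, 2, 2, 2]  # every token attaches to "walking", the root
deps = ["nsubj", "aux", "ROOT", "advmod", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

matcher = Matcher(nlp.vocab)
matcher.add("depnsubj", [[{"_": {"HAS_DEP_NSUBJ": True}}]])  # v3 signature

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

Only "walking" matches, because it is the only token with an nsubj child.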

Answer 1 (score: 0):

I think you could instead use doc.retokenize() and token.head, like this:

from spacy.matcher import Matcher
import en_core_web_sm

nlp = en_core_web_sm.load()

matcher = Matcher(nlp.vocab)
pattern = [{'DEP': 'nsubj'}]
matcher.add("depnsubj", None, pattern)

doc = nlp("We're walking around the house.")
matches = matcher(doc)

matched_spans = []
for match_id, start, end in matches:
    span = doc[start:end]
    matched_spans.append(span)

with doc.retokenize() as retokenizer:
    for span in matched_spans:
        retokenizer.merge(span)
        for token in span:
            print(token.head)

Output:

walking
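Note that the retokenize step in this answer only matters when a match spans several tokens; for the single-token {'DEP': 'nsubj'} match, the head can be read directly from the matched token. A model-free sketch of that shortcut, again on a hand-annotated parse and assuming spaCy v3:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Hand-annotated parse of "We are walking around ." (no model needed):
words = ["We", "are", "walking", "around", "."]
heads = [2, 2, 2, 2, 2]  # every token attaches to "walking", the root
deps = ["nsubj", "aux", "ROOT", "advmod", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

matcher = Matcher(nlp.vocab)
matcher.add("depnsubj", [[{"DEP": "nsubj"}]])  # spaCy v3 add() signature

for _, start, end in matcher(doc):
    print(doc[start].head.text)  # head of the matched nsubj token
```

This prints the verb governing each matched subject, without mutating the Doc.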