PhraseMatcher: matching on a different token attribute

Date: 2019-07-19 09:27:22

Tags: nlp spacy

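We would like to match a set of phrases using PhraseMatcher. However, we want to match not only the verbatim text but also a normalized version of the input, for example lowercased, with accents removed, and so on.

We tried adding a custom extension attribute to Token and passing it to the PhraseMatcher at init time so that matching happens on that attribute, but it did not work.

We could use a custom pipeline component to transform the text, but we want to keep the original text so that spaCy's other components can still use it.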

def deaccent(text):
    ...
    return modified_text


def get_normalization(doc):
    return deaccent(doc.text)


Token.set_extension('get_norm', getter=get_normalization)


patterns_ = [{"label": "TECH", "pattern": "java"}]
ruler = EntityRuler(nlp, phrase_matcher_attr="get_norm")
ruler.add_patterns(patterns_)


nlp.add_pipe(ruler)

How can this be done?

1 answer:

Answer 0 (score: 1)

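Since EntityRuler is based on PhraseMatcher, I am copying a working example for spaCy v2.2.0 here. Follow the comments to see how the tokens' "NORM" attribute is used.

At the end, you will see that the word "FÁCIL" matches the pattern "facil":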

import re
import spacy 
from unicodedata import normalize
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.lang.es import Spanish

# Custom pipeline component that overwrites each token's built-in norm_ attribute
class Deaccentuate(object):
    def __init__(self, nlp):
        self._nlp = nlp

    def __call__(self, doc):
        for token in doc:
            token.norm_ = self.deaccent(token.lower_)  # write norm_ attribute!
        return doc

    @staticmethod
    def deaccent(text):
        """ Remove accentuation from the given string """
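        # Strip combining marks after decomposition, but keep the tilde on "n" (ñ).
        # count/flags are passed by keyword; passing them positionally is deprecated.
        text = re.sub(
            r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
            normalize("NFD", text), count=0, flags=re.I
        )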
        return normalize("NFC", text)


nlp = Spanish()
# Add component to pipeline
custom_component = Deaccentuate(nlp)
nlp.add_pipe(custom_component, first=True, name='normalizer')
# Initialize the matcher; attr="NORM" makes it compare each token's norm_ value
matcher = PhraseMatcher(nlp.vocab, attr="NORM")
patterns_ = nlp.pipe(['facil', 'dificil'])
matcher.add('MY_ENTITY', None, *patterns_)

# Run an example and print results
doc = nlp("esto es un ejemplo FÁCIL")
matches = matcher(doc)
for match_id, start, end in matches:
    span = Span(doc, start, end, label=match_id)
    print("MATCHED: " + span.text)
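
The `deaccent` helper can also be checked on its own, without spaCy. The sketch below copies the same regular expression into a standalone function (the sample words are my own, not from the original post) and feeds it lowercased input, as the component does with `token.lower_`. The nested negative lookahead is what keeps the tilde on "ñ" while the other combining marks are dropped.

```python
import re
from unicodedata import normalize

# Same pattern as in the answer: strip combining marks (U+0300-U+036F),
# except a lone combining tilde (U+0303) directly after "n", so "ñ" survives.
_ACCENTS = re.compile(
    r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+",
    flags=re.I,
)


def deaccent(text):
    """Remove accents from text, keeping the Spanish letter ñ intact."""
    decomposed = normalize("NFD", text)          # split letters from their accents
    stripped = _ACCENTS.sub(r"\1", decomposed)   # keep the base letter only
    return normalize("NFC", stripped)            # recompose what is left (e.g. ñ)


for word in ["FÁCIL", "difícil", "Español"]:
    print(word, "->", deaccent(word.lower()))
```

This should print `facil`, `dificil` and `español` for the three words, which is why the pattern "facil" and the token "FÁCIL" end up with the same NORM value in the example above.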

This bug was fixed in v2.1.8: https://github.com/explosion/spaCy/issues/4002
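
Why writing to the norm attribute, rather than rewriting the text, meets the requirement in the question can be seen with a toy model that does not use spaCy at all. In the sketch below, `ToyToken`, `simple_norm`, `tokenize` and `match_on_norm` are hypothetical names invented for illustration, and the accent stripping is a simplified stand-in: it drops every combining mark, so unlike the answer's regex it would also turn "ñ" into "n". Each token keeps its original text next to a normalized key; matching compares the keys, and results are reported from the untouched text.

```python
from dataclasses import dataclass
from unicodedata import combining, normalize


def simple_norm(text):
    """Lowercase and drop all combining marks (simplified stand-in)."""
    decomposed = normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not combining(ch))


@dataclass
class ToyToken:
    text: str   # original surface form, never modified
    norm: str   # normalized key, used only for matching


def tokenize(sentence):
    """Whitespace tokenizer that fills in both attributes."""
    return [ToyToken(word, simple_norm(word)) for word in sentence.split()]


def match_on_norm(tokens, phrase):
    """Return (start, end) spans whose norm sequence equals the phrase's."""
    target = [simple_norm(word) for word in phrase.split()]
    size = len(target)
    if size == 0:
        return []
    return [
        (i, i + size)
        for i in range(len(tokens) - size + 1)
        if [t.norm for t in tokens[i:i + size]] == target
    ]


tokens = tokenize("esto es un ejemplo FÁCIL")
for start, end in match_on_norm(tokens, "facil"):
    # The span is printed from the original text, accents and case intact.
    print("MATCHED:", " ".join(t.text for t in tokens[start:end]))
```

The match is found through the normalized key, yet the printed span is still "FÁCIL", which mirrors what the PhraseMatcher does when it is created with `attr="NORM"`.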