We want to use PhraseMatcher to match a set of phrases, but not only against the verbatim text: we also want to match a normalized version of the input (lowercased, accents removed, and so on).
We tried adding a custom extension to the tokens and passing it to the PhraseMatcher in its init, but it does not work.
We could transform the text in a custom pipeline component, but we want to keep the original text available so that spaCy's other components still work.
def deaccent(text):
    ...
    return modified_text

def get_normalization(doc):
    return deaccent(doc.text)

Token.set_extension('get_norm', getter=get_normalization)

patterns_ = [{"label": "TECH", "pattern": "java"}]
ruler = EntityRuler(nlp, phrase_matcher_attr="get_norm")
ruler.add_patterns(patterns_)
nlp.add_pipe(ruler)
How can this be done?
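For reference, a deaccent helper like the one elided above can be sketched with the standard library alone. This is a minimal sketch, not the asker's actual implementation; the negative lookahead on n keeps the Spanish letter ñ intact while stripping other combining marks:

```python
import re
from unicodedata import normalize

def deaccent(text):
    # Decompose to NFD, then drop combining marks -- except the tilde
    # that forms the Spanish letter ñ, which should survive.
    stripped = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+",
        r"\1",
        normalize("NFD", text), 0, re.I,
    )
    return normalize("NFC", stripped)

print(deaccent("FÁCIL"))   # FACIL
print(deaccent("mañana"))  # mañana (the ñ is preserved)
```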
Answer (score: 1)
Since EntityRuler is based on PhraseMatcher, I am copying a working example for spaCy v2.2.0 here. Follow the comments to see how the token's NORM attribute is used.
At the end you will see how the word "FÁCIL" is matched by the pattern "facil".
import re
import spacy
from unicodedata import normalize
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.lang.es import Spanish

# Custom pipeline component that overwrites the "norm" attribute of each token
class Deaccentuate(object):
    def __init__(self, nlp):
        self._nlp = nlp

    def __call__(self, doc):
        for token in doc:
            token.norm_ = self.deaccent(token.lower_)  # write the norm_ attribute!
        return doc

    @staticmethod
    def deaccent(text):
        """Remove accentuation from the given string."""
        text = re.sub(
            r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
            normalize("NFD", text), 0, re.I
        )
        return normalize("NFC", text)

nlp = Spanish()

# Add the component to the pipeline
custom_component = Deaccentuate(nlp)
nlp.add_pipe(custom_component, first=True, name='normalizer')

# Initialize the matcher with the patterns to be matched
matcher = PhraseMatcher(nlp.vocab, attr="NORM")  # match on the NORM attribute of tokens
patterns_ = nlp.pipe(['facil', 'dificil'])
matcher.add('MY_ENTITY', None, *patterns_)

# Run an example and print the results
doc = nlp("esto es un ejemplo FÁCIL")
matches = matcher(doc)
for match_id, start, end in matches:
    span = Span(doc, start, end, label=match_id)
    print("MATCHED: " + span.text)
This bug was fixed in v2.1.8: https://github.com/explosion/spaCy/issues/4002
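To make the mechanism concrete without spaCy installed, here is a hypothetical, spaCy-free sketch of what matching on a normalized attribute amounts to: each token carries a norm (lowercased and deaccented), and the phrase is compared against those norms rather than the verbatim text. The names norm_token and find_phrase are illustrative only, not spaCy API.

```python
import re
from unicodedata import normalize

def norm_token(tok):
    # Hypothetical helper: lowercase + deaccent, playing the role of NORM.
    stripped = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+",
        r"\1",
        normalize("NFD", tok.lower()), 0, re.I,
    )
    return normalize("NFC", stripped)

def find_phrase(tokens, phrase):
    # Slide the phrase over the normalized token sequence;
    # return (start, end) token spans, like PhraseMatcher does.
    norms = [norm_token(t) for t in tokens]
    target = [norm_token(t) for t in phrase]
    n = len(target)
    return [(i, i + n) for i in range(len(norms) - n + 1)
            if norms[i:i + n] == target]

doc = "esto es un ejemplo FÁCIL".split()
print(find_phrase(doc, ["facil"]))  # [(4, 5)]
```

Because both sides are normalized before comparison, "FÁCIL" in the text matches the pattern "facil", just as in the spaCy example above.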