Question

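I am trying to tokenize text that contains some annotations so that each annotation is kept intact and treated as a single token. After adding a regular expression to match such tokens, however, the resulting tokens include special characters such as ?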

How can I get rid of the special characters in the tokens while preserving the tokens I need?

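For example, the phrase

Can I have a {@pet}?

is tokenized as

can = can - (VERB)
i = i - (PRON)
have = have - (VERB)
a = a - (DET)
{@pet}? (NOUN)

Note that my annotation token is kept intact, but the question mark is still attached to it, even though my regex does not match it.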

I set up my tokenizer like this:

import re

import spacy
from spacy.tokenizer import Tokenizer


def custom_tokenizer(nlp):
    # Annotations look like {@pet}; they should survive as single tokens.
    annotations = r"(\{\@[a-z]+\})"
    annotation_pattern = re.compile(annotations)
    # Reuse spaCy's default prefix, infix, and suffix rules.
    all_prefixes_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)
    infix_re = spacy.util.compile_infix_regex(nlp.Defaults.infixes)
    suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab, nlp.Defaults.tokenizer_exceptions,
                     prefix_search=all_prefixes_re.search,
                     infix_finditer=infix_re.finditer, suffix_search=suffix_re.search,
                     token_match=annotation_pattern.match)

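This is how I use it: I load the model and install the custom tokenizer with

nlp = spacy.load('en_core_web_lg', disable=['ner'])
nlp.tokenizer = custom_tokenizer(nlp)

I use the default suffix regex because the ? character is normally tokenized correctly.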
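A quick sanity check on the pattern itself, independent of spaCy, may help narrow this down. The sketch below uses only the standard `re` module; the helper names `is_annotation_prefix` and `is_exact_annotation` are hypothetical and exist only for illustration. It compares `match`, which anchors only at the start of the string, with `fullmatch`, which requires the whole string to match.

```python
import re

# Same pattern as in the question.
annotation_pattern = re.compile(r"(\{\@[a-z]+\})")


def is_annotation_prefix(text):
    # re.match anchors only at the start, so trailing characters are ignored.
    return annotation_pattern.match(text) is not None


def is_exact_annotation(text):
    # re.fullmatch requires the entire string to be the annotation.
    return annotation_pattern.fullmatch(text) is not None


print(is_annotation_prefix("{@pet}?"))  # True: the trailing ? does not prevent a match
print(is_exact_annotation("{@pet}?"))   # False
print(is_exact_annotation("{@pet}"))    # True
```

This suggests that the pattern, when used through `.match`, does accept `{@pet}?` as a whole. Whether passing `annotation_pattern.fullmatch` as `token_match` yields the desired split in a given spaCy version is an assumption that has not been verified here.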

Tokenizing with annotations (without special characters) in spaCy

0 answers: