Question

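I want to keep hyphenated words, e.g. long-term, self-esteem, etc., as single tokens in spaCy. After looking at some similar posts on Stack Overflow, GitHub, the spaCy documentation and elsewhere, I wrote a custom tokenizer as shown below.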

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

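So, for this sentence: 'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it's a male-dominated profession.'

the tokens after plugging in the custom spaCy tokenizer are: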

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“medicine', '”', 'has', ';', 'become', 'a', 'profession', ',', 'and', 'more', 'importantly', ',', "it's", 'a', 'male-dominated', 'profession', '.'

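Before this change, the tokens were:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'

And the expected tokens are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.'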

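As you can see, the hyphenated word is now kept together, and the other punctuation marks are handled correctly, except for the double quotes and the apostrophe. Those no longer behave as they did before, nor as expected. I have tried different permutations and combinations of the infix regex, but made no progress on this issue. Any help would be highly appreciated.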

Answer 1

Using the default prefix_re and suffix_re gives me the expected output:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut used originally was removed in spaCy v3
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

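['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']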

If you want to dig into why your regexes did not behave like spaCy's, here are links to the relevant source code:

Prefixes and suffixes are defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

They refer to character classes (e.g. quotes, hyphens, etc.) defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

And the functions used to compile them (e.g. compile_prefix_regex) are here:

https://github.com/explosion/spaCy/blob/master/spacy/util.py
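To see why the hand-written patterns from the question fall short without reading the spaCy source, a small standard-library check can help. The sketch below only tests which of the question's three regexes match a given whitespace-delimited chunk; it does not reproduce spaCy's tokenization algorithm, and the helper name `probe` is hypothetical, introduced here for illustration.

```python
import re

# Patterns copied verbatim from the question's custom tokenizer.
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')


def probe(chunk):
    """Report which patterns match one chunk (hypothetical helper)."""
    return {
        "prefix": bool(prefix_re.search(chunk)),
        "suffix": bool(suffix_re.search(chunk)),
        "infix_at": [m.start() for m in infix_re.finditer(chunk)],
    }


curly = probe('“medicine”')       # curly quotes, as in the example sentence
straight = probe('"medicine"')    # straight quotes, for comparison
clitic = probe("it's")
hyphen = probe('male-dominated')

for name, result in [("curly", curly), ("straight", straight),
                     ("clitic", clitic), ("hyphen", hyphen)]:
    print(name, result)
```

The prefix and suffix patterns list only straight quotes, so the curly quotes around “medicine” are reachable only through the infix pattern, and nothing in the suffix pattern covers the 's clitic. The hyphen matches none of the three patterns, which is why 'male-dominated' stays whole. Swapping in spaCy's default prefixes and suffixes, as the answer does, restores the quote and clitic handling while the custom infix pattern still leaves hyphens alone.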
