Why can't spaCy keep intra-word hyphens during tokenization the way Stanford CoreNLP does?

Asked: 2018-09-12 11:16:20

Tags: python-3.x nlp spacy

spaCy version: 2.0.11

Python version: 3.6.5

OS: Ubuntu 16.04

My example sentences:

Marketing-Representative- won't die in car accident.

Out-of-box implementation

Expected tokens:

["Marketing-Representative", "-", "wo", "n't", "die", "in", "car", "accident", "."]

["Out-of-box", "implementation"]

spaCy tokens (default tokenizer):

["Marketing", "-", "Representative-", "wo", "n't", "die", "in", "car", "accident", "."]

["Out", "-", "of", "-", "box", "implementation"]

I tried creating a custom tokenizer, but then I cannot handle all the edge cases that spaCy covers with tokenizer_exceptions (code below):

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')

# Keep the default prefix/suffix rules, but swap in a custom infix pattern
# that contains no hyphen, so intra-word hyphens are never split on.
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    # Note: constructing a bare Tokenizer like this drops the default
    # tokenizer exceptions (rules), which is why "won't" is not split into
    # "wo"/"n't" in the output below.
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Marketing-Representative- won't die in car accident.")
for token in doc:
    print(token.text)

Output:

Marketing-Representative-
won
'
t
die
in
car
accident
.

I need someone to point me toward the right approach.

Either changing the regexes above or using any other approach would be fine. I have even tried spaCy's rule-based Matcher, but I could not create a rule that handles hyphens joining more than two words, e.g. "out-of-box", so that a matcher could be built for use with span.merge().
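
For illustration, a merge-based post-processing step of the kind described above might look like the sketch below. This is only a sketch under stated assumptions: it relies on spaCy 2.0's doc.char_span() and the (since deprecated) span.merge(), and the HYPHENATED regex and merge_hyphenated helper are illustrative names, not anything built into spaCy.

import re
import spacy

nlp = spacy.load('en')

# Illustrative pattern for hyphenated word runs such as "Out-of-box" or
# "Marketing-Representative-" (optional trailing hyphen included so the
# match lines up with spaCy's default token boundaries).
HYPHENATED = re.compile(r"\w+(?:-\w+)+-?")

def merge_hyphenated(doc):
    # Find hyphenated runs in the raw text and merge the corresponding
    # spans back into single tokens.
    for match in HYPHENATED.finditer(doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None:  # None when the match does not align with token boundaries
            span.merge()
    return doc

doc = merge_hyphenated(nlp("Out-of-box implementation"))
print([token.text for token in doc])  # ['Out-of-box', 'implementation']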

Either way, I need words containing intra-word hyphens to end up as single tokens, the way Stanford CoreNLP handles them.

2 Answers:

Answer 0 (score: 2)

Although it isn't documented on the spaCy usage site, it looks like we just need to add a regex for the *fix we are working with, in this case the infix. It also appears that we can extend nlp.Defaults.prefixes with a custom regex:

infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")

This will give you the desired result. There is no need to set defaults for prefix and suffix, since we are not using them here.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')

# Trick: reuse the default prefix patterns as infixes. They contain no hyphen
# rule, so intra-word hyphens are left alone; a few extra patterns are appended
# for periods/slashes and for contractions (wo / n't).
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")

infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"

for s in (s1, s2):
    doc = nlp(s)
    print([token.text for token in doc])

Result:

$python3 /tmp/nlp.py  
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']  
['Out-of-box', 'implementation']  

You may need to tweak the added regexes to make them more robust against other kinds of tokens that come close to the patterns used here.
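
For example, one way to harden this a bit (a sketch under the same spaCy 2.x assumptions, not exhaustively tested) is to keep the default prefix and suffix rules in place and only customise the infixes, so edge punctuation is still handled by the standard rules while intra-word hyphens are preserved:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load('en')

# Default prefix/suffix handling stays in place, so punctuation at the start
# or end of a word is still split off by the standard rules.
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

# Only the infixes are customised; the hyphen is deliberately absent, so
# intra-word hyphens stay inside a single token.
infix_re = compile_infix_regex(nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)"))

nlp.tokenizer = Tokenizer(nlp.vocab,
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search,
                          infix_finditer=infix_re.finditer)

print([t.text for t in nlp("Marketing-Representative- won't die in car accident.")])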

Answer 1 (score: 0)

I also wanted to modify spaCy's tokenizer so that it matches CoreNLP's semantics more closely. Pasted below is what I came up with; it addresses the hyphen issues from this thread (including trailing hyphens) plus a few additional fixes. I had to copy the default infix expressions and modify them, but was able to simply append a new suffix expression:


import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS

def initializeTokenizer(nlp):

    prefixes = nlp.Defaults.prefixes 
    
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r'(?<=[0-9])[+\-\*^](?=[0-9-])',
            r'(?<=[{al}{q}])\.(?=[{au}{q}])'.format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            # REMOVE: commented out regex that splits on hyphens between letters:
            #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            # EDIT: remove split on slash between letters, and add comma
            #r'(?<=[{a}0-9])[:<>=/](?=[{a}])'.format(a=ALPHA),
            r'(?<=[{a}0-9])[:<>=,](?=[{a}])'.format(a=ALPHA),
            # ADD: ampersand as an infix character except for dual upper FOO&FOO variant
            r'(?<=[{a}0-9])[&](?=[{al}0-9])'.format(a=ALPHA, al=ALPHA_LOWER),
            r'(?<=[{al}0-9])[&](?=[{a}0-9])'.format(a=ALPHA, al=ALPHA_LOWER),
        ]
    )

    # ADD: add suffix to split on trailing hyphen
    custom_suffixes = [r'[-]']
    suffixes = nlp.Defaults.suffixes
    suffixes = tuple(list(suffixes) + custom_suffixes)

    infix_re = spacy.util.compile_infix_regex(infixes)
    suffix_re = spacy.util.compile_suffix_regex(suffixes)

    nlp.tokenizer.suffix_search = suffix_re.search
    nlp.tokenizer.infix_finditer = infix_re.finditer
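
For completeness, a possible usage sketch building on the function above (the model name follows the question's setup, and the expected tokens are an assumption based on the question's examples rather than verified output):

import spacy

nlp = spacy.load('en')    # or e.g. 'en_core_web_sm', depending on the installed models
initializeTokenizer(nlp)  # modifies nlp.tokenizer's suffix_search / infix_finditer in place

for s in ("Marketing-Representative- won't die in car accident.",
          "Out-of-box implementation"):
    print([token.text for token in nlp(s)])

# With the letter-hyphen infix removed and the trailing-hyphen suffix added,
# this should come out roughly as the question expects:
# ['Marketing-Representative', '-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
# ['Out-of-box', 'implementation']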