Question

我正在尝试适当地分割单词以适合我的语料库。为此，我已经找到了以下内容：

Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

修复连字符的单词时，我似乎无法弄清楚的是如何使单词带有撇号以进行收缩，例如：不能，不会，不，他是等一起作为一种象征。更具体地说，我正在搜索如何针对荷兰语单词：zo'n，auto，massa等，但是这个问题应该与语言无关。

我有以下标记器：

def custom_tokenizer(nlp):
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\'\`\“\”\"\'~]''')

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)
nlp = spacy.load('nl_core_news_sm')
nlp.tokenizer = custom_tokenizer(nlp)

与此相关的令牌是：

'Mijn'，'eigen'，'huis'，'staat'，'zo'，“'，'n'，'zes'，'meter'，'onder'， 'het'，'wateroppervlak'，'van'，'de'，'Noordzee'，'。'

令牌应为：

'Mijn'，'eigen'，'huis'，'staat'，“ zo'n” ，'zes'，'meter'，'onder'，'het'，'wateroppervlak '，'van'，'de'，'Noordzee'，'。'

我知道可以添加如下自定义规则：

case = [{ORTH: "zo"}, {ORTH: "'n", LEMMA: "een"}]
tokenizer.add_special_case("zo'n",case)

但是我正在寻找更通用的解决方案。

我尝试从另一个线程编辑infix_re正则表达式，但似乎对该问题没有任何影响。我可以做些设置或更改来解决此问题吗？任何帮助将不胜感激。

Answer 1

spaCy中有一项最新的工作正在进行中，用于为荷兰语修复这类词汇形式。今日的请求请求中的更多信息：https://github.com/explosion/spaCy/pull/3409

更具体地说，nl/punctuation.py（https://github.com/explosion/spaCy/pull/3409/files#diff-84f02ed25ff9e44641672ca0ba5c1839）显示了如何通过更改后缀来解决此问题：

spacy标记化撇号

1 个答案: