Question

I'm using spaCy to build a custom tokenizer. I want an API name to be treated as a single token, for example srcs[offset].remaining() in this sentence:

Up to the first srcs[offset].remaining() bytes of this sequence are written from buffer srcs[offset].

I added a token_match to the tokenizer, but the suffix rules override it.

The code is shown below:

import re
import spacy
from spacy.tokenizer import Tokenizer

def customize_tokenizer_api_name_recognition(nlp):
    # add api_name regex, such as aaa.bbb(cccc) or aaa[bbb].ccc(ddd)
    # raw string so that \w, \[ and \( reach the regex engine unchanged
    api_name_match = re.compile(r"(\w+(\[[\w+-]+\])?\.)+\w+\(.*?\)", re.UNICODE).match
    nlp.tokenizer.token_match = api_name_match

if __name__ == '__main__':
    nlp = spacy.load('en_core_web_sm')
    customize_tokenizer_api_name_recognition(nlp)
    sentence = "Up to the first srcs[offset].remaining() bytes of this sequence are written from buffer srcs[offset]."
    doc = nlp(sentence)
    print([token.text for token in doc])
    # output:  ['Up', 'to', 'the', 'first', 'srcs[offset].remaining', '(', ')', 'bytes', 'of', 'this', 'sequence', 'are', 'written', 'from', 'buffer', 'srcs[offset', ']', '.']
    # expected output: ['Up', 'to', 'the', 'first', 'srcs[offset].remaining()', 'bytes', 'of', 'this', 'sequence', 'are', 'written', 'from', 'buffer', 'srcs[offset', ']', '.']
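
The output above is consistent with the closing punctuation being split off as suffixes before token_match is consulted. A minimal sketch using only `re`, under that assumption, shows why the pattern can no longer match once `(` and `)` are gone. The `rstrip` call is a hypothetical stand-in for suffix splitting, not what spaCy does internally:

```python
import re

# Same pattern as in the question, written as a raw string.
api_name_match = re.compile(r"(\w+(\[[\w+-]+\])?\.)+\w+\(.*?\)").match

whole = "srcs[offset].remaining()"
# Hypothetical stand-in for suffix splitting: drop the trailing "(" and ")".
stripped = whole.rstrip("()")

matches_whole = bool(api_name_match(whole))
matches_stripped = bool(api_name_match(stripped))

print(stripped)          # srcs[offset].remaining
print(matches_whole)     # True
print(matches_stripped)  # False
```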

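I have seen the related issues #4573 and #4645, and the docs. However, all of the examples there use simple regexes and are solved by removing some patterns from prefix_search. What about complex regexes? How can this kind of problem be solved?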

P.S. I used to do this with doc.retokenize(). Is there a better way to solve it by customizing the tokenizer?

Environment

Operating system: Win 10
Python version used: Python 3.7.3
spaCy version used: spaCy 2.2.3

Answer 1

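The tokenizer solution would be the same as in #4573, except with customized suffixes. It's not a great solution, though, because ) would then no longer be a suffix anywhere.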

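I think the solution with doc.retokenize() is better for v2.2. You can do it in a small custom pipeline component that you add at the beginning of the pipeline.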
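
As a rough sketch of that approach (not code from the original thread), the function below finds the API-name pattern in doc.text and merges each match that lines up with token boundaries. The names API_NAME_RE and merge_api_names are made up for illustration, and spacy.blank("en") is used only so the example does not need a downloaded model:

```python
import re
import spacy

# Same pattern as in the question, written as a raw string.
API_NAME_RE = re.compile(r"(\w+(\[[\w+-]+\])?\.)+\w+\(.*?\)")

def merge_api_names(doc):
    """Merge each regex match that aligns with token boundaries into one token."""
    spans = []
    for m in API_NAME_RE.finditer(doc.text):
        # char_span returns None when the offsets do not fall on token boundaries.
        span = doc.char_span(m.start(), m.end())
        if span is not None and len(span) > 1:
            spans.append(span)
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

# Assumption: a blank English pipeline, i.e. only the default tokenizer.
nlp = spacy.blank("en")
sentence = ("Up to the first srcs[offset].remaining() bytes of this sequence "
            "are written from buffer srcs[offset].")
doc = merge_api_names(nlp(sentence))
tokens = [token.text for token in doc]
print(tokens)
```

Here the function is called directly on the Doc. In spaCy 2.x it could instead be registered with nlp.add_pipe(merge_api_names, first=True); in 3.x a component has to be declared with the @Language.component decorator and added by name.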

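(It would be nice to have two token_match options, one applied with prefixes and suffixes and one without, but it would also be nice not to make the tokenizer options much more complicated than they are now.)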

How to customize the tokenizer with a complex token_match?
