Question

I have a document

doc = nlp('x-xxmessage-id:')

When I extract the tokens from this document, I get 'x', 'xx', 'message', 'id' and ':'. So far, so good. Then I create a new document:

test_doc = nlp('id')

If I extract the tokens of test_doc, I get 'i' and 'd' instead. Is there any way around this? I want 'id' to come out as the same single token as above, and the inconsistency is causing problems in my text processing.

Answer 1

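Just like language itself, tokenization is context-dependent, and the language-specific data defines the rules that tell spaCy how to split text based on the surrounding characters. spaCy's defaults are also optimized for general-purpose text, such as news text, web text and other modern writing.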

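In your example, you've come across an interesting case: the abstract string "x-xxmessage-id:" is split on punctuation, while the isolated lowercase string "id" is split into "i" and "d", because in written text it is most commonly an alternate spelling of "I'd" or "i'd" ("I could", "I would", etc.). You can find the corresponding rules in the English tokenizer exceptions.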
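To check which rule is responsible on your own installation, the tokenizer can report how it arrived at each token. This is a minimal sketch, assuming spaCy 2.2.3 or later (where `Tokenizer.explain` is available) and a blank English pipeline rather than whichever model you loaded as `nlp`:

```python
import spacy

# Blank English pipeline: tokenizer rules only, no trained model required.
nlp = spacy.blank("en")

# explain() returns (rule_name, token_text) pairs for the given string,
# so you can see whether a special case or an affix rule produced each token.
explained = nlp.tokenizer.explain("id")
for rule, text in explained:
    print(rule, repr(text))
```

The rule names and the resulting split depend on the spaCy version, so treat the printed output as a diagnostic rather than a fixed result.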

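If you're dealing with specific texts that are substantially different from regular natural language text, you usually want to customise the tokenization rules, or possibly even add a Language subclass for your own custom "dialect". If there's a fixed number of cases that need to be tokenized differently and that can be expressed with rules, another option is to add a component to your pipeline that merges the split tokens back together.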
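Both options can be sketched roughly as follows. This is an illustration under assumptions, not the only way to do it: it uses a blank English pipeline, the helper name `merge_split_id` is made up for this example, and the second document is built by hand so the merge step can be shown regardless of how your spaCy version tokenizes "id".

```python
import spacy
from spacy.symbols import ORTH
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Option 1: add a special case so the exact string "id" stays one token.
nlp.tokenizer.add_special_case("id", [{ORTH: "id"}])
special_case_tokens = [t.text for t in nlp("id")]


# Option 2: merge "i" + "d" back together after tokenization.
# merge_split_id is a hypothetical helper, not part of spaCy.
def merge_split_id(doc):
    spans = []
    i = 0
    while i < len(doc) - 1:
        first, second = doc[i], doc[i + 1]
        # Only merge when the two tokens are adjacent with no space between.
        if first.lower_ == "i" and second.lower_ == "d" and not first.whitespace_:
            spans.append(doc[i : i + 2])
            i += 2
        else:
            i += 1
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc


# Hand-built Doc that mimics the unwanted split.
split_doc = Doc(nlp.vocab, words=["i", "d"], spaces=[False, False])
merged_tokens = [t.text for t in merge_split_id(split_doc)]

print(special_case_tokens, merged_tokens)
```

Special cases match exact strings, so variants such as "Id" or "ID" would need entries of their own. To run the merge step automatically, the function would be added to the pipeline; how a component is registered differs between spaCy v2 and v3, so that part is left out here.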

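Finally, you could also try the language-independent xx / MultiLanguage class. It still includes very basic tokenization rules, like splitting on punctuation, but none of the rules specific to the English language.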
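A minimal sketch of that last option, assuming a blank pipeline is enough for your use case (no tagger, parser or other trained components). Since the English contraction rules are not loaded, "id" should come back as a single token:

```python
import spacy

# "xx" is the language code for the MultiLanguage class.
nlp_xx = spacy.blank("xx")

xx_tokens = [t.text for t in nlp_xx("id")]
print(xx_tokens)

# The longer string is still split by the generic punctuation rules;
# the exact split may vary with the spaCy version.
print([t.text for t in nlp_xx("x-xxmessage-id:")])
```

The trade-off is that you also lose the English-specific handling you may want elsewhere, such as splitting "don't" into "do" and "n't".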


