Question

I want to use spaCy to tokenize Wikipedia scrapes. Ideally it would work like this:

text = 'procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them.[3][4]'

# run spacy
import spacy
spacy_en = spacy.load("en")
doc = spacy_en(text, disable=['tagger', 'ner'])
tokens = [tok.text.lower() for tok in doc]

# desired output
# tokens = [..., 'models', 'or', 'hypotheses', '.', '[2][3]', 'Researchers', ...]

# actual output
# tokens = [..., 'models', 'or', 'hypotheses.[2][3', ']', 'Researchers', ...]

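The problem is that 'hypotheses.[2][3]' gets glued together into a single token.

How can I prevent spaCy from attaching the '[2][3]' to the preceding token? As long as it is split off from the word "hypotheses" and from the period that ends the sentence, I don't care how it is handled. But individual words and grammar should stay clear of syntactic noise.

So, for example, any of the following would be acceptable output: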

'hypotheses', '.', '[2][', '3]'
'hypotheses', '.', '[2', '][3]'

Answer 1

I think you could try playing around with the infix rules:

import re
import spacy
from spacy.tokenizer import Tokenizer

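# Treat a literal "." as an infix, i.e. a split point inside a token
infix_re = re.compile(r'''[.]''')

def custom_tokenizer(nlp):
    # Only the infix rule is passed, so no prefix or suffix rules are applied
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp = spacy.load('en')  # spaCy 2.x shortcut; spaCy 3 needs a full name such as 'en_core_web_sm'
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world! I am hypothesis.[2][3]")
print([t.text for t in doc])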
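
Note that the tokenizer above is built from scratch, so as far as I can tell it no longer applies spaCy's default prefix, suffix, and exception rules (the "!" in "hello-world!" would stay attached, for instance). If you want to keep the defaults and only add the extra split, something along these lines might work. This is a minimal sketch, not a tested recipe for every spaCy version: it assumes `spacy.blank("en")` is an acceptable stand-in for your loaded model, and the extra pattern (a period directly followed by "[") is my own guess at what fits the Wikipedia citation markers.

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")

# Keep spaCy's default infix patterns and add one more:
# a period that is immediately followed by an opening square bracket.
infixes = list(nlp.Defaults.infixes) + [r"\.(?=\[)"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("competing models or hypotheses.[2][3] Researchers also use experimentation")
tokens = [tok.text for tok in doc]
print(tokens)
```

With this, "hypotheses" and the period should come out as separate tokens, and the citation markers end up in their own token(s) after the period. How the brackets themselves get split still depends on the default prefix and suffix rules.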

More on this: https://spacy.io/usage/linguistic-features#native-tokenizers
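
If changing the tokenizer feels too invasive, another option is to pad the citation markers with spaces before the text ever reaches spaCy, since you said you don't mind how the markers themselves are handled. A rough sketch using only the standard library; the name `pad_citations` and the pattern are hypothetical and assume the markers always look like `[2]` or `[2][3]` (digits in square brackets):

```python
import re

# Assumption: citation markers are runs of "[<digits>]" such as "[2]" or "[2][3]".
CITATION_RE = re.compile(r"(?:\[\d+\])+")

def pad_citations(text):
    """Put spaces around citation runs so they are no longer glued to a word."""
    padded = CITATION_RE.sub(lambda m: " " + m.group(0) + " ", text)
    # Collapse the double spaces this can create and trim the ends.
    return re.sub(r" {2,}", " ", padded).strip()

print(pad_citations("models or hypotheses.[2][3] Researchers also use"))
# prints: models or hypotheses. [2][3] Researchers also use
```

The trade-off is that character offsets no longer line up with the original text, so this only fits if you don't need to map tokens back to the raw scrape.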
