Question

I am using nltk.sent_tokenize for my task, but I don't know how to add a new delimiter, such as ':' or '%', to the sentence-splitting conditions.

For example:

from nltk import sent_tokenize

sample = '\nNatural language processing\nFrom Wikipedia, the free encyclopedia. aaa.    Abc: He is bad boy: Tomato is it healty? Unnnn Not so tasty! Dont you think so?\n'

sample_token = sent_tokenize(sample)

sample_token

# result

['\nNatural language processing\nFrom Wikipedia, the free encyclopedia.',
 'aaa.',
 'Abc: He is bad boy: Tomato is it healty?',
 'Unnnn Not so tasty!',
 'Dont you think so?']

# what I want 

['\nNatural language processing\nFrom Wikipedia, the free encyclopedia.',
 'aaa.',
 'Abc: ',
 'He is bad boy: Tomato is it healty?',
 'Unnnn Not so tasty!',
 'Dont you think so?']

Sorry for the odd sample sentences. I want to add (':' delimiter + blank + uppercase letter) as a split trigger for nltk.sent_tokenize.

Please tell me how to add it! Thanks!
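One possible workaround, offered as a minimal sketch rather than an NLTK feature: leave sent_tokenize unchanged and post-process its output with a regular expression that encodes the rule (':' + whitespace + uppercase letter). The helper name split_on_colon and the pattern COLON_BOUNDARY are hypothetical, introduced only for illustration. The list below is the sent_tokenize result copied from the question, so the snippet runs without NLTK; in practice it would be replaced by sent_tokenize(sample). Note that this sketch drops the whitespace after the colon and, applying the rule literally, also splits at "boy: Tomato".

```python
import re

# Hypothetical pattern (not part of NLTK): whitespace that follows a colon
# and precedes an ASCII uppercase letter is treated as a sentence boundary.
COLON_BOUNDARY = re.compile(r"(?<=:)\s+(?=[A-Z])")


def split_on_colon(sentences):
    """Split each already-tokenized sentence further at ': ' + uppercase."""
    result = []
    for sentence in sentences:
        # The matched whitespace is consumed; the colon stays with the left part.
        result.extend(COLON_BOUNDARY.split(sentence))
    return result


# Result of sent_tokenize(sample), copied from the question.
sentences = [
    '\nNatural language processing\nFrom Wikipedia, the free encyclopedia.',
    'aaa.',
    'Abc: He is bad boy: Tomato is it healty?',
    'Unnnn Not so tasty!',
    'Dont you think so?',
]

for piece in split_on_colon(sentences):
    print(repr(piece))
```

To split only at the first qualifying colon in each sentence, which is closer to the desired output above, COLON_BOUNDARY.split(sentence, maxsplit=1) could be used instead.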

How do I add a ':' delimiter to nltk.sent_tokenize?

0 answers: