I need to split some text passages into sentences, and I'm using the NLTK sentence tokenizer for this. The passages are all lowercase and generally of poor quality, which makes the task harder. Occasional mistakes are acceptable, though, as long as the general rules of the language are respected; for example, I expect a split after a period. A passage may contain many individual sentences as well as abbreviations and the like.
How can I make NLTK ignore capitalization and split the text below into 2 sentences, between "2006." and "though"?
from nltk.tokenize import sent_tokenize
print(sent_tokenize('no drop in its quality as it got nearer to its end, in 2006. though i didn\'t like the movie much.'))
>> ["no drop in its quality as it got nearer to its end, in 2006. though i didn't like the movie much."]