I need to split some text passages into sentences, and I'm using the NLTK sentence tokenizer for this. The passages are all lowercase and generally of poor quality, which makes the task harder. Occasional mistakes are acceptable, though, as long as the general rules of the language are respected; for example, I expect a split after a period. A passage may contain many individual sentences as well as abbreviations and the like.
How can I make NLTK ignore capitalization and split the text below into 2 sentences, between "2006." and "though"?
from nltk.tokenize import sent_tokenize
print(sent_tokenize('no drop in its quality as it got nearer to its end, in 2006. though i didn\'t like the movie much.'))
>> ["no drop in its quality as it got nearer to its end, in 2006. though i didn't like the movie much."]