I've noticed that NLTK's sent_tokenize goes wrong on certain dates. Is there any way to adjust it so that it correctly tokenizes the following:
valid any day after january 1. not valid on federal holidays, including february 14,
or with other in-house events, specials, or happy hour.
Currently, running sent_tokenize results in:
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
but it should result in:
['valid any day after january 1.', 'not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
because the period after 'january 1' is a legitimate sentence-terminating character.
Answer 0: (score: 3)
First, the sent_tokenize function uses the Punkt tokenizer, which was trained on well-formed English sentences. So supplying the correct capitalization solves your problem:
>>> from nltk import sent_tokenize
>>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s)
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
>>>
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
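If you cannot fix the casing of the input, a crude stdlib-only workaround (my own sketch, not part of the original answer or of NLTK) is to split on any period followed by whitespace. Be aware this will also wrongly split after abbreviations such as i.e. and e.g.:

```python
import re

def naive_sent_split(text):
    # Split wherever a period is immediately followed by whitespace.
    # Crude heuristic: also fires after abbreviations like "i.e." or "e.g.".
    return re.split(r'(?<=\.)\s+', text)

s = ('valid any day after january 1. not valid on federal holidays, '
     'including february 14, or with other in-house events, specials, or happy hour.')
naive_sent_split(s)
# ['valid any day after january 1.',
#  'not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
```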
Now, let's dig deeper. The Punkt tokenizer is an implementation of the algorithm of Kiss and Strunk (2005); see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py for the implementation.
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
So in the case of sent_tokenize, I'm pretty sure it was trained on a well-formed English corpus, where capitalization after a period is a strong indicator of a sentence boundary. The period itself is not enough, since we have abbreviations like i.e. and e.g. In some cases, the corpus might also contain things like 01. put pasta in pot \n02. fill the pot with water. With such sentences/documents in the training data, the algorithm very likely learns that a period followed by a non-capitalized word is not a sentence boundary.
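To make that heuristic concrete, here is a toy splitter (my own illustration, not Punkt's actual code) that only treats a period as a boundary when the next word is capitalized, much like what Punkt learns from well-cased training text. It reproduces exactly the behaviour you observed:

```python
import re

def toy_split(text):
    # Treat ". " as a sentence boundary only when the next token starts
    # with an uppercase letter -- mimicking what Punkt learns from
    # well-cased training data. (Toy illustration, not Punkt itself.)
    return re.split(r'(?<=\.)\s+(?=[A-Z])', text)

toy_split('valid any day after january 1. not valid on federal holidays.')
# no split: the word after the period is lowercase
toy_split('Valid any day after january 1. Not valid on federal holidays.')
# splits into two sentences
```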
So to resolve this, I suggest the following:
sent_tokenize
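One way to adapt Punkt to lowercase text like this (a sketch under the assumption that you have similar domain text to train on; not necessarily the author's intended suggestion, and the training snippet below is hypothetical) is to train a fresh PunktTrainer and build a tokenizer from its parameters:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical domain text: lowercase sentences resembling the input
# we want to split. A real training corpus should be much larger.
train_text = (
    "valid any day after january 1. not valid on federal holidays. "
    "offer expires march 3. see store for details."
)

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # learn collocations aggressively
trainer.train(train_text)

# Build a tokenizer from the learned parameters (no pickled model needed).
tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = tokenizer.tokenize(
    "valid any day after january 1. not valid on federal holidays."
)
print(sentences)
```

With enough in-domain training text, the learned parameters can override the "lowercase after a period means no boundary" bias of the pretrained English model; on a corpus this tiny, results will vary.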