NLTK sentence tokenizer incorrect

Time: 2014-12-02 07:04:02

Tags: nltk

I've noticed that the NLTK sent_tokenizer makes mistakes with some dates. Is there any way to tune it so that it correctly tokenizes the following:

valid any day after january 1. not valid on federal holidays, including february 14,
or with other in-house events, specials, or happy hour.

Currently, running sent_tokenize results in:

['valid any day after january 1. not valid on federal holidays, including february 14, 
 or with other in-house events, specials, or happy hour.']

But it should result in:

['valid any day after january 1.', 'not valid on federal holidays, including february 14, 
  or with other in-house events, specials, or happy hour.']

since the period after 'january 1' is a legitimate sentence-terminating character.

1 answer:

Answer 0 (score: 3):

Firstly, the sent_tokenize function uses the punkt tokenizer, which was trained to tokenize well-formed English sentences. So your problem can be resolved simply by including the correct capitalization:

>>> from nltk import sent_tokenize
>>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s)
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
>>>
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']

Now let's dig deeper. The Punkt tokenizer implements the algorithm of Kiss and Strunk (2005); see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py for the implementation. Its docstring says:

  This tokenizer divides a text into a list of sentences by using an
  unsupervised algorithm to build a model for abbreviation words,
  collocations, and words that start sentences. It must be trained on
  a large collection of plaintext in the target language before it
  can be used.
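To make the unsupervised training step above concrete, here is a minimal sketch, assuming nltk is installed. The training text here is a tiny made-up stand-in; a real model needs a large plaintext corpus to learn abbreviations and sentence starters reliably.

```python
# Minimal sketch of training a Punkt model on raw text (toy data only).
from nltk.tokenize.punkt import PunktSentenceTokenizer

# A tiny stand-in for a "large collection of plaintext" in the target
# language; real training needs far more data than this.
train_text = (
    "The offer is valid after jan. 1 every year. "
    "It is not valid on holidays. "
    "Dr. Smith confirmed the schedule. "
    "Please call before 5 p.m. on weekdays."
)

# Training happens in the constructor: the abbreviation, collocation,
# and sentence-starter model is learned directly from the raw text.
tokenizer = PunktSentenceTokenizer(train_text)

sentences = tokenizer.tokenize(
    "Valid any day after january 1. Not valid on federal holidays."
)
print(sentences)
```

With a realistically sized, domain-matched corpus, the learned model replaces the default English one that sent_tokenize loads.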

So in the case of sent_tokenize, I'm pretty sure it was trained on a well-formed English corpus, hence the fact that capitalization after a fullstop is a strong indication of a sentence boundary, whereas a fullstop by itself may not be, since we have things like i.e. and e.g.

In some cases, the corpus might have contained things like 01. put pasta in pot \n02. fill the pot with water. With such sentences/documents in the training data, the algorithm is very likely to conclude that a fullstop followed by an uncapitalized word is not a sentence boundary.

So to resolve the problem, I suggest the following:

  1. Manually segment 10-20% of your sentences and retrain a corpus-specific tokenizer
  2. Convert your corpus into well-formed orthography before using sent_tokenize
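For suggestion 2, a rough pre-processing sketch is shown below. The capitalization rule is a naive assumption of my own (uppercase the first letter and any lowercase letter following a period plus whitespace), not part of NLTK; the normalized string can then be passed to sent_tokenize.

```python
# Naive orthography normalization before sentence tokenization.
# Assumption: every period followed by whitespace ends a sentence,
# which is exactly what holds in the asker's example text.
import re

def capitalize_after_period(text):
    """Uppercase the first letter of the text and of each word that
    follows a period plus whitespace, approximating well-formed
    English sentences for the punkt tokenizer."""
    if text:
        text = text[0].upper() + text[1:]
    return re.sub(
        r"(\.\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

s = ("valid any day after january 1. not valid on federal holidays, "
     "including february 14, or with other in-house events, specials, "
     "or happy hour.")
normalized = capitalize_after_period(s)
print(normalized)
# normalized now starts with "Valid" and has "1. Not valid ...",
# which sent_tokenize handles as shown in the answer above.
```

Note this heuristic will wrongly capitalize after genuine abbreviations (e.g. "jan. 1" would become "jan. 1" only because digits are untouched, but "etc. and" would become "etc. And"), so it is only suitable for text where periods reliably end sentences.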

    See also: training data format for nltk punkt