NLTK sentence tokenizer incorrect

Time: 2014-12-02 07:04:02

Tags: nltk

I've noticed that the NLTK sent_tokenizer makes mistakes with some dates. Is there any way to tune it so that it correctly tokenizes the following:

valid any day after january 1. not valid on federal holidays, including february 14,
or with other in-house events, specials, or happy hour.

Currently, running sent_tokenize results in:

['valid any day after january 1. not valid on federal holidays, including february 14, 
 or with other in-house events, specials, or happy hour.']

But it should result in:

['valid any day after january 1.', 'not valid on federal holidays, including february 14, 
  or with other in-house events, specials, or happy hour.']

since the period after 'january 1' is a legitimate sentence-terminating character.

1 answer:

Answer 0 (score: 3):

Firstly, the sent_tokenize function uses the punkt tokenizer, which was trained to tokenize well-formed English sentences. So your problem can be resolved simply by including the correct capitalization:

>>> from nltk import sent_tokenize
>>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s)
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
>>>
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']

Now let's dig deeper. The Punkt tokenizer implements the algorithm of Kiss and Strunk (2005); see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py for the implementation. Its docstring says:

  This tokenizer divides a text into a list of sentences by using an
  unsupervised algorithm to build a model for abbreviation words,
  collocations, and words that start sentences. It must be trained on
  a large collection of plaintext in the target language before it
  can be used.
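To make the unsupervised training step above concrete, here is a minimal sketch, assuming nltk is installed. The training text here is a tiny made-up stand-in; a real model needs a large plaintext corpus to learn abbreviations and sentence starters reliably.

```python
# Minimal sketch of training a Punkt model on raw text (toy data only).
from nltk.tokenize.punkt import PunktSentenceTokenizer

# A tiny stand-in for a "large collection of plaintext" in the target
# language; real training needs far more data than this.
train_text = (
    "The offer is valid after jan. 1 every year. "
    "It is not valid on holidays. "
    "Dr. Smith confirmed the schedule. "
    "Please call before 5 p.m. on weekdays."
)

# Training happens in the constructor: the abbreviation, collocation,
# and sentence-starter model is learned directly from the raw text.
tokenizer = PunktSentenceTokenizer(train_text)

sentences = tokenizer.tokenize(
    "Valid any day after january 1. Not valid on federal holidays."
)
print(sentences)
```

With a realistically sized, domain-matched corpus, the learned model replaces the default English one that sent_tokenize loads.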

So in the case of sent_tokenize, I'm pretty sure it was trained on a well-formed English corpus, hence the fact that capitalization after a fullstop is a strong indication of a sentence boundary, whereas a fullstop by itself may not be, since we have things like i.e. and e.g.

In some cases, the corpus might have contained things like 01. put pasta in pot \n02. fill the pot with water. With such sentences/documents in the training data, the algorithm is very likely to conclude that a fullstop followed by an uncapitalized word is not a sentence boundary.

So to resolve the problem, I suggest the following:

  1. Manually segment 10-20% of your sentences and retrain a corpus-specific tokenizer
  2. Convert your corpus into well-formed orthography before using sent_tokenize
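For suggestion 2, a rough pre-processing sketch is shown below. The capitalization rule is a naive assumption of my own (uppercase the first letter and any lowercase letter following a period plus whitespace), not part of NLTK; the normalized string can then be passed to sent_tokenize.

```python
# Naive orthography normalization before sentence tokenization.
# Assumption: every period followed by whitespace ends a sentence,
# which is exactly what holds in the asker's example text.
import re

def capitalize_after_period(text):
    """Uppercase the first letter of the text and of each word that
    follows a period plus whitespace, approximating well-formed
    English sentences for the punkt tokenizer."""
    if text:
        text = text[0].upper() + text[1:]
    return re.sub(
        r"(\.\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

s = ("valid any day after january 1. not valid on federal holidays, "
     "including february 14, or with other in-house events, specials, "
     "or happy hour.")
normalized = capitalize_after_period(s)
print(normalized)
# normalized now starts with "Valid" and has "1. Not valid ...",
# which sent_tokenize handles as shown in the answer above.
```

Note this heuristic will wrongly capitalize after genuine abbreviations (e.g. "jan. 1" would become "jan. 1" only because digits are untouched, but "etc. and" would become "etc. And"), so it is only suitable for text where periods reliably end sentences.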

    See also: training data format for nltk punkt