Question

Given a string:

c = 'A problem. She said: "I don\'t know about it."'

Trying to tokenize it into sentences:

>>> from nltk.tokenize import sent_tokenize
>>> for sindex, sentence in enumerate(sent_tokenize(c)):
...     print(str(sindex) + ": " + sentence)
...
0: A problem.
1: She said: "I don't know about it.
2: "
>>>

Why does NLTK put the closing quote of the second sentence into a third sentence of its own? Is there a way to correct this behavior?

Answer 1

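Instead of the default sent_tokenize, what you need is the realign_boundaries option that is already built into the pre-trained Punkt sentence tokenizer.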

>>> import nltk
>>> st2 = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sent = 'A problem. She said: "I don\'t know about it."'
>>> st2.tokenize(sent, realign_boundaries=True)
['A problem.', 'She said: "I don\'t know about it."']

See the section "6 Punkt Tokenizer" in http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
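
If you are stuck with output that is already split the wrong way, the same idea can be approximated as a post-processing step. The following is only a minimal sketch: `merge_orphan_closers` and `CLOSERS` are hypothetical names made up for this illustration, not part of NLTK, and this is not how Punkt implements boundary realignment internally.

```python
# Hypothetical helper, not part of NLTK: re-attach fragments that consist
# only of closing punctuation to the sentence that precedes them.
CLOSERS = set("\"')]}\u201d\u2019")  # assumed set of closing characters


def merge_orphan_closers(sentences):
    merged = []
    for fragment in sentences:
        stripped = fragment.strip()
        # A fragment is an "orphan" if it is non-empty and made only of closers
        is_orphan = bool(stripped) and all(ch in CLOSERS for ch in stripped)
        if is_orphan and merged:
            merged[-1] += stripped
        else:
            merged.append(fragment)
    return merged


split = ['A problem.', 'She said: "I don\'t know about it.', '"']
print(merge_orphan_closers(split))
# ['A problem.', 'She said: "I don\'t know about it."']
```

A fragment with nothing before it is left alone, so a list that starts with a stray quote is returned unchanged.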

Answer 2

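The default sentence tokenizer is PunktSentenceTokenizer, which detects a new sentence each time it finds a period, except when, for example, the period belongs to an abbreviation such as U.S.A.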

The NLTK documentation has some examples of how to train a new sentence splitter on a different corpus. You can find it here.

So I guess your problem cannot be solved with the default sentence tokenizer; you would have to train a new one and try it.
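
For reference, training a new Punkt tokenizer could look roughly like the sketch below. It assumes NLTK is installed; `training_text` is a tiny placeholder invented for this example, and a real corpus would need to be far larger. Whether a retrained model handles the closing quote any differently is something you would have to check on your own data.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Placeholder training text (an assumption for this sketch);
# real training needs a much larger corpus from your domain.
training_text = (
    'He paused. She said: "I don\'t know about it." Then they left. '
    'Mr. Smith arrived late. "It is fine." Nobody minded.'
)

# Passing raw text to the constructor trains the Punkt parameters on it
tokenizer = PunktSentenceTokenizer(training_text)

sample = 'A problem. She said: "I don\'t know about it."'
sentences = tokenizer.tokenize(sample)
for index, sentence in enumerate(sentences):
    print(str(index) + ": " + sentence)
```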

Why does NLTK mis-tokenize the quote at the end of a sentence?
