NLTK Sentence Tokenizer: punctuation inside double quotes

Date: 2017-08-02 21:56:50

Tags: python nltk text-mining sentence

NLTK's PunktSentenceTokenizer does not correctly find the end of these sentences.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
# abbrev_types is a set, so use update() to add several abbreviations at once;
# add() would try to insert the whole (unhashable) list as a single element.
punkt_param.abbrev_types.update(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'rev'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
sentence_splitter.tokenize(u'In that paper, "Has Financial Development Made the World Riskier?", Rajan "argued that disaster might loom." ')

Output:

[u'In that paper, "Has Financial Development Made the World Riskier?"',
 u', Rajan "argued that disaster might loom."']
Another example:

sentence_splitter.tokenize(u'Don "Don C." Crowley')

Output:

[u'Don "Don C."', u'Crowley']

Neither input should be split into two sentences. Is there a way to fix this?
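One possible workaround (not from the original post, just a sketch) is to post-process the tokenizer's output: in the first example, the spurious second fragment begins with a comma, which can never start a real sentence, so such fragments can be glued back onto the previous one. The function below operates on a plain list of fragments, shown here with the fragments reported in the question:

```python
def merge_fragments(fragments):
    """Merge fragments that clearly continue the previous sentence."""
    merged = []
    for frag in fragments:
        # A fragment opening with a comma cannot start a real sentence,
        # so append it (without a space) to the previous fragment.
        if merged and frag.startswith(','):
            merged[-1] += frag
        else:
            merged.append(frag)
    return merged

# Fragments as reported by PunktSentenceTokenizer in the question:
fragments = ['In that paper, "Has Financial Development Made the World Riskier?"',
             ', Rajan "argued that disaster might loom."']
print(merge_fragments(fragments))
# -> ['In that paper, "Has Financial Development Made the World Riskier?", Rajan "argued that disaster might loom."']
```

Note that this heuristic does not cover the second example (`Don "Don C." Crowley`), where the stray fragment starts with a capitalized word; that case would need a different rule, e.g. one that tracks unbalanced quotes.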

0 answers:

There are no answers yet.