Tokenizing text with dialogue into sentences using NLTK

Time: 2017-09-30 04:00:23

Tags: python nltk

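I can tokenize non-dialogue text into sentences, but when I add quotation marks around the sentences, the NLTK tokenizer no longer splits them correctly. For example, this works as expected: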

import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)

This produces a list of three separate sentences:

['Is this one sentence?', 'This is separate.', 'This is a third he said.']

However, if I turn the same text into dialogue, the same process stops working.

text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)

This returns the whole text as a single sentence:

['“Is this one sentence?” “This is separate.” “This is a third” he said.']

How can I get the NLTK tokenizer to work in this case?

1 answer:

Answer 0: (score: 1)

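It seems the tokenizer does not know how to handle directional (curly) quotes. Replace them with regular ASCII double quotes and the example works fine.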

>>> import re
>>> import nltk
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']
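
The same idea can be wrapped in a small helper so the normalization happens before every tokenizer call. This is only a sketch: the names QUOTE_MAP, normalize_quotes and split_dialogue are made up for illustration, and mapping the curly single quotes as well is an extra assumption that goes beyond what the answer above does.

```python
# Hypothetical helper names; only nltk.sent_tokenize comes from NLTK itself.
# Map typographic ("curly") quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark (assumption: not in the answer)
    "\u2019": "'",  # right single quotation mark (assumption: not in the answer)
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by plain ASCII quotes."""
    return text.translate(QUOTE_MAP)


def split_dialogue(text):
    """Normalize quotes, then split the text into sentences with NLTK."""
    # Imported here so normalize_quotes stays usable without NLTK installed.
    import nltk
    return nltk.sent_tokenize(normalize_quotes(text))
```

With NLTK and its Punkt data installed, split_dialogue(text2) should behave like the nltk.sent_tokenize(text3) call shown in the answer, since both hand the tokenizer the same ASCII-quoted string.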