I am new to Python and NLTK. I want to tokenize a string and add some strings to the split list in NLTK. I used the code from the post How to tweak the NLTK sentence tokenizer. Here is the code I wrote:
from nltk.tokenize import sent_tokenize
extra_abbreviations = ['\n']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
sent_tokenize_list = sentence_tokenizer(document)
sent_tokenize_list
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      4 sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
      5
----> 6 sent_tokenize_list = sentence_tokenizer(document)
      7 sent_tokenize_list
TypeError: 'PunktSentenceTokenizer' object is not callable
How can I fix this problem?
Answer 0 (score: 0)
This makes your example work:
import nltk
from nltk.tokenize import sent_tokenize
extra_abbreviations = ['\n']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
document = """This is my test doc. It has two sentences; however, one of which with interesting punctuation."""
sent_tokenize_list = sentence_tokenizer.tokenize(document)
print(sent_tokenize_list)
Your error occurs because sentence_tokenizer is an object, not a function, so it cannot be called directly. You must call the tokenize method on that object.
You can learn more about how objects work in the python docs.