I am new to Python and NLTK. I want to tokenize a string and add some strings to the split list in NLTK. I used the code from the post How to tweak the NLTK sentence tokenizer. Here is the code I wrote:
from nltk.tokenize import sent_tokenize
extra_abbreviations = ['\n']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
sent_tokenize_list = sentence_tokenizer(document)
sent_tokenize_list
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      4 sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
      5
----> 6 sent_tokenize_list = sentence_tokenizer(document)
      7 sent_tokenize_list
TypeError: 'PunktSentenceTokenizer' object is not callable
How can I fix this problem?
Answer 0 (score: 0)
This makes your example work:
import nltk
from nltk.tokenize import sent_tokenize
extra_abbreviations = ['\n']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
document = """This is my test doc. It has two sentences; however, one of which with interesting punctuation."""
sent_tokenize_list = sentence_tokenizer.tokenize(document)
print(sent_tokenize_list)
Your error occurs because sentence_tokenizer is an object, not a function, so it cannot be called directly. You must call the tokenize method on that object.
You can learn more about how objects work in the python docs.