I am trying to read a text file into Python and then run a sentence segmenter, a word tokenizer, and a part-of-speech tagger on it.
This is my code:
file=open('C:/temp/1.txt','r')
sentences = nltk.sent_tokenize(file)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]
When I run the second command, it fails with this error:
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
sentences = nltk.sent_tokenize(file)
File "D:\Python\lib\site-packages\nltk\tokenize\__init__.py", line 76, in sent_tokenize
return tokenizer.tokenize(text)
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1217, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1262, in sentences_from_text
sents = [text[sl] for sl in self._slices_from_text(text)]
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1269, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
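Another attempt: when I try just a single sentence, such as "A yellow dog barked at the cat.", the first three commands work, but on the last line I get this error (I wonder whether I failed to download the packages completely?):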
Traceback (most recent call last):
File "<pyshell#16>", line 1, in <module>
sentences = [nltk.pos_tag(sent) for sent in sentences]
File "D:\Python\lib\site-packages\nltk\tag\__init__.py", line 99, in pos_tag
tagger = load(_POS_TAGGER)
File "D:\Python\lib\site-packages\nltk\data.py", line 605, in load
resource_val = pickle.load(_open(resource_url))
ImportError: No module named numpy.core.multiarray
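Answer 0 (score: 2)
Hmm... are you sure the error is on the second line?
You appear to be using quote and comma characters other than the standard ASCII ' and , characters: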
file=open(‘C:/temp/1.txt’,‘r’) # your version (WRONG)
file=open('C:/temp/1.txt', 'r') # right
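Python shouldn't even be able to compile that. Indeed, when I try it, it barfs with a syntax error.
UPDATE: You have since posted a corrected version with valid syntax. The error message in the traceback is fairly straightforward: the function you are calling expects a chunk of text (a string) as its argument, not a file object. I know nothing about NLTK, but five seconds on Google confirms this.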
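The last frame of the traceback is a regex `finditer` call, which is why a file object gets rejected. The sketch below reproduces the same kind of failure using only the standard library. The pattern `sentence_end` and the temporary file are assumptions made for illustration; this is not NLTK's actual regex or code path.

```python
import os
import re
import tempfile

# Stand-in for the tokenizer's internal pattern; NOT NLTK's real regex.
sentence_end = re.compile(r"[.!?]")

# Write a small temporary file so the sketch does not depend on C:/temp/1.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("A yellow dog barked at the cat. The cat ran away.")
    path = tmp.name

try:
    # Passing the file object itself: the regex engine refuses it.
    with open(path, "r") as f:
        try:
            list(sentence_end.finditer(f))
            failed_on_file_object = False
        except TypeError as exc:
            failed_on_file_object = True
            print("file object ->", exc)

    # Passing the file's contents (a str): accepted.
    with open(path, "r") as f:
        text = f.read()
    matches = [m.group() for m in sentence_end.finditer(text)]
    print("string ->", matches)
finally:
    os.remove(path)
```

The exact wording of the TypeError message differs between Python versions, but the cause is the same: the regex needs the text, not the handle it came from.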
Try something like this:
file = open('C:/temp/1.txt','r')
text = file.read() # read the contents of the text file into a variable
result1 = nltk.sent_tokenize(text)
result2 = [nltk.word_tokenize(sent) for sent in result1]
result3 = [nltk.pos_tag(sent) for sent in result2]
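UPDATE: I renamed sentences to result1/2/3, because repeatedly overwriting the same variable made it hard to see what the code was actually doing. This does not change the semantics; it just makes clear that the second line really does affect the final result3.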
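The second traceback in the question looks like a separate problem. The ImportError is raised while the tagger is being unpickled, and it names numpy.core.multiarray, which suggests that NumPy is missing or broken in that Python installation rather than that the NLTK download is incomplete. A minimal way to check, assuming nothing beyond the standard library (`can_import` is a hypothetical helper name, not part of NLTK or NumPy):

```python
import importlib


def can_import(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


# The module named in the traceback, plus its parent package.
for name in ("numpy", "numpy.core.multiarray"):
    status = "importable" if can_import(name) else "NOT importable"
    print(name, "->", status)
```

If either name is reported as not importable, installing or reinstalling NumPy for the same interpreter that runs NLTK is the likely next step.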
Answer 1 (score: 0)
First open the file, then read it:
filename = 'C:/temp/1.txt'
infile = open(filename, 'r')
text = infile.read()
Then chain the NLTK tools together:
from nltk import pos_tag, word_tokenize, sent_tokenize
tagged_words = [pos_tag(word_tokenize(i)) for i in sent_tokenize(text)]