NLTK Python word_tokenize

Date: 2018-03-25 12:29:18

Tags: python nltk text-mining

I loaded a txt file containing 6,000 lines of sentences. I tried to word_tokenize the sentences, but I received the following error:

Traceback (most recent call last):
  File "final.py", line 52, in <module>
    short_pos_words = word_tokenize(short_pos)
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 95, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 313, in _pair_iter
    for el in it:
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 312, in _pair_iter
    prev = next(it)
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 581, in _annotate_first_pass
    for aug_tok in tokens:
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 546, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
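For context, a minimal sketch of the failing pattern on Python 2 (the file path is a placeholder; the variable names come from the traceback):

from nltk.tokenize import word_tokenize

# On Python 2, open().read() returns a byte str, not unicode.
short_pos = open('/path/to/txt/file').read()
# Passing a byte str containing non-ASCII bytes (e.g. 0xc3) makes NLTK decode
# it implicitly with the default 'ascii' codec, raising UnicodeDecodeError.
short_pos_words = word_tokenize(short_pos)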


1 Answer:

Answer 0 (score: 0)

The problem is related to the encoding of the file contents. Assuming you want to decode the str as UTF-8 unicode, there are two options:

Option 1 (a Python 2-only hack; setdefaultencoding does not exist in Python 3):

import sys
reload(sys)                      # site.py removes setdefaultencoding at startup; reload restores it
sys.setdefaultencoding('utf8')   # implicit str-to-unicode conversions now use UTF-8 instead of ASCII
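Note that changing the interpreter-wide default encoding is discouraged: setdefaultencoding is deliberately deleted from sys by site.py at startup (which is why the reload is needed), and overriding it can mask encoding bugs elsewhere in the program. Prefer Option 2 where possible.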

Option 2:
When opening the text file, pass the encoding parameter to the open function. Note that on Python 2.7 (which the traceback shows), the built-in open does not accept an encoding parameter, so use io.open, which works on both Python 2 and 3:

import io  # needed on Python 2, where built-in open has no encoding parameter
f = io.open('/path/to/txt/file', 'r+', encoding="utf-8")
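Putting this together with the code from the question, a minimal sketch of the fixed flow (the file path is a placeholder; short_pos and the word_tokenize call come from the traceback):

import io
from nltk.tokenize import word_tokenize  # requires the 'punkt' data: nltk.download('punkt')

# io.open decodes while reading, so short_pos is unicode, not a byte str.
with io.open('/path/to/txt/file', encoding='utf-8') as f:
    short_pos = f.read()

short_pos_words = word_tokenize(short_pos)  # unicode input, no implicit ASCII decoding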