Two machines, both running Ubuntu 14.04.1. The same source code is run on the same data. One works fine, and one throws a codec decode 0xe2 error. Why is this? (And more importantly: how do I fix it?)
The offending code appears to be:
def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized = ''
    for sentence in sent_tokenize(self):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'
    return Text(tokenized)
OK... I went into interactive mode and imported sent_tokenize from nltk.tokenize on both machines. The machine that works is happy with the following:
>>> fh = open('in/train/legal/legal1a_lm_7.txt')
>>> foo = fh.read()
>>> fh.close()
>>> sent_tokenize(foo)
The UnicodeDecodeError on the problem machine gives the following traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
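This is Python 2's standard implicit-decode failure: somewhere inside punkt.py, a plain byte string containing the non-ASCII byte 0xe2 is being pushed through the default ascii codec. The same error reproduces in isolation (a minimal illustration, not specific to NLTK):

>>> '\xe2\x80\x9c'.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)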
Breaking the input file apart line by line (via split('\n')) and running each line through sent_tokenize brings us to the offending line:
If you have purchased these Services directly from Cisco Systems, Inc. (“Cisco”), this document is incorporated into your Master Services Agreement or equivalent services agreement (“MSA”) executed between you and Cisco.
which is actually:
>>> bar[5]
'If you have purchased these Services directly from Cisco Systems, Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\xe2\x80\x9cMSA\xe2\x80\x9d) executed between you and Cisco.'
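Those \xe2\x80\x9c / \xe2\x80\x9d sequences are the UTF-8 encodings of the curly quotation marks U+201C and U+201D, so the raw bytes decode cleanly as UTF-8:

>>> bar[5].decode('utf-8')
u'If you have purchased these Services directly from Cisco Systems, Inc. (\u201cCisco\u201d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\u201cMSA\u201d) executed between you and Cisco.'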
Update: both machines show a UnicodeDecodeError for:
unicode(bar[5])
but only one machine shows the error for:
sent_tokenize(bar[5])
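(The first failure is expected everywhere: unicode() with no encoding argument falls back to the ascii codec. Supplying the encoding explicitly works on both machines:

unicode(bar[5], 'utf-8'))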
Answer 0 (score: 0)
Different NLTK versions!
The version that doesn't barf is running NLTK 2.0.4; the version throwing the exception is running 3.0.0.
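You can confirm which version each machine has from the interpreter:

>>> import nltk
>>> nltk.__version__   # '3.0.0' on the failing machine, '2.0.4' on the working one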
NLTK 2.0.4 is perfectly happy with:
sent_tokenize('(\xe2\x80\x9cCisco\xe2\x80\x9d)')
NLTK 3.0.0 requires unicode (as @tdelaney pointed out in the comments above). So, to get a result, you need:
sent_tokenize(u'(\u201cCisco\u201d)')
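More generally, the fix is to decode the input to unicode as soon as it is read. A minimal sketch, assuming the input files are UTF-8 (codecs.open yields unicode strings in Python 2):

import codecs
from nltk.tokenize import sent_tokenize

fh = codecs.open('in/train/legal/legal1a_lm_7.txt', encoding='utf-8')
foo = fh.read()        # a unicode object now, not raw bytes
fh.close()
sent_tokenize(foo)     # should work under NLTK 3.0.0 (and 2.0.4)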