UnicodeDecodeError in the TextBlob tutorial

Date: 2013-09-24 16:28:41

Tags: python nltk textblob

I'm trying to run the TextBlob tutorial on Windows with Python 3.3 (using the Git Bash shell).

I have installed textblob and nltk, along with all of their dependencies.

The Python code is:

from text.blob import TextBlob

wiki = TextBlob("Python is a high-level, general-purpose programming language.")
tags = wiki.tags

I get the following error:

Traceback (most recent call last):
  File "textblob.py", line 4, in <module>
    tags = wiki.tags
  File "c:\Python33\lib\site-packages\text\decorators.py", line 18, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "c:\Python33\lib\site-packages\text\blob.py", line 357, in pos_tags
    for word, t in self.pos_tagger.tag(self.raw)
  File "c:\Python33\lib\site-packages\text\taggers.py", line 40, in tag
    return pattern_tag(sentence, tokenize)
  File "c:\Python33\lib\site-packages\text\en.py", line 115, in tag
    for sentence in parse(s, tokenize, True, False, False, False, encoding).split():
  File "c:\Python33\lib\site-packages\text\en.py", line 99, in parse
    return parser.parse(unicode(s), *args, **kwargs)
  File "c:\Python33\lib\site-packages\text\text.py", line 1213, in parse
    s[i] = self.find_tags(s[i], **kwargs)
  File "c:\Python33\lib\site-packages\text\en.py", line 49, in find_tags
    return _Parser.find_tags(self, tokens, **kwargs)
  File "c:\Python33\lib\site-packages\text\text.py", line 1161, in find_tags
    map = kwargs.get(     "map", None))
  File "c:\Python33\lib\site-packages\text\text.py", line 967, in find_tags
    tagged.append([token, lexicon.get(token, i==0 and lexicon.get(token.lower()) or   None)])
  File "c:\Python33\lib\site-packages\text\text.py", line 98, in get
    return self._lazy("get", *args)
  File "c:\Python33\lib\site-packages\text\text.py", line 79, in _lazy
    self.load()
  File "c:\Python33\lib\site-packages\text\text.py", line 367, in load
    dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
  File "c:\Python33\lib\site-packages\text\text.py", line 367, in <genexpr>
    dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
  File "c:\Python33\lib\site-packages\text\text.py", line 346, in _read
    for line in f:
  File "c:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined>

Any idea what's wrong here? Adding a 'u' prefix to the string didn't help.

1 Answer:

Answer 0 (score: 3)

Version 0.7.1 fixes this, so it's time to upgrade:

$ pip install -U textblob

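The problem was that en-lexicon.txt, the file used for part-of-speech tagging, was being opened with cp1252, the default platform encoding on Windows. The file evidently contains bytes that Python cannot decode with that encoding; the traceback shows it failing on byte 0x9d, which cp1252 leaves undefined. The fix was to open the file explicitly as utf-8. This also explains why the 'u' prefix made no difference: the error comes from reading the library's data file, not from the string passed to TextBlob.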
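
For illustration only, here is a minimal sketch of the same failure and fix outside of TextBlob. It is not the library's actual code: the file name and the sample line are made up, and the right double quotation mark (U+201D) is used simply because its UTF-8 encoding ends in the byte 0x9d, which may or may not be the character that tripped up the real lexicon file.

```python
import os
import tempfile

# Hypothetical stand-in for one lexicon line. U+201D encodes to the
# bytes E2 80 9D in UTF-8, so the file will contain the byte 0x9d.
sample = "\u201d NN\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lexicon-sample.txt")  # made-up file name

    # Write the sample as UTF-8, like the real data file.
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading with cp1252 (what open() picks by default on many Windows
    # setups) fails, because 0x9d has no mapping in that code page.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Reading with an explicit utf-8 encoding works on every platform.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

print("cp1252 failed:", cp1252_failed)
print("utf-8 result:", repr(decoded))
```

The general takeaway is to pass encoding= explicitly whenever you open a text file whose encoding you know, rather than relying on the platform default.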