UnicodeDecodeError when parsing tweets with StanfordParser

Time: 2014-10-29 14:20:54

Tags: python unicode nltk stanford-nlp

I am trying to parse a set of tweets using StanfordParser with the caseless English model (englishPCFG.caseless.ser.gz) mentioned in the FAQ: http://nlp.stanford.edu/software/parser-faq.shtml#ca. Calling the raw_parse method raises the following error:

import nltk
from nltk.parse.stanford import StanfordParser

# Point the parser at the caseless English PCFG model.
parser = StanfordParser(
    path_to_jar="stanford-parser.jar",
    path_to_models_jar="stanford-corenlp-caseless-2014-02-25-models.jar",
    model_path="edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz",
    encoding='utf-8',
)

tweet = 'Good News™: The weather is going to be awesome today for some ultimate.'
# Python 2: decode the UTF-8 byte string into a unicode object before parsing.
tweet_unicode = unicode(tweet, 'UTF-8')
parser.raw_parse(tweet_unicode)

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-21-71163f9030ad> in <module>()
----> 1 parser.raw_parse_sents(sentence)

/Library/Python/2.7/site-packages/nltk/parse/stanford.pyc in raw_parse_sents(self, sentences, verbose)
    174             '-outputFormat', 'penn',
    175         ]
--> 176         return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
    177 
    178     def tagged_parse(self, sentence, verbose=False):

/Library/Python/2.7/site-packages/nltk/parse/stanford.pyc in _execute(self, cmd, input_, verbose)
    235             stdout, stderr = java(cmd, classpath=(self._stanford_jar, self._model_jar),
    236                                   stdout=PIPE, stderr=PIPE)
--> 237             stdout = stdout.decode(encoding)
    238 
    239         os.unlink(input_file.name)

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xaa in position 56: invalid start byte

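The method works fine for most tweets that contain special characters, but in this particular case it fails because of the trademark character (™). Any pointers on how to resolve this? Looking at the source of the parser module, it seems the error occurs when reading from the temporary file created by the parser.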
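One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the failing call is `stdout.decode(encoding)`, so the bytes being decoded are the ones the Java process wrote back, and 0xaa happens to be how Mac OS Roman encodes ™. That would fit a JVM writing its output in a platform default encoding other than UTF-8. The sketch below only reproduces the byte-level mismatch in plain Python; `decode_with_fallback` is a hypothetical helper for illustration and is not part of NLTK.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

TRADEMARK = u'\u2122'  # the trademark sign from the tweet

# UTF-8 needs three bytes for the trademark sign;
# Mac OS Roman stores it as the single byte 0xaa.
utf8_bytes = TRADEMARK.encode('utf-8')
mac_bytes = TRADEMARK.encode('mac_roman')


def decode_with_fallback(data, encodings=('utf-8', 'mac_roman')):
    """Hypothetical helper: try each encoding in turn, return the first success."""
    for name in encodings:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return data.decode('utf-8', 'replace')


if __name__ == '__main__':
    try:
        mac_bytes.decode('utf-8')  # same kind of failure as in the traceback
    except UnicodeDecodeError as exc:
        print('strict UTF-8 decode fails:', exc)
    # The fallback recovers the original character from the 0xaa byte.
    print(decode_with_fallback(mac_bytes) == TRADEMARK)
```

If this assumption holds, the underlying fix would be to make the Java side emit UTF-8. A fallback decoder like this only hides the mismatch, and it can silently mis-decode bytes that come from some other encoding.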

And, more generally, how can I make sure that most English characters are handled correctly?
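One way to sidestep the problem, sketched here as a workaround rather than a fix, is to fold the tweet to plain ASCII before handing it to the parser. The helper name `to_ascii` is hypothetical. It relies on Unicode NFKD normalization, which decomposes compatibility characters such as ™ into their ASCII equivalents where they exist, and then drops whatever is still non-ASCII.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(text):
    """Hypothetical helper: fold a unicode string to ASCII, lossy by design."""
    # NFKD splits compatibility characters (e.g. the trademark sign into "TM")
    # and separates accents from their base letters.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop anything that still has no ASCII representation.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


tweet = u'Good News\u2122: The weather is going to be awesome today for some ultimate.'
ascii_tweet = to_ascii(tweet)
```

This loses information: emoji and non-Latin letters disappear entirely, so it only suits cases where the parse of the English text matters more than the exact characters.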

0 answers:

No answers