I am trying to parse a set of tweets using the StanfordParser and the caseless English model (englishPCFG.caseless.ser.gz) mentioned in the FAQ: http://nlp.stanford.edu/software/parser-faq.shtml#ca. I run into the following error when calling the raw_parse method:
import nltk
from nltk.parse.stanford import StanfordParser

parser = StanfordParser(
    path_to_jar="stanford-parser.jar",
    path_to_models_jar="stanford-corenlp-caseless-2014-02-25-models.jar",
    model_path="edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz",
    encoding='utf-8',
)

tweet = 'Good News™: The weather is going to be awesome today for some ultimate.'
tweet_unicode = unicode(tweet, 'UTF-8')
parser.raw_parse(tweet_unicode)
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-21-71163f9030ad> in <module>()
----> 1 parser.raw_parse_sents(sentence)
/Library/Python/2.7/site-packages/nltk/parse/stanford.pyc in raw_parse_sents(self, sentences, verbose)
174 '-outputFormat', 'penn',
175 ]
--> 176 return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
177
178 def tagged_parse(self, sentence, verbose=False):
/Library/Python/2.7/site-packages/nltk/parse/stanford.pyc in _execute(self, cmd, input_, verbose)
235 stdout, stderr = java(cmd, classpath=(self._stanford_jar, self._model_jar),
236 stdout=PIPE, stderr=PIPE)
--> 237 stdout = stdout.decode(encoding)
238
239 os.unlink(input_file.name)
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xaa in position 56: invalid start byte
The method works fine for most tweets containing special characters, but in this particular case it fails because of the trademark sign. Any pointers on how to fix this? Looking at the source of the parser file, the error seems to occur while reading back the temporary output produced by the parser.
And, more generally, how can I make sure that most characters found in English text are handled correctly?
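One workaround I have been considering (just a sketch, and it assumes the failure is caused by the parser re-emitting the non-ASCII ™ in some non-UTF-8 encoding): normalize each tweet to ASCII before handing it to the parser, using NFKD compatibility decomposition so that characters like ™ become their ASCII equivalents ("TM") instead of being lost. This avoids the decode step ever seeing a problematic byte, at the cost of discarding characters with no ASCII decomposition:

```python
# -*- coding: utf-8 -*-
import unicodedata

def to_ascii(text):
    """Map compatibility characters (e.g. the trademark sign) to ASCII
    equivalents where possible, then drop whatever remains non-ASCII."""
    # NFKD decomposes u'\u2122' (TM SIGN) into 'TM', and accented letters
    # into base letter + combining mark; the combining marks are then
    # discarded by the 'ignore' error handler.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

tweet = u'Good News\u2122: The weather is going to be awesome today.'
print(to_ascii(tweet))  # Good NewsTM: The weather is going to be awesome today.
```

This is lossy, though, so I would still prefer a fix that keeps the original characters intact end to end.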