Question

I am trying to read a UTF-8 encoded XML file in Python, and I do some processing on the lines read from the file, like this:

next_sent_separator_index =  doc_content.find(word_value, int(characterOffsetEnd_value) + 1)

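Here doc_content is the line read from the file, and word_value is one of the strings in that same line. Whenever doc_content or word_value contains Unicode characters, I get an encoding-related error on the line above. So I tried decoding both with utf-8 first (instead of the default ascii encoding), like this: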

next_sent_separator_index =  doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)

But I still get an encoding error (reported as a UnicodeEncodeError), as shown below:

Traceback (most recent call last):
  File "snippetRetriver.py", line 402, in <module>
    sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
  File "snippetRetriver.py", line 201, in getSentenceList
    next_sent_separator_index =  doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)

Can anyone suggest a suitable approach to avoid this kind of encoding error in Python 2.7?

Answer 1

codecs.utf_8_decode(input.encode('utf8'))
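
Some background on why the traceback says UnicodeEncodeError even though the code calls decode: in Python 2, calling .decode('utf-8') on a value that is already a unicode object makes Python first encode it implicitly with the default ascii codec, and that step fails on a character such as u'\xe9'. So at least one of doc_content and word_value is probably already unicode. A minimal sketch of one way around this is to decode only when the value is a byte string. The helper name to_text and the sample strings below are made up for illustration and are not from the original script:

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return text, decoding only real byte strings.

    In Python 2.7, `bytes` is an alias for `str`, so unicode objects
    pass through untouched and never hit the implicit ascii encode.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Illustrative data (assumed): one value is raw UTF-8 bytes, the other
# is already unicode, which is the mix that triggers the error.
doc_content = u"caf\u00e9 au lait, caf\u00e9 noir".encode("utf-8")
word_value = u"caf\u00e9"
characterOffsetEnd_value = "4"

# Same call as in the question, with both operands normalized first.
next_sent_separator_index = to_text(doc_content).find(
    to_text(word_value), int(characterOffsetEnd_value) + 1
)
print(next_sent_separator_index)
```

Another option is to open the file with io.open(path, encoding='utf-8'), which is available in Python 2.7 and yields unicode lines from the start, so no later decode calls are needed. Note that offsets computed on decoded text count characters, not bytes, so characterOffsetEnd_value has to be a character offset for the search start to line up.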

