Python ElementTree decoding error

Time: 2013-07-10 19:45:04

Tags: python xml encoding

I am trying to serialize a tree to text using ElementTree's tostring function:

tostring(root, encoding='UTF-8')

I get a UnicodeDecodeError (traceback below) because one of the Element.text nodes contains the u'\u2014' character. I set the text attribute as follows:

my_str = u'\u2014'
el.text = my_str.encode('UTF-8')

How can I successfully serialize the tree to text? Am I encoding the node incorrectly? Thanks.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "crisis_app/converters/to_xml.py", line 129, in convert
    return tostring(root, encoding='UTF-8')
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 821, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 938, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1074, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 288: ordinal not in range(128)

1 answer:

Answer 0 (score: 2)

If you do this:

my_str = u'\u2014'
el.text = my_str.encode('UTF-8')

you set the text to the UTF-8 encoded form of the Unicode character. That is the same as writing:

el.text = '\xe2\x80\x94'

Now you no longer have a Unicode character, but a sequence of bytes.
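
To make the difference concrete, here is a minimal sketch comparing the Unicode string with its encoded form. The variable name `encoded` is only illustrative, and the snippet is written so that it should behave the same on Python 2.7 and Python 3:

```python
my_str = u'\u2014'                 # one Unicode code point (EM DASH)
encoded = my_str.encode('utf-8')   # the same character as UTF-8 bytes

# One character on the Unicode side, three bytes on the encoded side.
char_count = len(my_str)
byte_count = len(encoded)

# The encoded value is exactly the byte string from the answer.
same_as_literal = (encoded == b'\xe2\x80\x94')
```

Assigning `encoded` to `el.text` therefore hands ElementTree raw bytes, not text.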

If you then do this:

tostring(root, encoding='UTF-8')

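you are asking for the content to be encoded as UTF-8. To do that, Python 2 first implicitly decodes the byte string to unicode using the default encoding (ascii) and only then encodes it to UTF-8. The decode step fails, of course, because the bytes in the string are outside the ASCII range.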
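
The implicit step can be reproduced by hand. The sketch below performs the ASCII decode explicitly, which is what Python 2 does behind the scenes when `.encode()` is called on a byte string; the names `raw` and `decode_error` are illustrative only:

```python
raw = u'\u2014'.encode('utf-8')   # the bytes that ended up in el.text

try:
    # Explicit version of the hidden decode: 0xe2 is not valid ASCII.
    raw.decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# The same bytes decode cleanly once the correct codec is named.
text = raw.decode('utf-8')
```

This is why the traceback reports a decode error even though the failing call in `_escape_cdata` is `text.encode(...)`.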

ElementTree is perfectly capable of handling unicode, so just give it unicode instead of str:

>>> from xml.etree import ElementTree as et
>>> e = et.Element('test')
>>> e.text = u'\u2014'

>>> s = et.tostring(e)
>>> print s, repr(s)
<test>&#8212;</test> '<test>&#8212;</test>'

>>> s = et.tostring(e, encoding='utf-8')
>>> print s, repr(s)
<test>—</test> '<test>\xe2\x80\x94</test>'