python xml.etree.ElementTree tostring()/fromstring() round trip fails for Unicode code points

Date: 2014-07-31 00:50:26

Tags: python xml unicode elementtree

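I'm using xml.etree.ElementTree in Python 2.7 and have run into a problem round-tripping strings. If the tree contains non-ASCII Unicode characters, calling ET.fromstring() on the output of ET.tostring() fails.

Why doesn't this work? Given that ElementTree takes a byte stream and does its own decoding, why does it default to an ASCII parser? Is this determined by something I'm overlooking, such as the encoding of the Python file or the locale?

  1. ASCII-only characters work: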

    import xml.etree.ElementTree as ET
    
    t1 = ET.Element('test')
    t1.text = u'hello world'
    t1_roundtrip = ET.fromstring(ET.tostring(t1, encoding='utf8', method='xml'))
    # ET.dump(t1) == ET.dump(t1_roundtrip)
    
  2. A non-ASCII Unicode code point fails:

    import xml.etree.ElementTree as ET
    
    t2 = ET.Element('test')
    t2.text = u'\u2603'
    t2_roundtrip = ET.fromstring(ET.tostring(t2, encoding='utf8', method='xml'))
    
    >>> t2_roundtrip = ET.fromstring(ET.tostring(t2, encoding='utf8', method='xml'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/rh/python27/root/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1300, in XML
        parser.feed(text)
      File "/opt/rh/python27/root/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1642, in feed
        self._raiseerror(v)
      File "/opt/rh/python27/root/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
        raise err
    xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 6
    

2 answers:

Answer 0 (score: 2)

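You have specified an illegal encoding. Quoting the ElementTree doc:

"The encoding string included in XML output should conform to the appropriate standards. For example, "UTF-8" is valid, but "UTF8" is not. See http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EncodingDecl and http://www.iana.org/assignments/character-sets."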
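
As a rough illustration of the point above, here is a minimal sketch of the same round trip with the encoding spelled 'utf-8' instead of 'utf8'. The helper name `roundtrip` is hypothetical and only there for the example; the snippet is written so that it should run on Python 2.7 and Python 3 alike.

```python
import xml.etree.ElementTree as ET


def roundtrip(element, encoding='utf-8'):
    """Serialize an element to bytes and parse it back (hypothetical helper)."""
    data = ET.tostring(element, encoding=encoding, method='xml')
    return ET.fromstring(data)


# Same element as in the question: a single snowman character as text.
snowman = ET.Element('test')
snowman.text = u'\u2603'

# With the standard spelling 'utf-8' the text survives the round trip.
parsed = roundtrip(snowman)
assert parsed.text == u'\u2603'
```

Only the spelling of the encoding name changes relative to the failing call in the question.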

  

Answer 1 (score: 0)

Found two workarounds:

  1. Omit the encoding from tostring():

    import xml.etree.ElementTree as ET
    t3 = ET.Element('test')
    t3.text = u'\u2603'
    t3_roundtrip = ET.fromstring(ET.tostring(t3, method='xml'))
    
  2. Pass an XMLParser with the utf-8 encoding specified:

    import xml.etree.ElementTree as ET
    t4 = ET.Element('test')
    t4.text = u'\u2603'
    t4_roundtrip_utf = ET.fromstring(
        ET.tostring(t4, encoding='utf8', method='xml'),
        parser=ET.XMLParser(encoding='utf-8'))
    
  3. But why do I need to do this? Aren't XML files utf-8 unless otherwise specified?
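
One way to see why the first workaround gets through is to look at the serialized bytes. This is only a sketch, and the variable names are made up for the example. With no encoding argument, tostring() falls back to us-ascii and writes non-ASCII text as a numeric character reference, so the parser never meets a multi-byte sequence. With 'utf-8' the raw UTF-8 bytes are written instead, and both forms parse back to the same text.

```python
import xml.etree.ElementTree as ET

t = ET.Element('test')
t.text = u'\u2603'

# Default serialization (us-ascii): the snowman becomes a character reference.
default_bytes = ET.tostring(t, method='xml')
assert b'&#9731;' in default_bytes

# Explicit 'utf-8': the snowman is written as its raw UTF-8 byte sequence.
utf8_bytes = ET.tostring(t, encoding='utf-8', method='xml')
assert u'\u2603'.encode('utf-8') in utf8_bytes

# Both serializations parse back to the original text.
from_default = ET.fromstring(default_bytes)
from_utf8 = ET.fromstring(utf8_bytes)
assert from_default.text == from_utf8.text == u'\u2603'
```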