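I'm using xml.etree.ElementTree in Python 2.7 and have run into a problem round-tripping strings. If the tree contains non-ASCII Unicode characters, calling ET.fromstring() on the output of ET.tostring() fails.
Why doesn't this work? Given that ElementTree takes a byte stream and does its own decoding, why does it default to an ASCII parser? Is this determined by something I'm overlooking, such as the encoding of the Python file or the locale?
ASCII-only characters work: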
import xml.etree.ElementTree as ET
t1 = ET.Element('test')
t1.text = u'hello world'
t1_roundtrip = ET.fromstring(ET.tostring(t1, encoding='utf8', method='xml'))
# ET.dump(t1) == ET.dump(t1_roundtrip)
A non-ASCII Unicode code point fails:
import xml.etree.ElementTree as ET
t2 = ET.Element('test')
t2.text = u'\u2603'
t2_roundtrip = ET.fromstring(ET.tostring(t2, encoding='utf8', method='xml'))
>>> t2_roundtrip = ET.fromstring(ET.tostring(t2, encoding='utf8', method='xml'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/rh/python27/root/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "/opt/rh/python27/root/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/opt/rh/python27/root/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 6
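To see what the parser is actually being handed, it can help to look at the serialized bytes instead of parsing them straight away. The sketch below reuses t2 from the question; the names data, declaration and body are only illustrative. It shows that the output is an XML declaration naming the encoding as utf8, followed on line 2 by the element, and that offset 6 on that line is the first byte of the snowman, which matches the position reported in the ParseError.

```python
import xml.etree.ElementTree as ET

t2 = ET.Element('test')
t2.text = u'\u2603'

# Serialize exactly as in the failing call, but inspect the bytes
# instead of feeding them to the parser.
data = ET.tostring(t2, encoding='utf8', method='xml')
print(repr(data))

# The first line is the XML declaration, the second is the element itself.
declaration, body = data.split(b'\n', 1)

# Offset 6 on line 2 is right after b'<test>', i.e. the first byte of the
# UTF-8 encoded snowman.
first_non_ascii = body[6:9]
```

This suggests the text itself is encoded correctly and that the trouble lies in how the declared encoding name is interpreted when the bytes are parsed again.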
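Answer 0 (score: 2)
You have specified an illegal encoding. Quoting the ElementTree documentation:
The encoding string included in XML output should conform to the appropriate standards. For example, "UTF-8" is valid, but "UTF8" is not. See http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EncodingDecl and http://www.iana.org/assignments/character-sets.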
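Following that quote, a minimal sketch of the same round trip with the standard encoding name "UTF-8" in place of "utf8". The names t5, data and t5_roundtrip are made up for this example and do not appear in the question.

```python
import xml.etree.ElementTree as ET

t5 = ET.Element('test')
t5.text = u'\u2603'

# Use the standard encoding name 'UTF-8' rather than the alias 'utf8'.
data = ET.tostring(t5, encoding='UTF-8', method='xml')

# Parse the serialized bytes back into an element.
t5_roundtrip = ET.fromstring(data)

# The snowman should come back unchanged.
print(t5_roundtrip.text == u'\u2603')
```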
Answer 1 (score: 0)
I found two workarounds:
Don't pass an encoding to tostring():
import xml.etree.ElementTree as ET
t3 = ET.Element('test')
t3.text = u'\u2603'
t3_roundtrip = ET.fromstring(ET.tostring(t3, method='xml'))
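A short sketch of why this first workaround sidesteps the problem, assuming the default behaviour of tostring(): with no encoding given, the output is plain ASCII and the non-ASCII character is written as a numeric character reference, so the parser never sees a raw multi-byte sequence. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

t3 = ET.Element('test')
t3.text = u'\u2603'

# With no encoding argument the serializer falls back to ASCII output.
data = ET.tostring(t3, method='xml')
print(repr(data))

# U+2603 is 9731 in decimal, so the snowman is written as the character
# reference &#9731; and every byte of the output is ASCII.
uses_char_ref = b'&#9731;' in data

# Parsing resolves the character reference back to the original character.
t3_roundtrip = ET.fromstring(data)
```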
Specify an XMLParser with utf-8 encoding:
import xml.etree.ElementTree as ET
t4 = ET.Element('test')
t4.text = u'\u2603'
t4_roundtrip_utf = ET.fromstring(
    ET.tostring(t4, encoding='utf8', method='xml'),
    parser=ET.XMLParser(encoding='utf-8'))
Why do I need to? Aren't XML files UTF-8 unless otherwise specified?
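On that last point, a small sketch suggesting the parser does assume UTF-8 when the document carries no encoding declaration at all. The bytes here are built by hand (raw and parsed are illustrative names), so no declaration is present. In the failing case the document does specify an encoding, namely the non-standard name utf8 written by tostring().

```python
import xml.etree.ElementTree as ET

# Hand-built UTF-8 bytes with no XML declaration and therefore no
# declared encoding.
raw = u'<test>\u2603</test>'.encode('utf-8')

# Without a declaration the parser treats the input as UTF-8.
parsed = ET.fromstring(raw)
print(parsed.text == u'\u2603')
```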