Question

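I need help understanding why parsing my XML file* with xml.etree.ElementTree produces the errors below.

* My test XML file contains Arabic characters.

Task: open and parse the file utf8_file.xml.

My first attempt: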

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)

Result 1:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)

My second attempt:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)

Result 2:

AttributeError: 'file' object has no attribute 'getiterator'

Please explain the errors above and comment on possible solutions.

Answer 1

Leave decoding the bytes to the parser; do not decode them first:

import xml.etree.ElementTree as etree
# Open in binary mode so the parser receives undecoded bytes.
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)

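The XML file must carry enough information in its first line (the XML declaration) for the parser to handle the decoding. If that header is missing, the parser must assume UTF-8.

Because it is the XML header that holds this information, it is the parser's responsibility to do all the decoding.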
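To illustrate how the parser picks the encoding on its own, here is a minimal sketch. The two sample documents and their contents are made up for this example; the only assumption is that the parser is given bytes rather than already-decoded text.

```python
import xml.etree.ElementTree as etree

# Hypothetical sample documents, built as raw bytes.
latin1_text = u'caf\u00e9'
arabic_text = u'\u0645\u0631\u062d\u0628\u0627'

# With a declaration: the parser reads the encoding from the XML header.
latin1_doc = (b'<?xml version="1.0" encoding="ISO-8859-1"?><root>'
              + latin1_text.encode('iso-8859-1') + b'</root>')

# Without a declaration: the parser falls back to UTF-8.
utf8_doc = b'<root>' + arabic_text.encode('utf-8') + b'</root>'

# fromstring() accepts bytes and returns the root Element with decoded text.
latin1_root = etree.fromstring(latin1_doc)
utf8_root = etree.fromstring(utf8_doc)
```

In both cases the element text comes back as a decoded string, without the caller ever naming a codec.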

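Your first attempt failed because the file had already been decoded to Unicode, and Python then tried to encode those Unicode values again (with the default ASCII codec) so that the parser could work with the byte strings it expects. The second attempt failed because etree.tostring() expects a parsed tree (an Element) as its first argument, not a file object.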
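Putting both points together, the following is a sketch of a working round trip. The sample content, the temporary directory, and the variable names are assumptions for illustration; a real utf8_file.xml would simply be opened in place.

```python
import os
import tempfile
import xml.etree.ElementTree as etree

# Hypothetical sample content standing in for utf8_file.xml.
sample = (u'<?xml version="1.0" encoding="UTF-8"?>\n'
          u'<greeting>\u0645\u0631\u062d\u0628\u0627</greeting>')

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'utf8_file.xml')
with open(path, 'wb') as f:
    f.write(sample.encode('utf-8'))

# Hand the parser raw bytes; it decodes them itself.
with open(path, 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)
root = xml_tree.getroot()

# parse() also accepts a filename and opens the file itself.
same_root = etree.parse(path).getroot()

# tostring() takes an Element (not a file) and, with an explicit
# byte encoding such as 'utf-8', returns encoded bytes.
xml_bytes = etree.tostring(root, encoding='utf-8', method='xml')

# fromstring() accepts those bytes and returns the root Element.
round_tripped = etree.fromstring(xml_bytes)
```

Note that tostring() and fromstring() are only needed when serialized XML is actually required; for simply reading the file, parse() alone is enough.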

How to correctly parse UTF-8 XML with ElementTree?
