Question

I am using Python 2.7.12.

When I parse the following XML file:

<?xml version="1.0" encoding="UTF-8" ?>
<data>value</data>

and check the type of the element's text like this:

>>> from xml.etree import ElementTree
>>> type(ElementTree.parse('test.xml').getroot().text)
<type 'str'>

I was surprised to see that it is str; I expected unicode. Only when I introduce a non-ASCII character into the XML file, for example:

<?xml version="1.0" encoding="UTF-8" ?>
<data>valuè</data>

is the text stored as unicode:

>>> type(ElementTree.parse('test.xml').getroot().text)
<type 'unicode'>

First, why is the xml API inconsistent in this way, and second, how can I force it to always use unicode?

Answer 1

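The XMLParser class in ElementTree.py (from the xml library) has a small helper function that tries to encode the text as ASCII and, if that fails, returns the original unicode object unchanged: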

def _fixtext(self, text):
    # convert text string to ascii, if possible
    try:
        return text.encode("ascii")
    except UnicodeError:
        return text

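That is why you see the type change: text made up only of ASCII characters comes back as str, while anything else stays unicode.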

Here is a link to the source code: https://hg.python.org/cpython/file/2.7/Lib/xml/etree/ElementTree.py#l1519
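As for forcing unicode, one possible approach is to normalize the text after parsing instead of changing the parser. The sketch below is only an illustration: `to_unicode` and `text_type` are hypothetical names, not part of ElementTree. It relies on the fact that the helper above returns a byte string only when the text is pure ASCII, so decoding with "ascii" should always succeed for values produced by the parser.

```python
from xml.etree import ElementTree

# Hypothetical compatibility alias: `unicode` on Python 2, `str` on Python 3.
try:
    text_type = unicode
except NameError:
    text_type = str


def to_unicode(value, encoding="ascii"):
    """Return value as a unicode string; None and unicode pass through unchanged."""
    if value is None or isinstance(value, text_type):
        return value
    # On Python 2 this branch handles the ASCII-only str that the parser returns.
    return value.decode(encoding)


root = ElementTree.fromstring(b"<data>value</data>")
text = to_unicode(root.text)  # a unicode string on both Python 2 and Python 3
```

Calling `to_unicode` at every place where `.text`, `.tail`, or attribute values are read keeps the rest of the code working with a single string type, at the cost of one extra call per access.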
