Question

I am getting

ElementTree.ParseError: reference to invalid character number

when parsing XML that contains the following as a tag value: loca&#1;t

My code is as follows:

import xml.etree.ElementTree as ET

respXML = httpResponse.content
# Decoding first, i.e. respXML = httpResponse.content.decode("utf-8"),
# gives the same error.

# This line raises the ParseError.
respRoot = ET.fromstring(respXML)

How can I keep my parser from failing on these seemingly invalid character numbers?

Answer 1

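This looks like HTML. Try running the string through the html package before you parse it (html.unescape has been part of the standard library html module since Python 3.4). https://pypi.python.org/pypi/html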

>>> import html
>>> test = "loca&#1;t"
>>> html.unescape(test)
'locat'

Then convert some known Unicode characters to their ASCII equivalents, for example:

“ => "
’ => '
...

Finally, replace double spaces with single spaces.
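The character replacements and the space cleanup described above could be sketched roughly like this. The mapping and the function name normalize_text are my own illustration, not a complete list; I have assumed the matching opening and closing quotes should be treated the same way, and that longer runs of spaces should also collapse to one.

```python
import re

# Hypothetical mapping: extend it with whatever characters show up in your data.
REPLACEMENTS = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark (assumed, not listed above)
    "\u2018": "'",  # left single quotation mark (assumed, not listed above)
    "\u2019": "'",  # right single quotation mark
}
_TABLE = str.maketrans(REPLACEMENTS)


def normalize_text(text):
    """Swap typographic quotes for ASCII ones, then collapse runs of spaces."""
    text = text.translate(_TABLE)
    # Two or more consecutive spaces become a single space.
    return re.sub(r" {2,}", " ", text)
```

You would call normalize_text on the string after html.unescape and before handing it to the parser.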

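Fixing every problem up front is tedious, so I suggest catching the specific exception and writing the offending lines to a file. Then work through the errors in that file one at a time, adding more rules as you go.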
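A minimal sketch of that catch-and-log idea, assuming each document arrives as one string. The function name parse_or_log and the default file name bad_xml.log are placeholders I made up.

```python
import xml.etree.ElementTree as ET


def parse_or_log(xml_text, bad_path="bad_xml.log"):
    """Return the parsed root element, or None after logging the bad input."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Append the error message and the raw input so it can be inspected later.
        with open(bad_path, "a", encoding="utf-8") as fh:
            fh.write(f"{exc}\t{xml_text!r}\n")
        return None
```

Each line of the log then shows the parser's message next to the input that triggered it, which makes it easier to decide which cleanup rule to add next.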

Good luck.

Answer 2

I sometimes find it useful to preserve the original input characters with a regular expression, e.g. re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s). For example, given

import html
import re
from xml.etree import ElementTree as ET

s = "<Tag>loca&#1;t</Tag>"

html.unescape produces

ET.fromstring(html.unescape(s)).text
#Out: 'locat'

whereas the regular expression above produces

ET.fromstring(re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s)).text
#Out: 'loca[#1;]t'

which preserves the "bad character".
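One caveat: that pattern rewrites every numeric reference, including valid ones such as &#65;. If you only want to mark the references the parser rejects, a variation along these lines might work. It is a sketch, not something I have used in production: the helper names are invented, and the validity check is my reading of the Char production in the XML 1.0 specification.

```python
import re
from xml.etree import ElementTree as ET

# Decimal (&#1;) or hexadecimal (&#x1;) character references.
_CHARREF = re.compile(r"&#(x[0-9a-fA-F]+|[0-9]+);")


def _is_xml_char(cp):
    """True if the code point is allowed in an XML 1.0 document."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def mark_invalid_charrefs(xml_text):
    """Rewrite only the invalid references as [#N;] and leave valid ones alone."""
    def repl(match):
        body = match.group(1)
        cp = int(body[1:], 16) if body[0] == "x" else int(body)
        return match.group(0) if _is_xml_char(cp) else f"[#{body};]"
    return _CHARREF.sub(repl, xml_text)


root = ET.fromstring(mark_invalid_charrefs("<Tag>loca&#1;t &#65;</Tag>"))
```

Here the &#1; reference is kept as visible text while &#65; is still resolved by the parser as usual.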
