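It is well known that certain character ranges are not allowed in XML documents. I am aware of solutions that filter out these characters (e.g. [1], [2]).
Following the Don't Repeat Yourself principle, I would rather implement one of these solutions at a single central point. Right now, I have to sanitize any potentially unsafe text before passing it to lxml. Is there a way to achieve this, e.g. by subclassing an lxml filter class, catching some exception, or setting a configuration switch?
Edit: To hopefully clarify the question a bit, here is some sample code: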
from lxml import etree
root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800'
print(etree.tostring(root))
root.text += '\x02'.decode("utf-8")
Running this produces the following output:
<root>&#65535;&#55296;</root>
Traceback (most recent call last):
File "[…]", line 9, in <module>
root.text += u'\u0002'
File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44956)
File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
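As you can see, the \x02 character raises an exception, but lxml happily escapes the other two out-of-range characters. The real trouble is that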
s = "<root>&#65535;&#55296;</root>"
root = etree.fromstring(s)
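also raises an exception. To me this behavior is somewhat disturbing, especially since it produces invalid XML documents.
It turns out this may be a Python 2 vs. 3 issue. With Python 3.4, the code above raises the following exception: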
Traceback (most recent call last):
File "[…]", line 5, in <module>
root.text += u'\ud800'
File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44971)
File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
File "apihelpers.pxi", line 1387, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26380)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 1: surrogates not allowed
The only remaining problem is the \uffff character, which lxml still happily accepts.
Answer 0 (score: 1)
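Just filter the string before handing it to lxml: cleaning invalid characters from XML (gist by lawlesst).
I tried it with your code and it seems to work, except that you need to change the gist so that it imports re and sys!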
from lxml import etree
from cleaner import invalid_xml_remove
root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800'
print(etree.tostring(root))
root.text += invalid_xml_remove('\x02'.decode("utf-8"))
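If you would rather not depend on the gist, here is a minimal sketch of the same idea for Python 3. It is not the gist's code: the names strip_invalid_xml_chars and set_text are hypothetical, and the character class is my reading of the Char production in the XML 1.0 specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). It uses only the standard library.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# This covers control characters such as \x02, lone surrogates such as
# \ud800, and the non-characters \ufffe and \uffff.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=""):
    """Return text with every character that XML 1.0 forbids replaced."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def set_text(element, text):
    """Hypothetical central helper: sanitize, then assign to element.text.

    Works with any object that has a writable .text attribute, which is
    assumed here to include lxml elements.
    """
    element.text = strip_invalid_xml_chars(text)
```

With this, the question's example would become set_text(root, u'\uffff\ud800\x02'), which leaves root.text empty instead of raising or writing invalid character references. Routing every assignment through one helper like this is about as close to a single central point as I know of; I am not aware of an lxml switch that does the filtering for you.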