I have a Python program that reads XML files and modifies the version attribute. Some of these files also contain a copyright notice with the copyright symbol ©. The lxml package converts it to the HTML entity &#169;. Is there a way to prevent this?
I tried the resolve_entities parameter of XMLParser, but it had no effect. I have tried Python 2.7 and 3.6.3. The program below works in Python 3.
# coding: utf-8
import os
import glob
import argparse
from lxml import etree
xParser = etree.XMLParser(strip_cdata=False, resolve_entities=False)
etree.set_default_parser(xParser)
someXML = '<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>'
doc = etree.fromstring(someXML)
print(someXML)
print(etree.tostring(doc))
It prints:
<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>
b'<node version="1.0.1"><copyright>Copyright &#169; 2017 by me</copyright></node>'
Answer 0 (score: 1)
You can specify the unicode encoding when dumping to a string:
etree.tostring(doc, encoding="unicode")
Demo:
In [1]: from lxml import etree
In [2]: xParser = etree.XMLParser(strip_cdata=False, resolve_entities=False)
In [3]: someXML ='<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>'
In [4]: doc = etree.fromstring(someXML, parser=xParser)
In [5]: print(someXML)
<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>
In [6]: print(etree.tostring(doc, encoding="unicode"))
<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>
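For context on why this works: resolve_entities only affects how entity references in the input are handled during parsing, while the &#169; comes from serialization. By default etree.tostring() emits ASCII bytes, so any non-ASCII character is written as a numeric character reference. The sketch below compares the three encodings side by side. It assumes Python 3 and an installed lxml; the variable names are only illustrative.

```python
# coding: utf-8
from lxml import etree

some_xml = '<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>'
doc = etree.fromstring(some_xml)

# Default encoding is ASCII: © cannot be represented, so it is escaped.
ascii_bytes = etree.tostring(doc)

# encoding="unicode" returns a text string (str) with © left intact.
text = etree.tostring(doc, encoding="unicode")

# encoding="utf-8" returns bytes, with © stored as its UTF-8 byte sequence.
utf8_bytes = etree.tostring(doc, encoding="utf-8")

print(ascii_bytes)
print(text)
print(utf8_bytes.decode("utf-8"))
```

If the modified document is written back to a file rather than printed, passing encoding="utf-8" to doc.getroottree().write(...) should keep the symbol in the same way, since it is the output encoding that decides whether the character has to be escaped.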