lxml Python package changes the copyright symbol to an HTML entity

Asked: 2017-12-12 19:09:05

Tags: python lxml

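I have a Python program that reads XML files and modifies the version attribute. Some of these files also contain a copyright notice with the copyright symbol ©. The lxml package converts it to the character reference &#169;. Is there a way to prevent this?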

I tried the resolve_entities parameter of the XMLParser function, but it had no effect. I have tried both Python 2.7 and 3.6.3. The program below is for Python 3.

# coding: utf-8
import os
import glob
import argparse
from lxml import etree

xParser = etree.XMLParser(strip_cdata=False, resolve_entities=False)
etree.set_default_parser(xParser)

someXML = '<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>'
doc = etree.fromstring(someXML)
print(someXML)
print(etree.tostring(doc))

This prints:

<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>
b'<node version="1.0.1"><copyright>Copyright &#169; 2017 by me</copyright></node>'

1 Answer:

Answer 0 (score: 1)

You can specify the unicode encoding when serializing to a string:

etree.tostring(doc, encoding="unicode")
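For context, the &#169; in the question comes from serialization rather than parsing: with no encoding argument, tostring() produces ASCII bytes, so any non-ASCII character has to be written as a numeric character reference. That is also why resolve_entities, which controls how entity references in the input are handled, makes no difference here. A minimal sketch comparing three encodings follows; the variable names are illustrative, and the fallback import is only there so the snippet also runs where lxml is not installed (the standard library's tostring() accepts the same encoding argument).

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, whose tostring()
    # takes the same encoding argument for this purpose.
    import xml.etree.ElementTree as etree

some_xml = '<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>'
doc = etree.fromstring(some_xml)

# Default: ASCII bytes, so © is escaped as a character reference.
ascii_bytes = etree.tostring(doc)

# UTF-8 bytes: © is written as the two bytes 0xC2 0xA9.
utf8_bytes = etree.tostring(doc, encoding="utf-8")

# "unicode": returns a str instead of bytes, with © left as-is.
text = etree.tostring(doc, encoding="unicode")

print(ascii_bytes)
print(utf8_bytes)
print(text)
```

The str result suits printing or further string handling; the UTF-8 bytes are probably the better fit when the output goes straight back into a file.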

Demo:

In [1]: from lxml import etree

In [2]: xParser = etree.XMLParser(strip_cdata=False, resolve_entities=False)

In [3]: someXML ='<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>'

In [4]: doc = etree.fromstring(someXML, parser=xParser)

In [5]: print(someXML)
<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>

In [6]: print(etree.tostring(doc, encoding="unicode"))
<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>
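Since the original goal was to rewrite the version attribute in files on disk, the same idea can be applied when writing the tree back out. The following is only a sketch under assumptions: set_version, the sample file name, and the new version string are made up for illustration, and it assumes the files are UTF-8 encoded. The fallback import is there so the snippet also runs without lxml installed.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library offers the same parse()/write() calls.
    import xml.etree.ElementTree as etree


def set_version(path, new_version):
    """Hypothetical helper: update the root's version attribute in place."""
    tree = etree.parse(path)
    tree.getroot().set("version", new_version)
    # Writing as UTF-8 keeps © as a literal character instead of &#169;.
    tree.write(path, encoding="utf-8", xml_declaration=True)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write('<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>')

    set_version(path, "1.0.2")

    with open(path, "rb") as f:
        result = f.read()

print(result.decode("utf-8"))
```

Passing xml_declaration=True is optional; it records the encoding in the file so other tools do not have to guess it.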