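I have found that cElementTree is about 30 times faster than xml.dom.minidom, and I am rewriting my XML encoding/decoding code. However, I need to output XML that contains CDATA sections, and there doesn't seem to be a way to do that with ElementTree.
Can it be done?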
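Answer 0 (score: 24)
After a bit of work, I found the answer myself. Looking at the ElementTree.py source code, I found that XML comments and processing instructions are handled specially. For each special element type there is a factory function that uses a special (non-string) tag value to distinguish it from regular elements.
def Comment(text=None):
    element = Element(Comment)
    element.text = text
    return element
Then, in ElementTree's _write function, which actually outputs the XML, there is a special case that handles comments:
if tag is Comment:
    file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))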
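The sentinel-tag trick can still be observed in the standard library's xml.etree.ElementTree. The following is a minimal sketch, assuming Python 3; the element name and comment text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ET.Comment is a factory function. The element it returns uses the
# function object itself as its tag, so the serializer can test
# `tag is Comment` instead of comparing strings.
root = ET.Element("data")
note = ET.Comment("generated file")
root.append(note)

is_sentinel = note.tag is ET.Comment
serialized = ET.tostring(root, encoding="unicode")

print(is_sentinel)
print(serialized)
```

Because the tag is not a string, it can never collide with a real element name, which is what makes the same idea usable for a CDATA factory.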
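To support CDATA sections, I created a factory function called CDATA, subclassed ElementTree, and overrode the _write function to handle CDATA elements.
This still doesn't help if you want to parse XML that contains CDATA sections and then write it back out with the CDATA sections intact, but it at least lets you create XML with CDATA sections programmatically, which is what I needed to do.
The implementation seems to work with both ElementTree and cElementTree.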
import elementtree.ElementTree as etree
#~ import cElementTree as etree
def CDATA(text=None):
    element = etree.Element(CDATA)
    element.text = text
    return element
class ElementTreeCDATA(etree.ElementTree):
    def _write(self, file, node, encoding, namespaces):
        if node.tag is CDATA:
            text = node.text.encode(encoding)
            file.write("\n<![CDATA[%s]]>\n" % text)
        else:
            etree.ElementTree._write(self, file, node, encoding, namespaces)
if __name__ == "__main__":
    import sys
    text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""
    e = etree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    et = ElementTreeCDATA(e)
    et.write(sys.stdout, "utf-8")
Answer 1 (score: 17)
Answer 2 (score: 10)
Here is a variant of gooli's solution that works with Python 3.2:
import xml.etree.ElementTree as etree
def CDATA(text=None):
    element = etree.Element('![CDATA[')
    element.text = text
    return element
etree._original_serialize_xml = etree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces):
    if elem.tag == '![CDATA[':
        write("\n<%s%s]]>\n" % (
            elem.tag, elem.text))
        return
    return etree._original_serialize_xml(
        write, elem, qnames, namespaces)
etree._serialize_xml = etree._serialize['xml'] = _serialize_xml
if __name__ == "__main__":
    import sys
    text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""
    e = etree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    et = etree.ElementTree(e)
    et.write(sys.stdout.buffer.raw, "utf-8")
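Answer 3 (score: 6)
AFAIK it's not possible... which is a pity. Basically, the ElementTree module assumes that the reader is 100% XML compliant, so it should not matter whether a piece of text is written as a CDATA section or in some other form that produces equivalent text.
See this thread on the Python mailing list for details. Basically, they recommend using some kind of DOM-based XML library instead.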
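The equivalence this answer relies on can be checked with a small sketch (the sample documents are made up for illustration): a compliant parser returns the same text whether it was written as a CDATA section or with escaped characters.

```python
import xml.etree.ElementTree as ET

# The same character data, once inside a CDATA section and once escaped.
with_cdata = ET.fromstring("<a><![CDATA[x < y & z]]></a>")
escaped = ET.fromstring("<a>x &lt; y &amp; z</a>")

same_text = with_cdata.text == escaped.text
print(with_cdata.text)
print(same_text)
```

This is why ElementTree does not track CDATA sections on parsing: by the time the tree is built, the distinction is already gone.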
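Answer 4 (score: 6)
Actually, this code has a bug, because it does not handle the case where ]]> appears in the data you are inserting as CDATA.
According to Is there a way to escape a CDATA end token in xml?, in that case you should split the data into two CDATA sections, breaking the ]]> between them.
Basically: data = data.replace("]]>", "]]]]><![CDATA[>")
(not necessarily correct, please verify)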
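One way to verify the replacement is to wrap some data and parse it back. This is a minimal sketch; the helper name to_cdata and the wrapper element r are hypothetical and not part of any library.

```python
import xml.etree.ElementTree as ET

def to_cdata(data):
    """Wrap data in CDATA markup, splitting every ']]>' across two sections."""
    return "<![CDATA[" + data.replace("]]>", "]]]]><![CDATA[>") + "]]>"

wrapped = to_cdata("a]]>b")
# Parse the result back to confirm the original text survives the round trip.
round_trip = ET.fromstring("<r>" + wrapped + "</r>").text

print(wrapped)
print(round_trip)
```

The first section ends with `]]` and the second starts with `>`, so the terminator never appears inside a single section, and the parser joins the two pieces back into the original text.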
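Answer 5 (score: 6)
I don't know whether the previously proposed code worked in earlier versions, or whether the ElementTree module has been updated since, but I ran into problems using this trick: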
etree._original_serialize_xml = etree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces):
    if elem.tag == '![CDATA[':
        write("\n<%s%s]]>\n" % (
            elem.tag, elem.text))
        return
    return etree._original_serialize_xml(
        write, elem, qnames, namespaces)
etree._serialize_xml = etree._serialize['xml'] = _serialize_xml
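The problem with this approach is that, after handling the special case, the serializer processes the element again as an ordinary tag. I got something like this: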
<textContent>
<![CDATA[this was the code I wanted to put inside of CDATA]]>
<![CDATA[>this was the code I wanted to put inside of CDATA</![CDATA[>
</textContent>
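Of course, we know this will only cause a lot of errors. Why does it happen?
The answer lies in this little guy:
return etree._original_serialize_xml(write, elem, qnames, namespaces)
Once we have caught our CDATA element and written it out, we don't want to run it through the original serialization function again. So the CDATA case goes in the "if" block, and we fall back to the original serialization function only when the element is not CDATA. What was missing was an "else" before the call to the original function.
Also, in my version of the ElementTree module, the serialize function insists on a "short_empty_elements" argument. So the latest version I recommend looks like this (it also handles the "tail"):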
from xml.etree import ElementTree
# To test this, create a testing.xml file in the same folder as the script.
xmlParsedWithET = ElementTree.parse("testing.xml")
root = xmlParsedWithET.getroot()
def CDATA(text=None):
    element = ElementTree.Element('![CDATA[')
    element.text = text
    return element
ElementTree._original_serialize_xml = ElementTree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs):
    if elem.tag == '![CDATA[':
        write("\n<{}{}]]>\n".format(elem.tag, elem.text))
        if elem.tail:
            # _escape_cdata is a private helper of the ElementTree module
            write(ElementTree._escape_cdata(elem.tail))
    else:
        return ElementTree._original_serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs)
ElementTree._serialize_xml = ElementTree._serialize['xml'] = _serialize_xml
text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""
e = ElementTree.Element("data")
cdata = CDATA(text)
root.append(cdata)
# tests
print(root)
# getchildren() was removed in Python 3.9, so index the element directly
print(root[0])
print(root[0].text + "\n\nyay!")
The output I get is:
<Element 'Database' at 0x10062e228>
<Element '![CDATA[' at 0x1021cc9a8>
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
yay!
Hope you get the same result!
Answer 6 (score: 4)
This is what finally worked for me in Python 2.7. It is similar to Amaury's answer.
import xml.etree.ElementTree as ET
ET._original_serialize_xml = ET._serialize_xml
def _serialize_xml(write, elem, encoding, qnames, namespaces):
    if elem.tag == '![CDATA[':
        # elem.tail is None when nothing follows the element; avoid writing "None"
        write("<%s%s]]>%s" % (elem.tag, elem.text, elem.tail or ""))
        return
    return ET._original_serialize_xml(
        write, elem, encoding, qnames, namespaces)
ET._serialize_xml = ET._serialize['xml'] = _serialize_xml
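Answer 7 (score: 1)
The DOM has (at least at Level 2) an interface CDATASection and an operation Document::createCDATASection. They are extended interfaces, supported only if the implementation supports the "xml" feature.
from xml.dom import minidom
my_xmldoc = minidom.parse(xmlfile)
my_xmldoc.createCDATASection(data)
Now you have a CDATA node that you can add wherever you want.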
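A complete minimal sketch of the minidom route, building a document from scratch instead of parsing a file (the element name and payload are made up for illustration):

```python
from xml.dom import minidom

# Build a small document and attach a CDATA section to its root element.
doc = minidom.Document()
root = doc.createElement("data")
doc.appendChild(root)
root.appendChild(doc.createCDATASection("<p>some text</p>"))

xml_text = root.toxml()
print(xml_text)
```

Note that createCDATASection only creates the node; it has no effect on the output until it is appended somewhere in the tree.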
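Answer 8 (score: 1)
The accepted solution does not work with Python 2.7. However, there is another package called lxml which (although slightly slower) shares largely the same syntax as xml.etree.ElementTree. lxml can both write and parse CDATA. The documentation is here.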
Answer 9 (score: 1)
I found a hack that gets CDATA working by using comments:
node.append(etree.Comment(' --><![CDATA[' + data.replace(']]>', ']]]]><![CDATA[>') + ']]><!-- '))
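To see what this hack actually writes, here is a minimal sketch, assuming Python 3's xml.etree.ElementTree, which emits comment text without escaping it. The helper name cdata_via_comment is hypothetical.

```python
import xml.etree.ElementTree as ET

def cdata_via_comment(data):
    """Return a Comment element whose text closes the comment early,
    emits a CDATA section, and then opens a second, empty comment."""
    safe = data.replace("]]>", "]]]]><![CDATA[>")
    return ET.Comment(" --><![CDATA[" + safe + "]]><!-- ")

root = ET.Element("root")
root.append(cdata_via_comment("a < b"))
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The CDATA section ends up sandwiched between two near-empty comments, so the output is well formed but carries that extra markup.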
Answer 10 (score: 1)
You can override ElementTree's _escape_cdata function:
import xml.etree.ElementTree as ET
def _escape_cdata(text, encoding):
    try:
        if "&" in text:
            text = text.replace("&", "&amp;")
        # if "<" in text:
        #     text = text.replace("<", "&lt;")
        # if ">" in text:
        #     text = text.replace(">", "&gt;")
        return text
    except TypeError:
        raise TypeError(
            "cannot serialize %r (type %s)" % (text, type(text).__name__)
        )
ET._escape_cdata = _escape_cdata
Note that you may not need to pass the extra encoding parameter, depending on your library/Python version.
Now you can write CDATA to obj.text like this:
root = ET.Element('root')
body = ET.SubElement(root, 'body')
body.text = '<![CDATA[perform extra angle brackets escape for this text]]>'
print(ET.tostring(root))
And get a clean CDATA node:
<root>
<body>
<![CDATA[perform extra angle brackets escape for this text]]>
</body>
</root>
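Overriding _escape_cdata globally turns off angle-bracket escaping for every text node. A narrower variant, sketched below, patches it only for the duration of one serialization and passes through only text that is already wrapped in CDATA markers. This assumes Python 3, where the private _escape_cdata takes a single text argument; the name raw_cdata_text is hypothetical.

```python
import contextlib
import xml.etree.ElementTree as ET

@contextlib.contextmanager
def raw_cdata_text():
    """Temporarily let text wrapped in <![CDATA[ ... ]]> through unescaped."""
    original = ET._escape_cdata  # private helper; may change between versions

    def passthrough(text):
        if isinstance(text, str) and text.startswith("<![CDATA[") and text.endswith("]]>"):
            return text
        return original(text)

    ET._escape_cdata = passthrough
    try:
        yield
    finally:
        ET._escape_cdata = original  # always restore normal escaping

root = ET.Element("root")
ET.SubElement(root, "body").text = "<![CDATA[a < b]]>"
ET.SubElement(root, "other").text = "c < d"

with raw_cdata_text():
    patched = ET.tostring(root, encoding="unicode")
restored = ET.tostring(root, encoding="unicode")
print(patched)
```

Ordinary text such as "c < d" is still escaped, and escaping returns to normal once the with block exits. The sketch does not check for a stray ]]> inside the wrapped text.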
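Answer 11 (score: 0)
Here is my version, based on gooli's and Amaury's answers above. It works with both ElementTree 1.2.6 and 1.3.0, which use very different approaches.
Note that gooli's version does not work with 1.3.0, which seems to be the current standard in Python 2.7.x.
Also note that this version does not use the CDATA() method that gooli used.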
import xml.etree.cElementTree as ET
class ElementTreeCDATA(ET.ElementTree):
    """Subclass of ElementTree which handles CDATA blocks reasonably"""
    def _write(self, file, node, encoding, namespaces):
        """This method is for ElementTree <= 1.2.6"""
        if node.tag == '![CDATA[':
            text = node.text.encode(encoding)
            file.write("\n<![CDATA[%s]]>\n" % text)
        else:
            ET.ElementTree._write(self, file, node, encoding, namespaces)
def _serialize_xml(write, elem, qnames, namespaces):
    """This method is for ElementTree >= 1.3.0"""
    if elem.tag == '![CDATA[':
        write("\n<![CDATA[%s]]>\n" % elem.text)
    else:
        ET._serialize_xml(write, elem, qnames, namespaces)
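Answer 12 (score: 0)
I came here looking for a way to "parse XML with CDATA sections and then output it again with the CDATA sections".
I was able to do this (maybe lxml has been updated since this post?) with the following (it's a bit rough, sorry ;-)). Someone else may have a better way of finding CDATA sections programmatically, but I was too lazy.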
from lxml import etree
# ppath and opath are the input and output file paths, defined elsewhere
parser = etree.XMLParser(encoding='utf-8')  # my original xml was utf-8 and that was a lot of the problem
tree = etree.parse(ppath, parser)
for cdat in tree.findall('./ProjectXMPMetadata'):  # the tag where my CDATA lives
    cdat.text = etree.CDATA(cdat.text)
# other stuff here
tree.write(opath, encoding="UTF-8",)
Answer 13 (score: 0)
For Python 3 and ElementTree, you can use the following recipe:
import xml.etree.ElementTree as ET
ET._original_serialize_xml = ET._serialize_xml
def serialize_xml_with_CDATA(write, elem, qnames, namespaces, short_empty_elements, **kwargs):
    if elem.tag == 'CDATA':
        write("<![CDATA[{}]]>".format(elem.text))
        return
    return ET._original_serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs)
ET._serialize_xml = ET._serialize['xml'] = serialize_xml_with_CDATA
def CDATA(text):
    element = ET.Element("CDATA")
    element.text = text
    return element
my_xml = ET.Element("my_name")
my_xml.append(CDATA("<p>some text</p>"))
tree = ET.ElementTree(my_xml)
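If you need the XML as a str, you can use
ET.tostring(tree.getroot(), encoding="unicode")
or the following hack (almost the same as the code inside tostring()):
from io import BytesIO
fake_file = BytesIO()
tree.write(fake_file, encoding="utf-8", xml_declaration=True)
result_xml_text = str(fake_file.getvalue(), encoding="utf-8")
and get the result: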
<?xml version='1.0' encoding='utf-8'?>
<my_name>
<![CDATA[<p>some text</p>]]>
</my_name>
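Answer 14 (score: 0)
The main idea is to convert the element tree to a string and then call unescape on it. Once we have the string, we use standard Python to write it to a file.
Based on: How to write unescaped string to a XML element with ElementTree?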
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape
# defining the tree structure
element1 = ET.Element('test1')
element1.text = '<![CDATA[Wired & Forbidden]]>'
# &, < and > come out escaped (as &amp;, &lt; and &gt;)
string1 = ET.tostring(element1).decode()
print(string1)
# now they are not escaped anymore
# more formally, we unescape '&amp;', '&lt;', and '&gt;' in a string of data
# from https://docs.python.org/3.8/library/xml.sax.utils.html#xml.sax.saxutils.unescape
string1 = unescape(string1)
print(string1)
element2 = ET.Element('test2')
element2.text = '<![CDATA[Wired & Forbidden]]>'
string2 = unescape(ET.tostring(element2).decode())
print(string2)
# make the xml file and open in append mode
with open('foo.xml', 'a') as f:
    f.write(string1 + '\n')
    f.write(string2)
<test1><![CDATA[Wired & Forbidden]]></test1>
<test2><![CDATA[Wired & Forbidden]]></test2>
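Calling unescape on the whole serialized string also unescapes any ordinary text in the document, which can leave it malformed. A possible refinement, sketched here with a hypothetical helper restore_cdata, unescapes only the spans that were written as CDATA markers. It assumes the CDATA payload itself does not contain ]]>.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# Matches a CDATA section after ElementTree has escaped its delimiters.
CDATA_ESCAPED = re.compile(r"&lt;!\[CDATA\[(.*?)\]\]&gt;", re.DOTALL)

def restore_cdata(serialized):
    """Unescape only the escaped CDATA sections, leaving other text alone."""
    return CDATA_ESCAPED.sub(
        lambda m: "<![CDATA[" + unescape(m.group(1)) + "]]>", serialized
    )

root = ET.Element("root")
ET.SubElement(root, "test1").text = "<![CDATA[Wired & Forbidden]]>"
ET.SubElement(root, "plain").text = "a < b"

result = restore_cdata(ET.tostring(root, encoding="unicode"))
print(result)
```

The plain element keeps its escaped form, so the result can still be parsed back by a standard XML parser.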