How can I use Python to write an XML file whose tags contain HTML tags in their text?

Asked: 2016-11-21 16:57:48

Tags: python html xml wordpress xml-parsing

I recently realized that XML containing HTML tags in the body of some of its tags seems to choke parsers like WP All Import.

To mitigate this, I tried writing a Python script that outputs the XML correctly.

It starts from this XML file (this is just an excerpt):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
    ...
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
        <Introduction>introduction-detian-waterfall.html</Introduction>
    </Row>
    ...
</Root>

The desired output is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
    ...
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
        <Introduction>introduction-detian-waterfall.html</Introduction>
        <Introduction_Body><![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like <a href="http://blah.com/blah.html"></a>, <br>, <img src="http://blahimg.jpg">, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]> </Introduction_Body>
    </Row>
    ...
</Root>

Unfortunately, I get strange escaped characters like the following:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
    ...
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (&#24503;&#22825;&#28689;&#24067; [D&#233;ti&#257;n P&#249;b&#249;])</Waterfall_Name>
        <File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
        <Introduction>introduction-detian-waterfall.html</Introduction>
        <Introduction_Body>&lt;![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like &lt;a href="http://blah.com/blah.html"&gt;&lt;/a&gt;, &lt;br&gt;, &lt;img src="http://blahimg.jpg"&gt;, etc. It should also preserve carriage returns and characters like &#24503;&#22825;&#28689;&#24067; [D&#233;ti&#257;n P&#249;b&#249;]...]]&gt; </Introduction_Body>
    </Row>
    ...
</Root>

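So I would like to fix the following:
1) Output a new XML file that preserves the text as is, both in the newly introduced "Introduction_Body" tag and in any other tags such as "Waterfall_Name".
2) Is it possible to pretty-print the output cleanly (for human readability)? If so, how?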

My Python code currently looks like this:

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os

data_file = 'test3_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()

for element in root:
    if element.find('File_directory') is not None:
        directory = element.find('File_directory').text
        if element.find('Introduction') is not None:
            introduction = element.find('Introduction').text
            intro_tree = directory+introduction
            with open(intro_tree, 'r') as f: #note this with statement eliminates need for f.close()
                intro_text = f.read()
                intro_body = ET.SubElement(element,'Introduction_Body')
                intro_body.text = '<![CDATA[' + intro_text + ']]>'

#tree.write('new_' + data_file)  #same result but leaves out the xml header

f = open('new_' + data_file, 'w')
f.write('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' + ET.tostring(root))
f.close()

Thanks,
Johnny

1 Answer:

Answer 0: (score: 1)

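I suggest you switch to lxml. It is well documented and (almost) fully compatible with Python's own xml library, so you will probably only need minimal changes to your code. lxml supports CDATA very easily: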

> from lxml import etree
> elmnt = etree.Element('root')
> elmnt.text = etree.CDATA('abcd')
> etree.dump(elmnt)

<root><![CDATA[abcd]]></root>
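
For contrast, here is a minimal sketch of why the standard library's ElementTree produced the escaped output shown in the question. It assumes Python 3 (the question's code looks like Python 2), and the element contents are shortened stand-ins for the real data. ElementTree treats anything assigned to `.text` as plain character data, so hand-written CDATA markers are escaped like any other text, and the default US-ASCII output encoding turns non-ASCII characters into numeric references.

```python
import xml.etree.ElementTree as ET

# Non-ASCII text: the default US-ASCII output falls back to numeric references.
name = ET.Element('Waterfall_Name')
name.text = 'Detian Waterfall (德天瀑布 [Détiān Pùbù])'
ascii_bytes = ET.tostring(name)                        # bytes, US-ASCII
unicode_text = ET.tostring(name, encoding='unicode')   # str, characters kept

# Hand-written CDATA markers are ordinary text to ElementTree,
# so "<" and ">" are escaped on output.
body = ET.Element('Introduction_Body')
body.text = '<![CDATA[<br>]]>'
escaped = ET.tostring(body, encoding='unicode')

print(ascii_bytes)
print(unicode_text)
print(escaped)
```

Requesting a Unicode-capable encoding is enough to keep characters such as 德天瀑布 readable, but it does not produce a real CDATA section: a parser reading the escaped form back would return the literal markers as part of the text.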

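Beyond that, you should definitely use whatever library you choose not only for parsing the XML but also for writing it! lxml will write the declaration for you: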

> print(etree.tostring(elmnt, encoding="utf-8", xml_declaration=True).decode("utf-8"))

<?xml version='1.0' encoding='utf-8'?>
<root><![CDATA[abcd]]></root>
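
Putting the pieces together, the loop from the question might be adapted to lxml roughly as follows. This is only a sketch under some assumptions: Python 3, UTF-8 encoded HTML files, and a hypothetical helper name `add_intro_bodies` with a `base_dir` parameter that is not part of the original code. The demo builds a throwaway directory and a one-row document instead of reading 'test3_of_2016-09-19.xml'. It also covers the pretty-printing part of the question: `pretty_print=True` re-indents reliably only when the parser was told to drop the existing whitespace with `remove_blank_text=True`.

```python
import os
import tempfile
from lxml import etree


def add_intro_bodies(root, base_dir='.'):
    """Append an <Introduction_Body> CDATA child to each <Row> naming an intro file."""
    for row in root.iter('Row'):
        directory = row.findtext('File_directory')
        introduction = row.findtext('Introduction')
        if not directory or not introduction:
            continue  # row does not reference an introduction file
        path = os.path.join(base_dir, directory, introduction)
        with open(path, encoding='utf-8') as f:
            body = etree.SubElement(row, 'Introduction_Body')
            # CDATA keeps the HTML markup unescaped in the serialized output.
            body.text = etree.CDATA(f.read())
    return root


# Demo with made-up sample data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'writeups'))
    with open(os.path.join(tmp, 'writeups', 'intro.html'), 'w', encoding='utf-8') as f:
        f.write('<a href="http://blah.com/blah.html">德天瀑布</a><br>\nSecond line')

    xml_in = (
        '<Root><Row>'
        '<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>'
        '<File_directory>./writeups/</File_directory>'
        '<Introduction>intro.html</Introduction>'
        '</Row></Root>'
    )
    # Dropping whitespace-only text lets pretty_print re-indent the tree.
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.fromstring(xml_in.encode('utf-8'), parser)
    add_intro_bodies(root, base_dir=tmp)

    result = etree.tostring(
        root,
        encoding='UTF-8',
        xml_declaration=True,
        standalone=True,
        pretty_print=True,
    ).decode('utf-8')

print(result)
```

For a real file, `etree.parse(data_file, parser)` followed by `tree.write('new_' + data_file, encoding='UTF-8', xml_declaration=True, standalone=True, pretty_print=True)` should behave the same way. One caveat: a CDATA section cannot contain the sequence `]]>`, so an HTML file that includes it would need special handling.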