Python etree: parsing XML that contains HTML entities (keeping the HTML formatting)

Time: 2017-03-02 16:43:07

Tags: python html xml elementtree

I have the following XML:

<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:app="http://www.w3.org/2007/app" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:metadata="http://xmlns.escenic.com/2010/atom-metadata">
 <content type="application/vnd.vizrt.payload+xml">
    <vdf:payload xmlns:vdf="http://www.vizrt.com/types">
      <vdf:field name="body">
        <vdf:value>

          <div xmlns="http://www.w3.org/1999/xhtml">
            <p>I saluti dal Sud partono con <strong>Elsa Albonico</strong>, storica  "golosit&#xE0;", con i pi&#xF9; piccoli "fare le conte".</p>
            <p>I saluti dal Nord la <a href="http://www.proticino.ch/sezioni-in-svizzera/basilea/">Pro Ticino di Basilea</a> con un particolarit&#xE0; frammenti&#xA0;&#xA0; </p>
            <p><a href="https://www.rts.ch/">RTS</a> "Kiosque &#xE0; Musiques" con <strong>Jean-Marc Richard</strong>. <br/>A fare da<em> fil&#xA0;rouge</em> al nostro </p>
            <p>
              <a href="http://internal.publishing.production.rsi.ch/webservice/escenic/content/8762014" id="_360b1131-e6a5-49b6-995e-a624c888617a">Le foto del gioco, Finestra popolare 26.02.2017</a>
            </p>
          </div>

        </vdf:value>
      </vdf:field>
    </vdf:payload>
  </content>
 </entry>

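The "body" field contains HTML that I have to copy into another file as HTML, so string replacement or similar tricks are not allowed.

I am using Python and ElementTree.

Is there a way to do this?

I have already tried using tail instead of text, but I lose the HTML formatting, which is a big problem.

Please help.

Thanks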

CP

1 answer:

Answer 0: (score: 0)

This is a really ugly solution, but it works! As homework, make it better!

import xml.etree.ElementTree as ET

data = '''<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:app="http://www.w3.org/2007/app" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:metadata="http://xmlns.escenic.com/2010/atom-metadata">
 <content type="application/vnd.vizrt.payload+xml">
    <vdf:payload xmlns:vdf="http://www.vizrt.com/types">
      <vdf:field name="body">
        <vdf:value>

          <div xmlns="http://www.w3.org/1999/xhtml">
            <p>I saluti dal Sud partono con <strong>Elsa Albonico</strong>, storica  "golosit&#xE0;", con i pi&#xF9; piccoli "fare le conte".</p>
            <p>I saluti dal Nord la <a href="http://www.proticino.ch/sezioni-in-svizzera/basilea/">Pro Ticino di Basilea</a> con un particolarit&#xE0; frammenti&#xA0;&#xA0; </p>
            <p><a href="https://www.rts.ch/">RTS</a> "Kiosque &#xE0; Musiques" con <strong>Jean-Marc Richard</strong>. <br/>A fare da<em> fil&#xA0;rouge</em> al nostro </p>
            <p>
              <a href="http://internal.publishing.production.rsi.ch/webservice/escenic/content/8762014" id="_360b1131-e6a5-49b6-995e-a624c888617a">Le foto del gioco, Finestra popolare 26.02.2017</a>
            </p>
          </div>

        </vdf:value>
      </vdf:field>
    </vdf:payload>
  </content>
 </entry>'''

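tree = ET.fromstring(data)
# Element.getchildren() was removed in Python 3.9, so index the elements instead:
# entry -> content -> vdf:payload -> vdf:field -> vdf:value -> div
div = tree[0][0][0][0][0]

# Note: itertext() yields only the text nodes, so the tags themselves are not written.
with open('./result.html', 'w', encoding='utf-8') as html:
    html.writelines(div.itertext())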
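
Since the question asks for the markup to be kept, one possible improvement is to serialize the div element itself rather than collect its text. What follows is only a minimal sketch under a few assumptions: SAMPLE is a shortened stand-in for the XML in the question, extract_body_html is a hypothetical helper name, and the path assumes the field is always named "body" and holds a single XHTML div. Registering the XHTML namespace under the empty prefix is what makes ElementTree write plain <p> and <strong> tags instead of prefixed ones.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
VDF_NS = "http://www.vizrt.com/types"

# Shortened stand-in for the document in the question (same nesting, less text).
SAMPLE = '''<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <content type="application/vnd.vizrt.payload+xml">
    <vdf:payload xmlns:vdf="http://www.vizrt.com/types">
      <vdf:field name="body">
        <vdf:value>
          <div xmlns="http://www.w3.org/1999/xhtml">
            <p>storica <strong>Elsa Albonico</strong>, "golosit&#xE0;" <br/>fil&#xA0;rouge</p>
          </div>
        </vdf:value>
      </vdf:field>
    </vdf:payload>
  </content>
</entry>'''


def extract_body_html(xml_text):
    """Return the XHTML <div> of the "body" field as a markup string.

    Hypothetical helper, written for illustration only.
    """
    # Map XHTML to the empty prefix so tags serialize as <p>, not <html:p>.
    # This registration is global to the ElementTree module.
    ET.register_namespace("", XHTML_NS)

    root = ET.fromstring(xml_text)
    ns = {"vdf": VDF_NS, "xhtml": XHTML_NS}
    div = root.find(".//vdf:field[@name='body']/vdf:value/xhtml:div", ns)
    if div is None:
        raise ValueError("no XHTML <div> found in the body field")

    # tostring() also writes the element's tail; drop the whitespace after </div>.
    div.tail = None
    return ET.tostring(div, encoding="unicode")


body_html = extract_body_html(SAMPLE)
print(body_html)
```

The returned string still contains the inline tags and the xmlns declaration on the div. Character references such as &#xE0; come back as the actual characters, so the string should be written to a file opened with encoding="utf-8". The default XML serializer writes empty elements in the short form (<br />).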