Preserving entity references when parsing XML

Time: 2013-07-26 21:00:08

Tags: python xml parsing elementtree

Running the following simple script:

from lxml import etree

tree = etree.parse('VimKeys.xml')
root = tree.getroot()

for child in root:
  print ("<table>")
  print ("<caption>" + child.attrib['title'] + "</caption>")
  for child in child:
    print ("<tr>")
    print ("<th>" + child.text + "</th>")
    print ("<td>" + child.attrib['description'] + "</td>")
    print ("</tr>")
  print ("</table>")

against the following XML:

<keycommands>
  <category title="Editing">
    <key description="replace">r</key>
    <key description="change, line">c,cc</key>
    <key description="join line with the following">J</key>
    <key description="delete &amp; insert">s</key>
    <key description="change case">~</key>
    <key description="apply last edit">.</key>
    <key description="undo, redo">u,⌃+r</key>
    <key description="indent line right, left">&gt;&gt;,&lt;&lt;</key>
    <key description="auto-indent line">==</key>
  </category>
</keycommands>

produces the following output:

<table>
  <caption>Editing</caption>
  <tr>
    <th>r</th>
    <td>replace</td>
  </tr>
  <tr>
    <th>c,cc</th>
    <td>change, line</td>
  </tr>
  <tr>
    <th>J</th>
    <td>join line with the following</td>
  </tr>
  <tr>
    <th>s</th>
    <td>delete & insert</td>
  </tr>
  <tr>
    <th>~</th>
    <td>change case</td>
  </tr>
  <tr>
    <th>.</th>
    <td>apply last edit</td>
  </tr>
  <tr>
    <th>u,⌃+r</th>
    <td>undo, redo</td>
  </tr>
  <tr>
    <th>>>,<<</th>
    <td>indent line right, left</td>
  </tr>
  <tr>
    <th>==</th>
    <td>auto-indent line</td>
  </tr>
</table>

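This is invalid HTML, because the less-than and greater-than signs appear as literal characters even though they were written as the references &lt; and &gt; in the source document.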

How can I preserve these references in the final output?

1 answer:

Answer 0 (score: 1)

Use the Element class to build a new XML tree, rather than formatting it "by hand" with print:

import lxml.etree as ET

tree = ET.parse('VimKeys.xml')
root = tree.getroot()

newroot = ET.Element('root')
for i, child in enumerate(root):
    table = ET.Element('table')
    newroot.insert(i, table)
    caption = ET.Element('caption')
    caption.text = child.attrib['title']
    table.insert(0, caption)
    for j, c in enumerate(child, 1):
        tr = ET.Element('tr')
        table.insert(j, tr)
        th = ET.Element('th')
        th.text = c.text
        tr.insert(0, th)

        td = ET.Element('td')
        td.text = c.attrib['description']
        tr.insert(1, td)

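# encoding='unicode' makes tostring return text instead of bytes,
# so print shows the markup itself under Python 3.
print(ET.tostring(newroot, pretty_print=True, encoding='unicode'))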
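
Building a tree works because the parser resolves each reference into its literal character, and the serializer escapes the character again when the tree is written out. The sketch below illustrates that round trip on a single made-up key element. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Parsing resolves the references: the element holds literal characters.
key = ET.fromstring(
    '<key description="delete &amp; insert">&gt;&gt;,&lt;&lt;</key>'
)
assert key.text == '>>,<<'
assert key.get('description') == 'delete & insert'

# Copy the parsed values into new elements, as the answer's loop does.
th = ET.Element('th')
th.text = key.text
td = ET.Element('td')
td.text = key.get('description')

# Serializing escapes the special characters again.
th_markup = ET.tostring(th, encoding='unicode')
td_markup = ET.tostring(td, encoding='unicode')
print(th_markup)
print(td_markup)
```

The same reasoning applies to the ampersand in "delete & insert", which the print-based script also emits unescaped.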

Alternatively, use the E-factory. This makes the intended structure easier to read (and modify):

import lxml.etree as ET
import lxml.builder as builder

tree = ET.parse('VimKeys.xml')
root = tree.getroot()

E = builder.E
tables = []
for child in root:
    trs = []
    for c in child:
        trs.append(E('tr',
                     E('th', c.text),
                     E('td', c.attrib['description'])))
    tables.append(E('table',
                    E('caption', child.attrib['title']),
                    *trs))

newroot = E('root', *tables)
# encoding='unicode' makes tostring return text instead of bytes.
print(ET.tostring(newroot, pretty_print=True, encoding='unicode'))
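
If the print-based structure of the original script has to stay, a third option (not part of the answer above, and offered only as a sketch) is to escape each value before concatenating it into the markup. The example assumes the standard library's html.escape and xml.etree.ElementTree; render_tables and SAMPLE are hypothetical names, and SAMPLE is a shortened copy of the question's document.

```python
from html import escape
import xml.etree.ElementTree as ET

# Shortened copy of the question's input (assumption: same structure).
SAMPLE = """<keycommands>
  <category title="Editing">
    <key description="delete &amp; insert">s</key>
    <key description="indent line right, left">&gt;&gt;,&lt;&lt;</key>
  </category>
</keycommands>"""


def render_tables(root):
    """Return one HTML table per category, escaping every inserted value."""
    lines = []
    for category in root:
        lines.append('<table>')
        lines.append('<caption>' + escape(category.get('title', '')) + '</caption>')
        for key in category:
            lines.append('<tr>')
            # escape() turns &, < and > (and quotes) back into references.
            lines.append('<th>' + escape(key.text or '') + '</th>')
            lines.append('<td>' + escape(key.get('description', '')) + '</td>')
            lines.append('</tr>')
        lines.append('</table>')
    return '\n'.join(lines)


html_out = render_tables(ET.fromstring(SAMPLE))
print(html_out)
```

This keeps the code close to the original, but every value has to be escaped by hand, which is the kind of mistake the tree-building approaches avoid.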