Sorting XML by an attribute in Python while keeping all child nodes of each parent node

Time: 2017-12-22 04:52:48

Tags: python xml

I have an XML file that I want to sort by an attribute value. Here is the XML file:

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <name>imglab dataset</name>
  <comment>Created by imglab tool.</comment>
  <images>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
      <box top="175" left="59" width="73" height="29">
        <label>groundpainting_hotstar</label>
      </box>
      <box top="174" left="205" width="56" height="24">
        <label>groundpainting_yesbank</label>
      </box>
      <box top="170" left="141" width="44" height="32">
        <label>groundpainting_vodafone</label>
      </box>
    </image>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
      <box top="198" left="17" width="32" height="10">
        <label>sightscreen_pepsi</label>
      </box>
    </image>
 </images>
</dataset>

The desired output is:

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <name>imglab dataset</name>
  <comment>Created by imglab tool.</comment>
  <images>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
      <box top="198" left="17" width="32" height="10">
        <label>sightscreen_pepsi</label>
      </box>
    </image>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
      <box top="175" left="59" width="73" height="29">
        <label>groundpainting_hotstar</label>
      </box>
      <box top="174" left="205" width="56" height="24">
        <label>groundpainting_yesbank</label>
      </box>
      <box top="170" left="141" width="44" height="32">
        <label>groundpainting_vodafone</label>
      </box>
    </image>
 </images>
</dataset>

I tried the following two options:

import xml.etree.ElementTree as ET
tree = ET.parse("finalxml.xml")
container = tree.find("images")
data = []
for elem in container:
    key = elem.findtext("image")
    data.append((key,elem))
data.sort()
container[:] = [item[-1] for item in data]
tree.write("new-data.xml")

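This code only rearranges the box attributes rather than ordering the images by their file attribute, which is not what I want. The following is something I took from SO, but it did nothing.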

# =======================================================================
# Monkey patch ElementTree
import xml.etree.ElementTree as ET

def _serialize_xml(write, elem, encoding, qnames, namespaces):
    tag = elem.tag
    text = elem.text
    if tag is ET.Comment:
        write("<!--%s-->" % ET._encode(text, encoding))
    elif tag is ET.ProcessingInstruction:
        write("<?%s?>" % ET._encode(text, encoding))
    else:
        tag = qnames[tag]
        if tag is None:
            if text:
                write(ET._escape_cdata(text, encoding))
            for e in elem:
                _serialize_xml(write, e, encoding, qnames, None)
        else:
            write("<" + tag)
            items = elem.items()
            if items or namespaces:
                if namespaces:
                    for v, k in sorted(namespaces.items(),
                                       key=lambda x: x[1]):  # sort on prefix
                        if k:
                            k = ":" + k
                        write(" xmlns%s=\"%s\"" % (
                            k.encode(encoding),
                            ET._escape_attrib(v, encoding)
                            ))
                #for k, v in sorted(items):  # lexical order
                for k, v in items: # Monkey patch
                    if isinstance(k, ET.QName):
                        k = k.text
                    if isinstance(v, ET.QName):
                        v = qnames[v.text]
                    else:
                        v = ET._escape_attrib(v, encoding)
                    write(" %s=\"%s\"" % (qnames[k], v))
            if text or len(elem):
                write(">")
                if text:
                    write(ET._escape_cdata(text, encoding))
                for e in elem:
                    _serialize_xml(write, e, encoding, qnames, None)
                write("</" + tag + ">")
            else:
                write(" />")
    if elem.tail:
        write(ET._escape_cdata(elem.tail, encoding))

ET._serialize_xml = _serialize_xml

from collections import OrderedDict

class OrderedXMLTreeBuilder(ET.XMLTreeBuilder):
    def _start_list(self, tag, attrib_in):
        fixname = self._fixname
        tag = fixname(tag)
        attrib = OrderedDict()
        if attrib_in:
            for i in range(0, len(attrib_in), 2):
                attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
        return self._target.start(tag, attrib)


tree = ET.parse("example1.xml", OrderedXMLTreeBuilder())
tree.write("new-data.xml")

How can I sort the XML?

1 answer:

Answer 0 (score: 1)

Use the key named parameter of list.sort, with the file attribute of each <image> tag as the sort key:

  

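key specifies a function of one argument that is used to extract a comparison key from each list element (for example, key=str.lower). The key corresponding to each item in the list is calculated once and then used for the entire sorting process. The default value of None means that list items are sorted directly without calculating a separate key value.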
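To see the key parameter in isolation, here is a minimal sketch that does not involve XML at all. The dictionaries are hypothetical stand-ins for the <image> elements (the name records and the boxes field are invented for illustration); each one carries a file path from the question and the number of <box> children that image has.

```python
# Hypothetical stand-ins for the <image> elements: "file" mirrors the file
# attribute and "boxes" is the number of <box> children in the sample document.
records = [
    {"file": "/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg", "boxes": 3},
    {"file": "/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg", "boxes": 0},
    {"file": "/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg", "boxes": 1},
]

# The key function is called once per item and only decides the order;
# the items themselves (with everything they contain) are what get moved.
records.sort(key=lambda record: record["file"])

sorted_box_counts = [record["boxes"] for record in records]
print(sorted_box_counts)
```

The comparison is a plain string comparison, which gives the expected order here because the numeric suffixes are zero-padded. The same idea applies to the Element objects in the code that follows: sorting by the file attribute moves each <image> together with its <box> children.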

import xml.etree.ElementTree

xml_string = r'''<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <name>imglab dataset</name>
  <comment>Created by imglab tool.</comment>
  <images>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
      <box top="175" left="59" width="73" height="29">
        <label>groundpainting_hotstar</label>
      </box>
      <box top="174" left="205" width="56" height="24">
        <label>groundpainting_yesbank</label>
      </box>
      <box top="170" left="141" width="44" height="32">
        <label>groundpainting_vodafone</label>
      </box>
    </image>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
      <box top="198" left="17" width="32" height="10">
        <label>sightscreen_pepsi</label>
      </box>
    </image>
 </images>
</dataset>'''

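root = xml.etree.ElementTree.fromstring(xml_string)
images_root = root.find('images')
images = images_root.findall('image')
# Sort the <image> elements by their file attribute; each element keeps its <box> children.
images.sort(key=lambda x: x.attrib['file'])
# Replace the children of <images> with the sorted list.
images_root[:] = images

print(xml.etree.ElementTree.tostring(root))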

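An alternative solution using lxml, based on another answer which points out that lxml serializes attributes in the order they were set (unlike the standard-library xml package):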

import lxml.etree

xml_string = r'''<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <name>imglab dataset</name>
  <comment>Created by imglab tool.</comment>
  <images>
    <text>lol</text>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
      <box top="175" left="59" width="73" height="29">
        <label>groundpainting_hotstar</label>
      </box>
      <box top="174" left="205" width="56" height="24">
        <label>groundpainting_yesbank</label>
      </box>
      <box top="170" left="141" width="44" height="32">
        <label>groundpainting_vodafone</label>
      </box>
    </image>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
      <box top="198" left="17" width="32" height="10">
        <label>sightscreen_pepsi</label>
      </box>
    </image>
 </images>
</dataset>'''

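root = lxml.etree.fromstring(xml_string)
images_root = root.find('images')
images = images_root.findall('image')
# Sort the <image> elements by their file attribute.
images.sort(key=lambda x: x.attrib['file'])
# Slice assignment replaces every child of <images>, so the <text> element is dropped.
images_root[:] = images

print(lxml.etree.tostring(root))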

Note: this removes any children (direct descendants) of <images> that are not <image> elements.
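If the other children of <images> need to survive, one possible workaround is to sort only the positions that the <image> elements occupy and leave everything else where it is. The following is a sketch under that assumption, written against xml.etree.ElementTree; the helper name sort_children_in_place and the small sample document are made up for illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET


def sort_children_in_place(parent, tag, key):
    """Sort only the children named `tag`; other children keep their positions."""
    # Indexes currently occupied by the elements we want to sort.
    slots = [i for i, child in enumerate(parent) if child.tag == tag]
    # Build the sorted list first, then write it back into the same slots.
    ordered = sorted((parent[i] for i in slots), key=key)
    for i, elem in zip(slots, ordered):
        parent[i] = elem


# Hypothetical sample: a <text> and a <note> element mixed in with the images.
sample = (
    "<images>"
    "<text>lol</text>"
    "<image file='00003.jpg'/>"
    "<image file='00001.jpg'/>"
    "<note/>"
    "<image file='00002.jpg'/>"
    "</images>"
)
images_root = ET.fromstring(sample)
sort_children_in_place(images_root, "image", key=lambda e: e.get("file"))

print([child.tag for child in images_root])
print([image.get("file") for image in images_root.findall("image")])
```

Each element's tail text (the whitespace after its closing tag) moves with the element, so the indentation of the written file may shift slightly after sorting.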