Question

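I am trying to parse huge XML files ranging from 20MB to 3GB. The files are samples coming from different instruments. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).

Below is a small part of one of my sample files. The namespace exists in all files. An interesting feature of the files is that they have more node attributes than text.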

<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">

    <instrumentConfiguration id="QTOF">
                    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
                    <componentList count="4">
                            <source order="1">
                                    <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
                            </source>
                            <analyzer order="2">
                                    <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
                            </analyzer>
                            <analyzer order="3">
                                    <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
                            </analyzer>
                            <detector order="4">
                                    <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
                            </detector>
                    </componentList>
     </instrumentConfiguration>

A small but complete file is here

So what I have done so far is use findall for every element of interest.

import xml.etree.ElementTree as ET
tree = ET.parse('plgs_example.mzML')
root = tree.getroot()
NS = "{http://psi.hupo.org/ms/mzml}"
s = tree.findall('.//' + NS + 'instrumentConfiguration')
for ins in s:
    # Print the id attribute of each instrument
    print(ins.attrib["id"])

How can I access all children/grandchildren of the instrumentConfiguration element(s)?

s=tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')

An example of what I want

InstrumentConfiguration
-----------------------
Id:QTOF
Parameter1: Q-Tof ultima
source:nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector

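Is there an efficient way to parse elements, sub-elements, and sub-sub-elements when a namespace is present? Or do I have to use find/findall every time to access a particular element in a tree with a namespace? This is just a small example; I have to parse a more complex element hierarchy.

Any suggestions?

Edit

I didn't get a correct answer, so I had to edit again!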

Answer 1

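Here is a script that parses one million <instrumentConfiguration/> elements (a 967MB file) in 40 seconds (on my machine) without consuming a large amount of memory.

The throughput is 24MB/s. The cElementTree page (2005) reports 47MB/s.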

#!/usr/bin/env python
from itertools import imap, islice, izip
from operator  import itemgetter
from xml.etree import cElementTree as etree

def parsexml(filename):
    it = imap(itemgetter(1),
              iter(etree.iterparse(filename, events=('start',))))
    root = next(it) # get root element
    for elem in it:
        if elem.tag == '{http://psi.hupo.org/ms/mzml}instrumentConfiguration':
            values = [('Id', elem.get('id')),
                      ('Parameter1', next(it).get('name'))] # cvParam
            componentList_count = int(next(it).get('count'))
            for parent, child in islice(izip(it, it), componentList_count):
                key = parent.tag.partition('}')[2]
                value = child.get('name')
                assert child.tag.endswith('cvParam')
                values.append((key, value))
            yield values
            root.clear() # preserve memory

def print_values(it):
    for line in (': '.join(val) for conf in it for val in conf):
        print(line)

# `filename` must hold the path to the mzML file before this call
print_values(parsexml(filename))

Output

$ /usr/bin/time python parse_mxml.py
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
38.51user 1.16system 0:40.09elapsed 98%CPU (0avgtext+0avgdata 23360maxresident)k
1984784inputs+0outputs (2major+1634minor)pagefaults 0swaps

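Note: the code is fragile. It assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that all values are available as tag names or attributes.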
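For comparison, here is a minimal sketch of a less order-dependent variant. It is not the script above: it is written for Python 3, listens for 'end' events so that each <instrumentConfiguration/> subtree is complete when it is inspected, and looks children up by tag instead of by position. The helper names (local_name, parse_instrument_configurations) and the inline SAMPLE are made up for illustration. It clears only the matched subtree, so it does not reproduce the memory behaviour measured above.

```python
import io
import xml.etree.ElementTree as ET

NS = '{http://psi.hupo.org/ms/mzml}'

# Inline copy of the fragment from the question, used only for illustration.
SAMPLE = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
      <analyzer order="3">
        <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
      </analyzer>
      <detector order="4">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
</mzML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix from a qualified tag name."""
    return tag.rpartition('}')[2]


def parse_instrument_configurations(source):
    """Yield a list of (key, value) pairs per <instrumentConfiguration/>."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != NS + 'instrumentConfiguration':
            continue
        values = [('Id', elem.get('id'))]
        # Direct <cvParam/> children, wherever they sit among the children.
        for number, param in enumerate(elem.findall(NS + 'cvParam'), start=1):
            values.append(('Parameter%d' % number, param.get('name')))
        # Every component (source, analyzer, detector) may hold several cvParams.
        for component in elem.findall(NS + 'componentList/*'):
            for param in component.findall(NS + 'cvParam'):
                values.append((local_name(component.tag), param.get('name')))
        yield values
        elem.clear()  # drop the subtree that has just been processed


for conf in parse_instrument_configurations(io.BytesIO(SAMPLE)):
    for key, value in conf:
        print('%s: %s' % (key, value))
```

Because the lookups are by tag, an extra child before <cvParam/> or a component with two <cvParam/> entries would not break the loop, at the cost of building each configuration subtree before reading it.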

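On performance

In this case ElementTree 1.3 is ~6 times slower than cElementTree 1.0.6.

If you replace root.clear() with elem.clear(), the code is ~10% faster but uses ~10 times more memory. lxml.etree works with the elem.clear() variant; its performance is the same as cElementTree's, but it consumes 20 (root.clear()) / 2 (elem.clear()) times as much memory (500MB).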

Answer 2

If this is still a current issue, you could try pymzML, a Python interface to mzML files. Website: http://pymzml.github.com/

Answer 3

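In this case I would use findall to find all the instrumentList elements. Then, on those results, just access the data as if instrumentList and instrument were arrays: you get all the elements without having to search for each of them.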
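A minimal sketch of that idea, assuming the file is small enough to load whole and using the instrumentConfiguration element from the question's sample. The 'ms' prefix, the describe helper, and the shortened SAMPLE are made up for illustration; the namespaces mapping passed to find/findall saves repeating the full URI in every path.

```python
import xml.etree.ElementTree as ET

# 'ms' is an arbitrary prefix chosen for this sketch; only the URI has to match.
NAMESPACES = {'ms': 'http://psi.hupo.org/ms/mzml'}

# Shortened copy of the fragment from the question, for illustration only.
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
    </componentList>
  </instrumentConfiguration>
</mzML>"""


def describe(root):
    """Return (id, [(component tag, cvParam name), ...]) per configuration."""
    result = []
    for conf in root.findall('.//ms:instrumentConfiguration', NAMESPACES):
        components = conf.find('ms:componentList', NAMESPACES)
        pairs = []
        # An Element behaves like a list of its children: iterate, len(), index.
        for component in components:
            param = component[0]  # first child, a <cvParam/> in this sample
            pairs.append((component.tag.rpartition('}')[2], param.get('name')))
        result.append((conf.get('id'), pairs))
    return result


print(describe(ET.fromstring(SAMPLE)))
```

This builds the whole tree in memory, so it suits the small files better than the multi-gigabyte ones.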

Answer 4

If your files are huge, take a look at the iterparse() function. Be sure to read this article by the author of elementtree, especially the part about "incremental parsing".
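A rough sketch of the incremental-parsing pattern, written for Python 3: take the root element from the first 'start' event, handle each element of interest at its 'end' event, then clear the root so already-processed content can be freed. The iter_elements helper and the tiny SAMPLE document (including its ids) are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

NS = '{http://psi.hupo.org/ms/mzml}'

# Tiny made-up document; the ids are placeholders.
SAMPLE = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="A"><cvParam name="first"/></instrumentConfiguration>
  <instrumentConfiguration id="B"><cvParam name="second"/></instrumentConfiguration>
</mzML>"""


def iter_elements(source, tag):
    """Yield each completed element named `tag`, then release parsed content."""
    context = ET.iterparse(source, events=('start', 'end'))
    _event, root = next(context)  # the first 'start' event carries the root
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem    # the caller must read what it needs right away
            root.clear()  # then everything parsed so far is discarded


for elem in iter_elements(io.BytesIO(SAMPLE), NS + 'instrumentConfiguration'):
    print(elem.get('id'), elem.find(NS + 'cvParam').get('name'))
```

The yielded element is emptied as soon as the loop asks for the next one, so anything needed later has to be copied out inside the loop body.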

Answer 5

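I know this is old, but I ran into this question while doing XML parsing on files that were really large.

J.F. Sebastian's answer is indeed correct, but the following issue came up.

I noticed that when iterating over 'start' events, the value of elem.text (if your values are inside the XML elements rather than in attributes) is sometimes not read correctly (it sometimes returns None). I had to iterate over 'end' events instead, like this: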

it = imap(itemgetter(1),
          iter(etree.iterparse(filename, events=('end',))))
root = next(it) # note: with 'end' events this is the first element to close, not the root

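If someone wants the text inside the XML tags (rather than attributes), they should probably iterate over 'end' events rather than 'start' events.

However, if all the values are in attributes, then the code in J.F. Sebastian's answer is more correct.

An XML example for my case: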

<data>
<country>
    <name>Liechtenstein</name>
    <rank>1</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
</country>
<country>
    <name>Singapore</name>
    <rank>4</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
</country>
<country>
    <name>Panama</name>
    <rank>68</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
</country>
</data>
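A small Python 3 sketch of reading this kind of data on 'end' events. At a 'start' event the parser has only seen the opening tag, so attributes are available but text is not guaranteed to be; at the 'end' event of <country> all of its children and their text have been parsed. The read_countries helper and the shortened SAMPLE are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the sample above, closed with </data> so it is well-formed.
SAMPLE = b"""<data>
<country>
    <name>Liechtenstein</name>
    <rank>1</rank>
</country>
<country>
    <name>Singapore</name>
    <rank>4</rank>
</country>
<country>
    <name>Panama</name>
    <rank>68</rank>
</country>
</data>"""


def read_countries(source):
    """Yield (name, rank) once each <country> element is fully parsed."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'country':
            # The child text is complete here because the element has closed.
            yield elem.findtext('name'), int(elem.findtext('rank'))
            elem.clear()  # free the children of the processed element


print(list(read_countries(io.BytesIO(SAMPLE))))
```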

Efficient way of XML parsing in ElementTree (1.3.0) Python
