Processing XML in chunks in Python

Time: 2014-01-04 21:49:08

Tags: python xml

I am trying to process a series of large XML files (about 3 GB each). The rough format of the XML is:

<FILE>
<DOC>
    <FIELD1>
        Some text.
    </FIELD1>
    <FIELD2>
        Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
        Some text.
    </FIELD3>
    <FIELD4>
        Some text. Etc.
    </FIELD4>
</DOC>
<DOC>
    <FIELD1>
        Some text.
    </FIELD1>
    <FIELD2>
        Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
        Some text.
    </FIELD3>
    <FIELD4>
        Some text. Etc.
    </FIELD4>
</DOC>
</FILE>

My current approach is (modeled on the code at http://effbot.org/zone/element-iterparse.htm#incremental-parsing):

#Added this in the edit.
import xml.etree.ElementTree as ET

tree = ET.iterparse(xml_file)
tree = iter(tree)
event, root = tree.next()

for event, elem in tree:
    #Need to find the <DOC> elements
    if event == "end" and elem.tag == "DOC":
        #Code to process the fields within the <DOC> element. 
        #The code here mainly just iterates through the inner 
        #elements and extracts what I need
        root.clear()

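However, this blows up and uses all of my system memory (16 GB). At first I thought the problem was the placement of root.clear(), so I tried moving it to after the if statement, but that did not seem to have any effect. Given this, I am not sure how to proceed, short of "getting more memory".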

Edit

Removed the previous edit because it was wrong.

1 answer:

Answer 0: (score: 4)

I think you can keep the code you have already written if you switch to lxml and clear the tree like this...

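from lxml import etree

# Only report the end of each <DOC>, so its fields are still intact when handled
context = etree.iterparse(xml_file, events=("end",), tag="DOC")
for event, elem in context:
    # do some stuff here with elem
    elem.clear()  # free the children of the processed <DOC>
    # also delete earlier siblings that the root still references
    while elem.getprevious() is not None:
        del elem.getparent()[0]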

I am not claiming this is efficient, but it may get the job done.
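
For comparison, here is a minimal sketch of the same idea using only the standard-library ElementTree from the question, in case switching to lxml is not an option. It rests on one observation: iterparse reports only "end" events by default, so the first event in the question's code is the end of the first FIELD1 rather than the start of FILE, and root.clear() then never touches the real root. Asking for "start" events as well makes the first event the root element. The helper name iter_docs and the tiny in-memory sample are hypothetical, chosen only for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_docs(source, tag="DOC"):
    """Yield each completed element named `tag`, then release it (illustrative helper)."""
    # Request "start" events too, so the very first event is the start of the root.
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Runs once the caller asks for the next item: detach everything
            # collected under the root so far, including the element just handled.
            root.clear()


# Tiny stand-in for a 3 GB file (made-up data, same shape as the question's XML).
sample = io.BytesIO(
    b"<FILE>"
    b"<DOC><FIELD1>a</FIELD1><FIELD2>b</FIELD2></DOC>"
    b"<DOC><FIELD1>c</FIELD1><FIELD2>d</FIELD2></DOC>"
    b"</FILE>"
)
texts = [doc.findtext("FIELD1") for doc in iter_docs(sample)]
print(texts)
```

Each yielded element should be fully processed before the next one is requested, because the root is cleared as soon as the generator resumes. Whether this keeps memory flat on the real files is something I have not measured.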