Question

I'm trying to process a series of large XML files (about 3 GB each). The rough format of the XML is:

<FILE>
<DOC>
    <FIELD1>
        Some text.
    </FIELD1>
    <FIELD2>
        Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
        Some text.
    </FIELD3>
    <FIELD4>
        Some text. Etc.
    </FIELD4>
</DOC>
<DOC>
    <FIELD1>
        Some text.
    </FIELD1>
    <FIELD2>
        Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
        Some text.
    </FIELD3>
    <FIELD4>
        Some text. Etc.
    </FIELD4>
</DOC>
</FILE>

My current approach is (modeled on the code at http://effbot.org/zone/element-iterparse.htm#incremental-parsing):

#Added this in the edit.
import xml.etree.ElementTree as ET

tree = ET.iterparse(xml_file)
tree = iter(tree)
event, root = next(tree)

for event, elem in tree:
    #Need to find the <DOC> elements
    if event == "end" and elem.tag == "DOC":
        #Code to process the fields within the <DOC> element. 
        #The code here mainly just iterates through the inner 
        #elements and extracts what I need
        root.clear()

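However, this blows up and uses all of my system memory (16 GB). At first I thought the problem was the position of root.clear(), so I tried moving it to after the if statement, but that had no effect. Given this, I'm not sure how to proceed other than "get more memory".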

Edit:

Removed the previous edit because it was wrong.

Answer 1

I think you can use the code you've already written if you switch to lxml and do this to clear the tree...

from lxml import etree
context = etree.iterparse(xmlfile)  # can also limit to certain events and tags
for event, elem in context:
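    # do some stuff here with elem
    # free the element's own text, attributes and children
    elem.clear()
    # delete earlier siblings that the parent still holds references to
    while elem.getprevious() is not None:
        del elem.getparent()[0]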

I'm not sure this is efficient, but it may get the job done.
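
For comparison, here is a minimal sketch that stays with the standard library. It rests on one observation about the code in the question: when iterparse is called without an events argument, it reports only "end" events. The first pair it returns therefore holds the first element to finish (here the first <FIELD1>), not <FILE>, and calling clear() on it never empties the real root. Requesting "start" events as well makes the first pair the opening of the root element. The helper name iter_docs and the tiny in-memory sample are made up for illustration and are not part of the original post.

```python
import io
import xml.etree.ElementTree as ET


def iter_docs(source, tag="DOC"):
    """Yield each completed <DOC> element, then release it from memory.

    iter_docs is a hypothetical helper name, not from the original post.
    """
    # Ask for "start" events too, so the very first event is the start of
    # the root element (<FILE>) rather than the end of some inner element.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # The caller processes the element while the generator is paused.
            yield elem
            # Drop the finished <DOC> and everything under it from the root.
            root.clear()


# Small in-memory stand-in for the real 3 GB file (invented sample data).
sample = b"<FILE><DOC><FIELD1>a</FIELD1></DOC><DOC><FIELD1>b</FIELD1></DOC></FILE>"
texts = [doc.findtext("FIELD1") for doc in iter_docs(io.BytesIO(sample))]
print(texts)
```

Each yielded element has to be fully processed before the loop asks for the next one, because root.clear() runs as soon as the generator resumes. Whether this alone keeps memory flat on the real files is something to measure rather than assume.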
