I'm trying to process a series of large XML files (about 3 GB each). The rough format of the XML is:
<FILE>
<DOC>
<FIELD1>
Some text.
</FIELD1>
<FIELD2>
Some text. Probably some more fields nested within this one.
</FIELD2>
<FIELD3>
Some text.
</FIELD3>
<FIELD4>
Some text. Etc.
</FIELD4>
</DOC>
<DOC>
<FIELD1>
Some text.
</FIELD1>
<FIELD2>
Some text. Probably some more fields nested within this one.
</FIELD2>
<FIELD3>
Some text.
</FIELD3>
<FIELD4>
Some text. Etc.
</FIELD4>
</DOC>
</FILE>
My current approach is (modeled on the code at http://effbot.org/zone/element-iterparse.htm#incremental-parsing):
#Added this in the edit.
import xml.etree.ElementTree as ET
tree = ET.iterparse(xml_file)
tree = iter(tree)
event, root = tree.next()
for event, elem in tree:
    #Need to find the <DOC> elements
    if event == "end" and elem.tag == "DOC":
        #Code to process the fields within the <DOC> element.
        #The code here mainly just iterates through the inner
        #elements and extracts what I need
        root.clear()
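However, this blows up and uses all of my system memory (16 GB). At first I thought the problem was the placement of root.clear(), so I tried moving it to after the if statement, but that didn't seem to have any effect. Given this, I'm not sure how to proceed other than "get more memory".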
EDIT:
Deleted the previous edit because it was wrong.
Answer 0 (score: 4)
I think you can keep the code you've already written if you switch to lxml and do this to clear the tree...
from lxml import etree
context = etree.iterparse(xmlfile) # can also limit to certain events and tags
for event, elem in context:
    # do some stuff here with elem
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
I don't know that this is efficient, but it might do the job.
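If you'd rather stay with the standard library, here is a minimal sketch. It assumes the memory growth comes from `root` in the question's code not actually being the root element. With the default events, `iterparse` only reports "end" events, so the first item it yields is the first element that closes (here the first `<FIELD1>`), not `<FILE>`, and `root.clear()` never frees the processed `<DOC>` elements. Asking for "start" events as well makes the first item the real root. The names `iter_docs` and `sample` are made up for illustration, and the tiny in-memory document stands in for one of the 3 GB files.

```python
import io
import xml.etree.ElementTree as ET


def iter_docs(source, tag="DOC"):
    """Yield each completed element with the given tag, then free it."""
    # Ask for "start" events too, so the very first event carries the root.
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Runs when the caller asks for the next item: drop every
            # child the root has accumulated so far.
            root.clear()


# Hypothetical stand-in for a large file; a filename works as well.
sample = b"<FILE><DOC><FIELD1>a</FIELD1></DOC><DOC><FIELD1>b</FIELD1></DOC></FILE>"
texts = [doc.findtext("FIELD1") for doc in iter_docs(io.BytesIO(sample))]
```

Each `<DOC>` has to be fully processed before the generator is advanced, because the element is emptied as soon as the next one is requested.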