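I've been trying to use iterparse to reduce the memory footprint of my scripts that need to process large XML documents. Here's an example. I wrote this simple script to read a TMX file and split it into one or more output files, none of which exceeds a user-specified size. Despite using iterparse, when I split an 886 MB file into 100 MB files, the script eats up all available memory (it slows to a crawl using 6.5 of my 8 GB).
Am I doing something wrong? Why is the memory usage so high?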
#! /usr/bin/python
# -*- coding: utf-8 -*-
import argparse
import codecs
from xml.etree.ElementTree import iterparse, tostring
from sys import getsizeof

def startNewOutfile(infile, i, roottxt, headertxt):
    # Open the i-th output file and write everything up to the opening <body>.
    out = open(infile.replace('tmx', str(i) + '.tmx'), 'w')
    print >>out, '<?xml version="1.0" encoding="UTF-8"?>'
    print >>out, '<!DOCTYPE tmx SYSTEM "tmx14.dtd">'
    print >>out, roottxt
    print >>out, headertxt
    print >>out, '<body>'
    return out

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--maxsize', dest='maxsize', required=True, type=float, help='max size (in MB) of output files')
    parser.add_argument(dest='infile', help='.tmx file to be split')
    args = parser.parse_args()
    maxsize = args.maxsize * 1024 * 1024

    nodes = iter(iterparse(args.infile, events=['start','end']))
    # The first two events are the 'start' events of the root and the header.
    _, root = next(nodes)
    _, header = next(nodes)
    roottxt = tostring(root).strip()
    headertxt = tostring(header).strip()

    i = 1
    # getsizeof() includes Python's per-object overhead, so sizes are approximate.
    curr_size = getsizeof(roottxt) + getsizeof(headertxt)
    out = startNewOutfile(args.infile, i, roottxt, headertxt)

    for event, node in nodes:
        if event =='end' and node.tag == 'tu':
            nodetxt = tostring(node, encoding='utf-8').strip()
            curr_size += getsizeof(nodetxt)
            print >>out, nodetxt
        if curr_size > maxsize:
            # Close the current file and start the next one.
            curr_size = getsizeof(roottxt) + getsizeof(headertxt)
            print >>out, '</body>'
            print >>out, '</tmx>'
            out.close()
            i += 1
            out = startNewOutfile(args.infile, i, roottxt, headertxt)
        root.clear()

    print >>out, '</body>'
    print >>out, '</tmx>'
    out.close()
Answer 0 (score: 5)
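Found the answer in a related question: Why is elementtree.ElementTree.iterparse using so much memory?
In each iteration of the for loop, you need not only root.clear() but also node.clear(). Because we're handling both start and end events, though, we need to be careful not to clear the tu nodes too soon: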
for e, node in nodes:
    if e == 'end' and node.tag == 'tu':
        nodetxt = tostring(node, encoding='utf-8').strip()
        curr_size += getsizeof(nodetxt)
        print >>out, nodetxt
        # Only clear a tu once its 'end' event has been seen and it is written.
        node.clear()
    if curr_size > maxsize:
        curr_size = getsizeof(roottxt) + getsizeof(headertxt)
        print >>out, '</body>'
        print >>out, '</tmx>'
        out.close()
        i += 1
        out = startNewOutfile(args.infile, i, roottxt, headertxt)
    root.clear()
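
For anyone reading this on Python 3, here is a minimal sketch of the same clearing pattern, pulled out into a generator so it can be tried on a small in-memory document. It is an illustration rather than a drop-in replacement for the script above: the names iter_tu_strings and SAMPLE are made up for this example, and the sample document is an assumption about what a tiny TMX-like file looks like.

```python
import io
from xml.etree.ElementTree import iterparse, tostring

# Hypothetical miniature TMX-like document, used only for this illustration.
SAMPLE = b"""<tmx version="1.4"><header/><body>
<tu><tuv><seg>hello</seg></tuv></tu>
<tu><tuv><seg>world</seg></tuv></tu>
</body></tmx>"""


def iter_tu_strings(source):
    """Yield each finished <tu> as a string, clearing it afterwards.

    `source` is a filename or a binary file object, as accepted by iterparse.
    """
    nodes = iter(iterparse(source, events=("start", "end")))
    _, root = next(nodes)  # the first 'start' event belongs to the root
    for event, node in nodes:
        if event == "end" and node.tag == "tu":
            # The element is complete only at its 'end' event, so serialize here.
            yield tostring(node, encoding="unicode").strip()
            node.clear()  # drop the finished tu's children and text
        root.clear()  # detach whatever is still attached to the root


if __name__ == "__main__":
    for text in iter_tu_strings(io.BytesIO(SAMPLE)):
        print(text)
```

The caller decides what to do with each string (for example, write it to the current output file and track its size), while the generator takes care of clearing each tu only after its 'end' event.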