This works for files up to roughly 600 MB; anything larger and I run out of memory (I have a 16 GB machine). What can I do to read the file in chunks, or read a fraction of the XML at a time, or is there a less memory-intensive approach?
import csv
import xml.etree.ElementTree as ET
from lxml import etree
import time
import sys

def main(argv):
    start_time = time.time()
    #file_name = 'sample.xml'
    file_name = argv
    root = ET.ElementTree(file=file_name).getroot()
    csv_file_name = '.'.join(file_name.split('.')[:-1]) + ".txt"
    print '\n'
    print 'Output file:'
    print csv_file_name
    with open(csv_file_name, 'w') as file_:
        writer = csv.writer(file_, delimiter="\t")
        header = [ <the names of the tags here> ]
        writer.writerow(header)
        tags = [
            <bunch of xml tags here>
        ]
        #write the values
        # for index in range(8,1000):
        for index in range(3, len(root)):
            #print index
            row = []
            for tagindex, val in enumerate(tags):
                searchQuery = "tags" + tags[tagindex]
                # print searchQuery
                # print root[index]
                # print root[index].find(searchQuery).text
                if (root[index].find(searchQuery) is None) or (root[index].find(searchQuery).text is None):
                    row.extend([""])
                    #print tags[tagindex]+" blank"
                else:
                    row.extend([root[index].find(searchQuery).text])
                    #print tags[tagindex]+" "+root[index].find(searchQuery).text
            writer.writerow(row)
        #for i, child in enumerate(root):
        #    print root[i]
    print '\nNumber of elements is: %s' % len(root)
    print '\nTotal run time: %s seconds' % (time.time() - start_time)

if __name__ == "__main__":
    main(sys.argv[1])
Answer 0 (score: 1)
Use ElementTree.iterparse to parse the XML data incrementally. See the documentation for details.
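A minimal sketch of what that could look like for your CSV-writing loop, assuming each record is a direct child of the root; the tag name 'record', the field list, and the file names here are placeholders, not taken from the question:

import csv
import xml.etree.ElementTree as ET

RECORD_TAG = 'record'                  # hypothetical record tag -- use your own
FIELDS = ['title', 'author', 'year']   # hypothetical field tags -- use your own

with open('output.txt', 'w') as out:
    writer = csv.writer(out, delimiter='\t')
    writer.writerow(FIELDS)
    # iterparse yields each element as its closing tag is read,
    # so the whole tree never has to be in memory at once.
    for event, elem in ET.iterparse('large.xml', events=('end',)):
        if elem.tag == RECORD_TAG:
            row = []
            for field in FIELDS:
                child = elem.find(field)
                row.append(child.text if child is not None and child.text is not None else '')
            writer.writerow(row)
            elem.clear()   # release the finished record so memory stays roughly flat

The elem.clear() call is what keeps memory usage low; without it the tree built behind the scenes still grows toward the full file size.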
Answer 1 (score: 1)
A few hints:

- Use lxml, it is very efficient.
- iterparse can process your document piece by piece.

However, iterparse can surprise you and you may still end up with excessive memory consumption. To avoid that, you have to clear references to the items you have already processed, as described in my favourite article on effective lxml usage.
fastiterparse.py using iterparse

Install docopt and lxml:

$ pip install lxml docopt

Write the script:
"""For all elements with given tag prints value of selected attribute
Usage:
fastiterparse.py <xmlfile> <tag> <attname>
fastiterparse.py -h
"""
from lxml import etree
from functools import partial
def fast_iter(context, func):
for event, elem in context:
func(elem)
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
del context
def printattname(elem, attname):
print elem.attrib[attname]
def main(fname, tag, attname):
fun = partial(printattname, attname=attname)
with open(fname) as f:
context = etree.iterparse(f, events=("end",), tag=tag)
fast_iter(context, fun)
if __name__ == "__main__":
from docopt import docopt
args = docopt(__doc__)
main(args["<xmlfile>"], args["<tag>"], args["<attname>"])
Try calling it:
$ python fastiterparse.py
Usage:
fastiterparse.py <xmlfile> <tag> <attname>
fastiterparse.py -h
Use it (on your file):
$ python fastiterparse.py large.xml ElaboratedRecord id
rec26872
rec25887
rec26873
rec26874
Conclusions (on the fast_iter approach): the main point is the fast_iter function, or at least remembering to clear elements once they have been processed, delete them, and finally delete the context.

Measurements show that in some cases the script runs slightly slower than it would without the clear and del calls, but the difference is not significant. The advantage appears when memory is the limiting factor: once an unoptimized version starts swapping, the optimized one becomes much faster, and if memory runs out entirely there are not many other options anyway.
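To connect this back to the question, here is a hedged sketch of how the same fast_iter pattern could drive the original tab-separated output; the record tag, the field list, and the file names are placeholder assumptions, not values from the question:

import csv
from lxml import etree

RECORD_TAG = 'record'                  # hypothetical record tag -- use your own
TAGS = ['title', 'author', 'year']     # hypothetical field tags -- use your own

def write_row(elem, writer):
    row = []
    for tag in TAGS:
        child = elem.find(tag)
        row.append(child.text if child is not None and child.text is not None else '')
    writer.writerow(row)
    # Free the element and its already-processed siblings, as in fast_iter.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

with open('output.txt', 'w') as out:
    writer = csv.writer(out, delimiter='\t')
    writer.writerow(TAGS)
    context = etree.iterparse('large.xml', events=('end',), tag=RECORD_TAG)
    for event, elem in context:
        write_row(elem, writer)
    del context

The tag= argument to lxml's iterparse filters events to the record elements, so the loop body only sees complete records.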
Answer 2 (score: 1)
Use cElementTree instead of ElementTree.

Replace your ET import statement with: import xml.etree.cElementTree as ET
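Since cElementTree exposes the same API, the rest of the script stays the same; a minimal sketch of the one-line change (Python 2, where the C accelerator is a separate module):

# Before:
# import xml.etree.ElementTree as ET
# After: same API, C implementation, faster parsing with less overhead.
import xml.etree.cElementTree as ET

root = ET.ElementTree(file='large.xml').getroot()   # call sites unchanged

This reduces parsing overhead, but it still builds the full tree, so for very large files it combines best with the iterparse approaches from the other answers.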