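I want to use xml.etree.ElementTree.iterparse() to extract sections of an XML file. The file is 60 GB and 1 billion lines, so I don't want to load all of it into memory. I don't see a way in the xml library to output an entire subsection of the XML. I realize that iterparse is iterative and can probably only look forward. How can I do this?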
    from xml.etree.ElementTree import iterparse

    context = iterparse("file.xml", events=("start", "end"))
    for event, elem in context:
        if event == 'start':
            if elem.tag == 'page':
                # Splice out this subset of the XML, including tags.
                # Or, better, splice it only if `<title>` includes "Foo".
                pass  # placeholder: this is the part I don't know how to write
        else:
            elem.clear()
The XML looks roughly like this:
    <siteinfo>
      <page>
        <title>Foo</title>
        <text>Bar</text>
      </page>
      <page>
        <title>NotFoo</title>
        <text>NotBar</text>
      </page>
    </siteinfo>
Answer 0 (score: 0)
I tried something. It isn't exactly the output you want, but I'm sharing it in case it is useful to you.
The following code writes the output file:
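    from xml.etree import ElementTree as Et

    # Raw strings keep the backslashes in Windows paths literal.
    path = r'D:\data.xml'
    context = Et.iterparse(path, events=("start", "end"))
    root = None
    for event, elem in context:
        if root is None:
            root = elem  # the first "start" event yields the document root
        # elem.text is only guaranteed to be complete at the "end" event.
        if event == 'end' and elem.text == 'Foo':
            elem.clear()  # empties the matching element, e.g. <title>Foo</title>
    with open(r'd:\output.xml', 'wb') as file:
        Et.ElementTree(root).write(file, encoding='utf-8', xml_declaration=True)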
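
To get closer to what the question asks for, here is a minimal sketch that streams the file, serializes each matching `<page>` (tags included) with `ElementTree.tostring()`, and clears the root after every page so memory should stay roughly bounded. The function name `extract_pages`, the `keep` predicate, and the `<pages>` wrapper element are made up for illustration. It also assumes every `<page>` is a direct child of the root, as in the sample XML.

```python
import io
from xml.etree import ElementTree as ET


def extract_pages(source, dest, keep=lambda title: "Foo" in title):
    """Copy every <page> whose <title> satisfies `keep` from source to dest.

    source: a path or a binary file object; dest: a text file object.
    Returns the number of pages written.
    """
    written = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the "start" of the root
    dest.write("<pages>\n")  # hypothetical wrapper to keep output well-formed
    for event, elem in context:
        # A <page> is only complete (children and text parsed) at "end".
        if event == "end" and elem.tag == "page":
            if keep(elem.findtext("title", default="")):
                elem.tail = None  # do not copy whitespace that follows </page>
                dest.write(ET.tostring(elem, encoding="unicode"))
                dest.write("\n")
                written += 1
            # Detach finished pages from the root so they can be freed.
            root.clear()
    dest.write("</pages>\n")
    return written


if __name__ == "__main__":
    sample = (
        b"<siteinfo><page><title>Foo</title><text>Bar</text></page>"
        b"<page><title>NotFoo</title><text>NotBar</text></page></siteinfo>"
    )
    out = io.StringIO()
    count = extract_pages(io.BytesIO(sample), out, keep=lambda t: t == "Foo")
    print(count)
    print(out.getvalue())
```

For the real file you would pass the path and an open text file, for example `extract_pages("file.xml", open("out.xml", "w", encoding="utf-8"))`. Note that the default substring test also matches "NotFoo", since that title contains "Foo". Pass an exact comparison as `keep`, as in the demo, if only the first page is wanted.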