<Database>
<BlogPost>
<Date>MM/DD/YY</Date>
<Author>Last Name, Name</Author>
<Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
</BlogPost>
<BlogPost>
<Date>MM/DD/YY</Date>
<Author>Last Name, Name</Author>
<Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
</BlogPost>
[...]
<BlogPost>
<Date>MM/DD/YY</Date>
<Author>Last Name, Name</Author>
<Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
</BlogPost>
</Database>
The file text.xml is over 15 GB, and I want to split it into smaller files, one chunk per <BlogPost>…</BlogPost> element.
Here is my attempt, but it runs for a very long time (more than 5 minutes) with no output. Any ideas whether I am doing something fundamentally wrong here?
from lxml import etree

def fast_iter(context, func):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem)
        # Free the element and any fully parsed siblings before it,
        # so memory stays flat while streaming through the file.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    print(etree.tostring(elem))

xmlFile = r'D:\Test\Test\text.xml'
context = etree.iterparse(xmlFile, tag='BlogPost')
fast_iter(context, process_element)
I watched my IPython process consume more than 2 GB of memory before it finally reported that the last line of my XML file is invalid. That makes me wonder: even if my XML file does have an extra line at the end, shouldn't the file still be parsed incrementally?