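I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I wanted to use iterparse in lxml to do it.
My file is of the format: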
<item>
<title>Item 1</title>
<desc>Description 1</desc>
</item>
<item>
<title>Item 2</title>
<desc>Description 2</desc>
</item>
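My solution so far is:
from lxml import etree

context = etree.iterparse(MYFILE, tag='item')
for event, elem in context:
    print(elem.xpath('desc/text()'))
del context
Unfortunately, this solution still eats up a lot of memory. I think the problem is that after dealing with each "item" I need to do something to clean up the empty children. Can anyone suggest what I should do after processing my data to clean up properly?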
Answer 0 (score: 53)
Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove its descendants and also removes the preceding siblings.
from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    print(elem.xpath('desc/text()'))


context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)
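Daly's article is an excellent read, especially if you are processing large XML files.
Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive about removing other elements that are no longer needed.
The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, thus saving more memory.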
import lxml.etree as ET
import textwrap
import io


def setup_ABC():
    content = textwrap.dedent('''\
      <root>
        <A1>
          <B1></B1>
          <C>1<D1></D1></C>
          <E1></E1>
        </A1>
        <A2>
          <B2></B2>
          <C>2<D></D></C>
          <E2></E2>
        </A2>
      </root>
        ''')
    return content


def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    # BytesIO needs bytes, so encode the str content (required on Python 3)
    context = ET.iterparse(
        io.BytesIO(content.encode('utf-8')), events=('end', ), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    """
    The improved fast_iter deletes A1. The original fast_iter does not.
    """
    content = setup_ABC()
    context = ET.iterparse(
        io.BytesIO(content.encode('utf-8')), events=('end', ), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2


study_fast_iter()
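Answer 1 (score: 4)
iterparse() lets you do things while the tree is being built. That means that unless you remove what you no longer need, you will still end up with the whole tree in memory at the end.
For more information, read this article by the author of the original ElementTree implementation (it also applies to lxml).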
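The point about the tree still being built can be checked with a small experiment. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml, and assumes the question's items are wrapped in a hypothetical <root> element; the helper name count_retained is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: three items like the question's, inside a <root>.
ITEMS = "".join(
    f"<item><title>Item {i}</title><desc>Description {i}</desc></item>"
    for i in range(1, 4)
)
XML = f"<root>{ITEMS}</root>".encode("utf-8")


def count_retained(clear_items):
    """Parse XML with iterparse and count the elements still attached to the root."""
    root = None
    for event, elem in ET.iterparse(io.BytesIO(XML), events=("start", "end")):
        if root is None:
            root = elem  # the first "start" event delivers the root element
        if clear_items and event == "end" and elem.tag == "item":
            elem.clear()  # drops the item's children, text and attributes
    return sum(1 for _ in root.iter())


retained_without_clear = count_retained(clear_items=False)
retained_with_clear = count_retained(clear_items=True)
print(retained_without_clear, retained_with_clear)
```

Without clearing, every element is still reachable from the root after parsing. With elem.clear(), the children of each item are gone, but the emptied item elements themselves remain attached to the root.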
Answer 2 (score: 1)
Why not use the "callback" approach of SAX?
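A minimal sketch of what the callback approach could look like with the standard library's xml.sax module, assuming the goal is to pull out the text of each <desc> element. The class name DescHandler, the on_desc callback and the in-memory sample are illustrative assumptions, not part of the original question.

```python
import io
import xml.sax


class DescHandler(xml.sax.ContentHandler):
    """Collects the text of every <desc> element via SAX callbacks."""

    def __init__(self, on_desc):
        super().__init__()
        self.on_desc = on_desc  # hypothetical callback, called once per <desc>
        self.in_desc = False
        self.parts = []

    def startElement(self, name, attrs):
        if name == "desc":
            self.in_desc = True
            self.parts = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks, so accumulate.
        if self.in_desc:
            self.parts.append(content)

    def endElement(self, name):
        if name == "desc":
            self.in_desc = False
            self.on_desc("".join(self.parts))


# Hypothetical sample: the question's items wrapped in a <root> element.
SAMPLE = (
    b"<root>"
    b"<item><title>Item 1</title><desc>Description 1</desc></item>"
    b"<item><title>Item 2</title><desc>Description 2</desc></item>"
    b"</root>"
)

descriptions = []
xml.sax.parse(io.BytesIO(SAMPLE), DescHandler(descriptions.append))
print(descriptions)
```

No tree is built here, so memory use should not grow with the file size. The trade-off is that the handler has to track its own state, such as which element it is currently inside.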
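Answer 3 (score: 1)
In my experience, iterparse with or without element.clear (see F. Lundh and L. Daly) cannot always cope with very large XML files: it runs fine for a while, then memory consumption suddenly goes through the roof and a memory error occurs or the system crashes. If you run into the same problem, maybe you can use the same solution: the expat parser. See also F. Lundh, or the following example, which uses the OP's XML snippet (plus two umlauts to check that there are no encoding issues):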
import xml.parsers.expat
from collections import deque


def iter_xml(inpath: str, outpath: str) -> None:
    def handle_cdata_end():
        nonlocal in_cdata
        in_cdata = False

    def handle_cdata_start():
        nonlocal in_cdata
        in_cdata = True

    def handle_data(data: str):
        nonlocal in_cdata
        if not in_cdata and open_tags and open_tags[-1] == 'desc':
            data = data.replace('\\', '\\\\').replace('\n', '\\n')
            outfile.write(data + '\n')

    def handle_endtag(tag: str):
        while open_tags:
            open_tag = open_tags.pop()
            if open_tag == tag:
                break

    def handle_starttag(tag: str, attrs: 'Dict[str, str]'):
        open_tags.append(tag)

    open_tags = deque()
    in_cdata = False
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = handle_data
    parser.EndCdataSectionHandler = handle_cdata_end
    parser.EndElementHandler = handle_endtag
    parser.StartCdataSectionHandler = handle_cdata_start
    parser.StartElementHandler = handle_starttag
    with open(inpath, 'rb') as infile:
        with open(outpath, 'w', encoding = 'utf-8') as outfile:
            parser.ParseFile(infile)


iter_xml('input.xml', 'output.txt')
input.xml:
<root>
<item>
<title>Item 1</title>
<desc>Description 1ä</desc>
</item>
<item>
<title>Item 2</title>
<desc>Description 2ü</desc>
</item>
</root>
output.txt:
Description 1ä
Description 2ü
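Answer 4 (score: 0)
Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you have processed them: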
for event, elem in iterparse(source):
    if elem.tag == "record":
        # ... process record elements ...
        elem.clear()
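The above pattern has one drawback: it does not clear the root element, so you end up with a single element that has lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around it, you need to get your hands on the root element. The easiest way to do this is to enable start events and save a reference to the first element in a variable: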
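# get an iterable
context = iterparse(source, events=("start", "end"))
# turn it into an iterator
context = iter(context)
# get the root element (next(context) replaces the Python 2 context.next())
event, root = next(context)
for event, elem in context:
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()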
So this is a question of incremental parsing. This link gives a detailed answer; for a summary, refer to the above.
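Applied to the question's data, the root-clearing pattern might look like the sketch below. It uses the standard library's xml.etree.ElementTree and assumes the item elements are direct children of a hypothetical <root> element; the generator name iter_descriptions is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_descriptions(source, tag="item"):
    """Yield the <desc> text of each `tag` element, clearing the root as it goes."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the "start" of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.findtext("desc")  # read the data before clearing
            root.clear()  # drop every child processed so far


# Hypothetical sample: the question's items wrapped in a <root> element.
SAMPLE = (
    b"<root>"
    b"<item><title>Item 1</title><desc>Description 1</desc></item>"
    b"<item><title>Item 2</title><desc>Description 2</desc></item>"
    b"</root>"
)

descs = list(iter_descriptions(io.BytesIO(SAMPLE)))
print(descs)
```

Writing it as a generator keeps only one item alive at a time, provided the caller does not hold on to the yielded values.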
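Answer 5 (score: 0)
The only problem with the root.clear() method is that it returns NoneTypes. This means you cannot, for instance, edit the data you parse with string methods such as replace() or title(). That said, it is an optimal method to use if you are just parsing the data as is.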
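One plausible reading of this remark is that an element's text becomes None once it has been cleared, so string methods called on it afterwards fail. If that is the issue, copying the text out before clearing avoids it. A minimal sketch with the standard library's xml.etree.ElementTree, using a made-up one-element sample:

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<desc>description 1</desc>")

text = elem.text  # copy the string out first
elem.clear()      # resets the element: text and tail become None

title_cased = text.title()    # the copied string is unaffected by clear()
text_after_clear = elem.text  # None, so elem.text.title() would raise AttributeError
print(title_cased, text_after_clear)
```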