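lxml memory usage when parsing a huge XML file in Python

Time: 2011-06-20 23:07:11

Tags: python lxml

I am a Python newbie. I am trying to parse a huge XML file in my Python module using lxml. Even though I clear the elements at the end of each loop, my memory usage shoots up and crashes the application. I am sure I am missing something here. Please help me figure out what that is.

Following are the main functions I am using -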

from lxml import etree
def parseXml(context,attribList):
    for _, element in context:
        fieldMap={}
        rowList=[]
        readAttribs(element,fieldMap,attribList)
        readAllChildren(element,fieldMap,attribList,rowList)
        for row in rowList:
            yield row
        element.clear()

def readAttribs(element,fieldMap,attribList):
    for attrib in attribList:
        fieldMap[attrib]=element.get(attrib,'')

def readAllChildren(element,fieldMap,attribList,rowList):
    for childElem in element:
        readAttribs(childElem,fieldMap,attribList)
        if len(childElem) > 0:
           readAllChildren(childElem,fieldMap,attribList,rowList)
        rowList.append(fieldMap.copy())
        childElem.clear()

def main():
    attribList=['name','age','id']
    context=etree.iterparse(fullFilePath, events=("start",))
    for row in parseXml(context,attribList):
        print row

Thanks!

Sample XML and the nested dictionary -

<root xmlns='NS'>
        <Employee Name="Mr.ZZ" Age="30">
            <Experience TotalYears="10" StartDate="2000-01-01" EndDate="2010-12-12">
                    <Employment id = "1" EndTime="ABC" StartDate="2000-01-01" EndDate="2002-12-12">
                            <Project Name="ABC_1" Team="4">
                            </Project>
                    </Employment>
                    <Employment id = "2" EndTime="XYZ" StartDate="2003-01-01" EndDate="2010-12-12">
                        <PromotionStatus>Manager</PromotionStatus>
                            <Project Name="XYZ_1" Team="7">
                                <Award>Star Team Member</Award>
                            </Project>
                    </Employment>
            </Experience>
        </Employee>
</root>

ELEMENT_NAME='element_name'
ELEMENTS='elements'
ATTRIBUTES='attributes'
TEXT='text'
xmlDef={ 'namespace' : 'NS',
           'content' :
           { ELEMENT_NAME: 'Employee',
             ELEMENTS: [{ELEMENT_NAME: 'Experience',
                         ELEMENTS: [{ELEMENT_NAME: 'Employment',
                                     ELEMENTS: [{
                                                 ELEMENT_NAME: 'PromotionStatus',
                                                 ELEMENTS: [],
                                                 ATTRIBUTES:[],
                                                 TEXT:['PromotionStatus']
                                               },
                                               {
                                                 ELEMENT_NAME: 'Project',
                                                 ELEMENTS: [{
                                                            ELEMENT_NAME: 'Award',
                                                            ELEMENTS: {},
                                                            ATTRIBUTES:[],
                                                            TEXT:['Award']
                                                            }],
                                                 ATTRIBUTES:['Name','Team'],
                                                 TEXT:[]
                                               }],
                                     ATTRIBUTES: ['TotalYears','StartDate','EndDate'],
                                     TEXT:[]
                                    }],
                         ATTRIBUTES: ['TotalYears','StartDate','EndDate'],
                         TEXT:[]
                         }],
             ATTRIBUTES: ['Name','Age'],
             TEXT:[]
           }
         }

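1 answer:

Answer 0: (score: 15)

Welcome to Python and Stack Overflow!

It looks like you have followed some good advice in looking at lxml, and especially etree.iterparse(..), but I think your implementation approaches the problem from the wrong angle. The idea of iterparse(..) is to get away from collecting and storing data, and instead to process tags as they are read in. Your readAllChildren(..) function saves everything to rowList, which grows and grows until it covers the whole document tree. I made a few changes to show what is going on: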

from lxml import etree
def parseXml(context,attribList):
    for event, element in context:
        print "%s element: %s" % (event, element)
        fieldMap = {}
        rowList = []
        readAttribs(element, fieldMap, attribList)
        readAllChildren(element, fieldMap, attribList, rowList)
        for row in rowList:
            yield row
        element.clear()

def readAttribs(element, fieldMap, attribList):
    for attrib in attribList:
        fieldMap[attrib] = element.get(attrib,'')
    print "fieldMap:", fieldMap

def readAllChildren(element, fieldMap, attribList, rowList):
    for childElem in element:
        print "Found child:", childElem
        readAttribs(childElem, fieldMap, attribList)
        if len(childElem) > 0:
           readAllChildren(childElem, fieldMap, attribList, rowList)
        rowList.append(fieldMap.copy())
        print "len(rowList) =", len(rowList)
        childElem.clear()

def process_xml_original(xml_file):
    attribList=['name','age','id']
    context=etree.iterparse(xml_file, events=("start",))
    for row in parseXml(context,attribList):
        print "Row:", row

Running it on some dummy data:

>>> from cStringIO import StringIO
>>> test_xml = """\
... <family>
...     <person name="somebody" id="5" />
...     <person age="45" />
...     <person name="Grandma" age="62">
...         <child age="35" id="10" name="Mom">
...             <grandchild age="7 and 3/4" />
...             <grandchild id="12345" />
...         </child>
...     </person>
...     <something-completely-different />
... </family>
... """
>>> process_xml_original(StringIO(test_xml))
start element: <Element family at 0x105ca58>
fieldMap: {'age': '', 'name': '', 'id': ''}
Found child: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': 'somebody', 'id': '5'}
len(rowList) = 1
Found child: <Element person at 0x105c468>
fieldMap: {'age': '45', 'name': '', 'id': ''}
len(rowList) = 2
Found child: <Element person at 0x105c7b0>
fieldMap: {'age': '62', 'name': 'Grandma', 'id': ''}
Found child: <Element child at 0x106e468>
fieldMap: {'age': '35', 'name': 'Mom', 'id': '10'}
Found child: <Element grandchild at 0x106e148>
fieldMap: {'age': '7 and 3/4', 'name': '', 'id': ''}
len(rowList) = 3
Found child: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': '12345'}
len(rowList) = 4
len(rowList) = 5
len(rowList) = 6
Found child: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}
len(rowList) = 7
Row: {'age': '', 'name': 'somebody', 'id': '5'}
Row: {'age': '45', 'name': '', 'id': ''}
Row: {'age': '7 and 3/4', 'name': '', 'id': ''}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c7b0>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element child at 0x106e468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e148>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}

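It is a little hard to read, but you can see that it climbs the whole tree down from the root tag on the first pass, building up rowList for every element in the entire document. You will also notice that it does not even stop there: because the element.clear() call comes after the yield statement in parseXml(..), it is not executed until the second iteration (i.e. the next element in the tree).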
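
To make the point about yield concrete, here is a minimal Python 3 sketch (not part of the original code; the function name and the log list are made up for illustration). It shows that a statement placed after a yield only runs once the consumer asks for the next item:

```python
def rows_then_cleanup(log):
    """Yield a row, then run a 'cleanup' step placed after the yield."""
    for n in range(2):
        yield n
        # Like element.clear() in parseXml(..): this line runs only when the
        # consumer requests the next item, not when the row is handed out.
        log.append("cleaned %d" % n)


log = []
gen = rows_then_cleanup(log)

first = next(gen)             # the generator pauses at the yield
log_after_first = list(log)   # the cleanup for item 0 has not run yet

second = next(gen)            # resuming runs the cleanup for item 0, then yields 1
log_after_second = list(log)
```

In parseXml(..) the same thing happens with element.clear(): it always lags behind the rows that have already been handed to the caller.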

Incremental processing FTW

A simple fix is to let iterparse(..) do its job: parse iteratively! The following pulls out the same information and processes it incrementally:

def do_something_with_data(data):
    """This just prints it out. Yours will probably be more interesting."""
    print "Got data: ", data

def process_xml_iterative(xml_file):
    # by using the default 'end' event, you start at the _bottom_ of the tree
    ATTRS = ('name', 'age', 'id')
    for event, element in etree.iterparse(xml_file):
        print "%s element: %s" % (event, element)
        data = {}
        for attr in ATTRS:
            data[attr] = element.get(attr, u"")
        do_something_with_data(data)
        element.clear()
        del element # for extra insurance

Running on the same dummy XML:

>>> print test_xml
<family>
    <person name="somebody" id="5" />
    <person age="45" />
    <person name="Grandma" age="62">
        <child age="35" id="10" name="Mom">
            <grandchild age="7 and 3/4" />
            <grandchild id="12345" />
        </child>
    </person>
    <something-completely-different />
</family>
>>> process_xml_iterative(StringIO(test_xml))
end element: <Element person at 0x105cc10>
Got data:  {'age': u'', 'name': 'somebody', 'id': '5'}
end element: <Element person at 0x106e468>
Got data:  {'age': '45', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e148>
Got data:  {'age': '7 and 3/4', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e490>
Got data:  {'age': u'', 'name': u'', 'id': '12345'}
end element: <Element child at 0x106e508>
Got data:  {'age': '35', 'name': 'Mom', 'id': '10'}
end element: <Element person at 0x106e530>
Got data:  {'age': '62', 'name': 'Grandma', 'id': u''}
end element: <Element something-completely-different at 0x106e558>
Got data:  {'age': u'', 'name': u'', 'id': u''}
end element: <Element family at 0x105c6e8>
Got data:  {'age': u'', 'name': u'', 'id': u''}

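This should greatly improve both the speed and the memory performance of your script. Also, by hooking the 'end' event, you are free to clear and delete elements as you go, rather than waiting until all of their children have been processed.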
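
As a rough Python 3 sketch of that 'end'-event pattern (an illustration, not the code from this answer): the helper name iter_attribute_rows and its wanted_tags filter are hypothetical, and the fallback to the standard library's xml.etree.ElementTree is an assumption made only so the snippet runs where lxml is not installed. Each row is built and the element cleared before the row is yielded, so the cleanup never lags behind the consumer.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module offers a compatible iterparse()/clear().
    import xml.etree.ElementTree as etree

ATTRS = ("name", "age", "id")


def iter_attribute_rows(xml_file, wanted_tags=None):
    """Yield one attribute dict per finished element, clearing as we go.

    wanted_tags is a hypothetical filter: a set of tag names to report,
    or None to report every element.
    """
    for _event, element in etree.iterparse(xml_file, events=("end",)):
        row = None
        if wanted_tags is None or element.tag in wanted_tags:
            row = {attr: element.get(attr, "") for attr in ATTRS}
        # Clear before yielding, so the cleanup is not deferred until the
        # consumer asks for the next row.
        element.clear()
        if row is not None:
            yield row


SAMPLE = b"""<family>
    <person name="somebody" id="5" />
    <person name="Grandma" age="62">
        <child age="35" id="10" name="Mom" />
    </person>
</family>"""

rows = list(iter_attribute_rows(io.BytesIO(SAMPLE), wanted_tags={"person"}))
```

Two caveats: with a namespaced document the tags look like "{NS}Employee", so the filter would need the qualified names; and cleared elements stay attached to their parent as empty shells, which may still add up on very large files.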

Depending on your dataset, it might be a good idea to process only certain types of elements. The root element, for one, probably is not very meaningful, and other nested elements may also fill your dataset with a lot of {'age': u'', 'id': u'', 'name': u''}.


Or, with SAX

As an aside, whenever I read "XML" and "low memory" my mind always jumps straight to SAX, which is another way you could attack this problem. Using the built-in xml.sax module:

import xml.sax

class AttributeGrabber(xml.sax.handler.ContentHandler):
    """SAX Handler which will store selected attribute values."""
    def __init__(self, target_attrs=()):
        self.target_attrs = target_attrs

    def startElement(self, name, attrs):
        print "Found element: ", name
        data = {}
        for target_attr in self.target_attrs:
            data[target_attr] = attrs.get(target_attr, u"")

        # (no xml trees or elements created at all)
        do_something_with_data(data)

def process_xml_sax(xml_file):
    grabber = AttributeGrabber(target_attrs=('name', 'age', 'id'))
    xml.sax.parse(xml_file, grabber)

You will have to evaluate both options based on what works best in your situation (and maybe run a couple of benchmarks, if this is something you will be doing often).


Be sure to follow up with how things work out!


Edit based on follow-up comments

Implementing either of the above solutions may require some changes to the overall structure of your code, but anything you have should still be doable. For example, to process "rows" in batches, you could have:

def process_xml_batch(xml_file, batch_size=10):
    ATTRS = ('name', 'age', 'id')
    batch = []
    for event, element in etree.iterparse(xml_file):
        data = {}
        for attr in ATTRS:
            data[attr] = element.get(attr, u"")
        batch.append(data)
        element.clear()
        del element

        if len(batch) == batch_size:
            do_something_with_batch(batch)
            # Or, if you want this to be a generator:
            # yield batch
            batch = []
    if batch:
        # there are leftover items
        do_something_with_batch(batch) # Or, yield batch
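
For the generator variant hinted at in the batching comments, a minimal Python 3 sketch might look like the following. The name iter_batches and its attrs parameter are hypothetical, and the fallback to the standard library's xml.etree.ElementTree is an assumption so the snippet also runs without lxml:

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module offers a compatible iterparse()/clear().
    import xml.etree.ElementTree as etree


def iter_batches(xml_file, batch_size=10, attrs=("name", "age", "id")):
    """Yield lists of at most batch_size attribute dicts."""
    batch = []
    for _event, element in etree.iterparse(xml_file):  # default: 'end' events
        batch.append({attr: element.get(attr, "") for attr in attrs})
        element.clear()
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh list; the caller keeps the old one
    if batch:
        yield batch  # leftover items


SAMPLE = b"<family><person id='1'/><person id='2'/><person id='3'/></family>"
batches = list(iter_batches(io.BytesIO(SAMPLE), batch_size=3))
batch_sizes = [len(b) for b in batches]
```

Only one batch is held in memory at a time as long as the caller does not keep references to earlier batches, so batch_size becomes the knob that trades memory for per-batch overhead.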