我是一个蟒蛇新手。我试图使用lxml解析我的python模块中的一个巨大的xml文件。尽管在每个循环结束时清除了元素,但我的记忆仍然会崩溃并导致应用程序崩溃。我相信我在这里遗漏了一些东西。请帮助弄清楚那是什么。
以下是我正在使用的主要功能 -
from lxml import etree
def parseXml(context,attribList):
for _, element in context:
fieldMap={}
rowList=[]
readAttribs(element,fieldMap,attribList)
readAllChildren(element,fieldMap,attribList)
for row in rowList:
yield row
element.clear()
def readAttribs(element,fieldMap,attribList):
for atrrib in attribList:
fieldMap[attrib]=element.get(attrib,'')
def readAllChildren(element,fieldMap,attribList,rowList):
for childElem in element:
readAttribs(childEleme,fieldMap,attribList)
if len(childElem) > 0:
readAllChildren(childElem,fieldMap,attribList)
rowlist.append(fieldMap.copy())
childElem.clear()
def main():
attribList=['name','age','id']
context=etree.iterparse(fullFilePath, events=("start",))
for row in parseXml(context,attribList)
print row
谢谢!
示例xml和嵌套字典 -
<root xmlns='NS'>
<Employee Name="Mr.ZZ" Age="30">
<Experience TotalYears="10" StartDate="2000-01-01" EndDate="2010-12-12">
<Employment id = "1" EndTime="ABC" StartDate="2000-01-01" EndDate="2002-12-12">
<Project Name="ABC_1" Team="4">
</Project>
</Employment>
<Employment id = "2" EndTime="XYZ" StartDate="2003-01-01" EndDate="2010-12-12">
<PromotionStatus>Manager</PromotionStatus>
<Project Name="XYZ_1" Team="7">
<Award>Star Team Member</Award>
</Project>
</Employment>
</Experience>
</Employee>
</root>
ELEMENT_NAME='element_name'
ELEMENTS='elements'
ATTRIBUTES='attributes'
TEXT='text'
xmlDef={ 'namespace' : 'NS',
'content' :
{ ELEMENT_NAME: 'Employee',
ELEMENTS: [{ELEMENT_NAME: 'Experience',
ELEMENTS: [{ELEMENT_NAME: 'Employment',
ELEMENTS: [{
ELEMENT_NAME: 'PromotionStatus',
ELEMENTS: [],
ATTRIBUTES:[],
TEXT:['PromotionStatus']
},
{
ELEMENT_NAME: 'Project',
ELEMENTS: [{
ELEMENT_NAME: 'Award',
ELEMENTS: {},
ATTRIBUTES:[],
TEXT:['Award']
}],
ATTRIBUTES:['Name','Team'],
TEXT:[]
}],
ATTRIBUTES: ['TotalYears','StartDate','EndDate'],
TEXT:[]
}],
ATTRIBUTES: ['TotalYears','StartDate','EndDate'],
TEXT:[]
}],
ATTRIBUTES: ['Name','Age'],
TEXT:[]
}
}
答案 0 :(得分:15)
看起来您已经遵循了lxml
,特别是etree.iterparse(..)
的一些好建议,但我认为您的实施正在从错误的角度解决问题。 iterparse(..)
的想法是远离收集和存储数据,而是在读入时处理标记。您的readAllChildren(..)
函数将所有内容保存到rowList
,这会增长并增长到覆盖整个文档树。我做了一些更改来显示正在发生的事情:
from lxml import etree
def parseXml(context,attribList):
for event, element in context:
print "%s element %s:" % (event, element)
fieldMap = {}
rowList = []
readAttribs(element, fieldMap, attribList)
readAllChildren(element, fieldMap, attribList, rowList)
for row in rowList:
yield row
element.clear()
def readAttribs(element, fieldMap, attribList):
for attrib in attribList:
fieldMap[attrib] = element.get(attrib,'')
print "fieldMap:", fieldMap
def readAllChildren(element, fieldMap, attribList, rowList):
for childElem in element:
print "Found child:", childElem
readAttribs(childElem, fieldMap, attribList)
if len(childElem) > 0:
readAllChildren(childElem, fieldMap, attribList, rowList)
rowList.append(fieldMap.copy())
print "len(rowList) =", len(rowList)
childElem.clear()
def process_xml_original(xml_file):
attribList=['name','age','id']
context=etree.iterparse(xml_file, events=("start",))
for row in parseXml(context,attribList):
print "Row:", row
运行一些虚拟数据:
>>> from cStringIO import StringIO
>>> test_xml = """\
... <family>
... <person name="somebody" id="5" />
... <person age="45" />
... <person name="Grandma" age="62">
... <child age="35" id="10" name="Mom">
... <grandchild age="7 and 3/4" />
... <grandchild id="12345" />
... </child>
... </person>
... <something-completely-different />
... </family>
... """
>>> process_xml_original(StringIO(test_xml))
start element: <Element family at 0x105ca58>
fieldMap: {'age': '', 'name': '', 'id': ''}
Found child: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': 'somebody', 'id': '5'}
len(rowList) = 1
Found child: <Element person at 0x105c468>
fieldMap: {'age': '45', 'name': '', 'id': ''}
len(rowList) = 2
Found child: <Element person at 0x105c7b0>
fieldMap: {'age': '62', 'name': 'Grandma', 'id': ''}
Found child: <Element child at 0x106e468>
fieldMap: {'age': '35', 'name': 'Mom', 'id': '10'}
Found child: <Element grandchild at 0x106e148>
fieldMap: {'age': '7 and 3/4', 'name': '', 'id': ''}
len(rowList) = 3
Found child: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': '12345'}
len(rowList) = 4
len(rowList) = 5
len(rowList) = 6
Found child: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}
len(rowList) = 7
Row: {'age': '', 'name': 'somebody', 'id': '5'}
Row: {'age': '45', 'name': '', 'id': ''}
Row: {'age': '7 and 3/4', 'name': '', 'id': ''}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c7b0>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element child at 0x106e468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e148>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}
这有点难以阅读,但是您可以看到它在第一遍中从根标签向下攀爬整个树,为整个文档中的每个元素构建rowList
。您还会注意到它甚至没有停在那里,因为element.clear()
调用在<{em} yield
parseXml(..)
中的iterparse(..)
声明之后来了,它不会被执行,直到第二次迭代(即树中的下一个元素)。
一个简单的解决方法是让def do_something_with_data(data):
"""This just prints it out. Yours will probably be more interesting."""
print "Got data: ", data
def process_xml_iterative(xml_file):
# by using the default 'end' event, you start at the _bottom_ of the tree
ATTRS = ('name', 'age', 'id')
for event, element in etree.iterparse(xml_file):
print "%s element: %s" % (event, element)
data = {}
for attr in ATTRS:
data[attr] = element.get(attr, u"")
do_something_with_data(data)
element.clear()
del element # for extra insurance
完成它的工作:迭代解析!以下内容将提取相同的信息并逐步处理:
>>> print test_xml
<family>
<person name="somebody" id="5" />
<person age="45" />
<person name="Grandma" age="62">
<child age="35" id="10" name="Mom">
<grandchild age="7 and 3/4" />
<grandchild id="12345" />
</child>
</person>
<something-completely-different />
</family>
>>> process_xml_iterative(StringIO(test_xml))
end element: <Element person at 0x105cc10>
Got data: {'age': u'', 'name': 'somebody', 'id': '5'}
end element: <Element person at 0x106e468>
Got data: {'age': '45', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e148>
Got data: {'age': '7 and 3/4', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e490>
Got data: {'age': u'', 'name': u'', 'id': '12345'}
end element: <Element child at 0x106e508>
Got data: {'age': '35', 'name': 'Mom', 'id': '10'}
end element: <Element person at 0x106e530>
Got data: {'age': '62', 'name': 'Grandma', 'id': u''}
end element: <Element something-completely-different at 0x106e558>
Got data: {'age': u'', 'name': u'', 'id': u''}
end element: <Element family at 0x105c6e8>
Got data: {'age': u'', 'name': u'', 'id': u''}
在相同的虚拟XML上运行:
'end'
这应该会大大提高脚本的速度和内存性能。此外,通过挂钩{'age': u'', 'id': u'', 'name': u''}
事件,您可以随时清除和删除元素,而不是等到所有孩子都被处理完毕。
根据您的数据集,仅处理某些类型的元素可能是个好主意。例如,根元素可能不是很有意义,其他嵌套元素也可能用很多import xml.sax
class AttributeGrabber(xml.sax.handler.ContentHandler):
"""SAX Handler which will store selected attribute values."""
def __init__(self, target_attrs=()):
self.target_attrs = target_attrs
def startElement(self, name, attrs):
print "Found element: ", name
data = {}
for target_attr in self.target_attrs:
data[target_attr] = attrs.get(target_attr, u"")
# (no xml trees or elements created at all)
do_something_with_data(data)
def process_xml_sax(xml_file):
grabber = AttributeGrabber(target_attrs=('name', 'age', 'id'))
xml.sax.parse(xml_file, grabber)
填充数据集。
顺便说一句,当我读“XML”和“低记忆”时,我的思绪总是直接跳到SAX,这是你可以解决这个问题的另一种方式。使用内置xml.sax
模块:
def process_xml_batch(xml_file, batch_size=10):
ATTRS = ('name', 'age', 'id')
batch = []
for event, element in etree.iterparse(xml_file):
data = {}
for attr in ATTRS:
data[attr] = element.get(attr, u"")
batch.append(data)
element.clear()
del element
if len(batch) == batch_size:
do_something_with_batch(batch)
# Or, if you want this to be a genrator:
# yield batch
batch = []
if batch:
# there are leftover items
do_something_with_batch(batch) # Or, yield batch
您必须根据您的情况最适合的方式评估这两个选项(并且可能运行几个基准测试,如果这是您经常要做的事情)。
请务必跟进事情的进展情况!
实现上述任一解决方案可能需要对代码的整体结构进行一些更改,但您所拥有的任何内容都应该仍然可行。例如,批量处理“行”,您可以:
{{1}}