ElementTree iterparse strategy

Time: 2012-10-09 04:51:47

Tags: python xml sax elementtree iterparse

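I have to process fairly large XML documents (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing).

My concern is the following. Imagine you have an XML document like this: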

<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
        <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
        <name>Brian</name>
        <name>Meg</name>
    </members>
  </family>
</families>

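The problem, of course, is knowing when I am getting a family name (such as Simpson) and when I am getting the name of a family member (such as Homer).

What I have been doing so far is using "switches" that tell me whether I am inside the "members" tag. The code looks like this: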

import xml.etree.cElementTree as ET

__author__ = 'moriano'

file_path = "test.xml"
context = ET.iterparse(file_path, events=("start", "end"))

# turn it into an iterator
context = iter(context)
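on_members_tag = False
for event, elem in context:
    tag = elem.tag

    if event == 'start':
        # Only track nesting here: iterparse() does not guarantee that
        # elem.text is populated yet when the "start" event fires.
        if tag == "members":
            on_members_tag = True
        continue

    # event == 'end': the element is complete, so its text is safe to read.
    if tag == 'name':
        value = elem.text
        if value:
            value = value.encode('utf-8').strip()
        if on_members_tag:
            print "The member of the family is %s" % value
        else:
            print "The family is %s " % value
    elif tag == 'members':
        on_members_tag = False
    # Free the finished element's contents to keep memory usage low.
    elem.clear()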

This works and prints the expected output:

The family is Simpson 
The member of the family is Homer
The member of the family is Marge
The member of the family is Bart
The family is Griffin 
The member of the family is Peter
The member of the family is Brian
The member of the family is Meg

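My concern is that even in this (simple) example I have to create an extra variable (on_members_tag) just to know which tag I am in. Now imagine the real XML documents I have to handle, which have many more nested tags.

Also note that this is a heavily simplified example, so you can assume I may face XML with more tags, deeper nesting, and the need to read different tag names, attributes, and so on.

So the question is: am I doing something horribly stupid here? I feel there must be a more elegant solution.

2 answers: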

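Answer 0 (score: 27)

Here is one possible approach: we maintain a list holding the path of currently open tags and look back through it to find the parent nodes.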

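import xml.etree.ElementTree as ET

file_path = "test.xml"  # same file as in the question

path = []  # tags of the elements that are currently open
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        if elem.tag == 'name':
            # an enclosing <members> means this is a member's name
            if 'members' in path:
                print('member')
            else:
                print('nonmember')
        path.pop()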
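
As a possible extension of the path idea, the sketch below checks only the immediate parent (path[-2]) rather than the whole path, and clears each finished <family> so that memory stays low on large files. It is written for Python 3, unlike the Python 2 code in the question. The function name classify_names and the inlined XML constant are illustrative assumptions, not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch is self-contained.
XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
        <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
        <name>Brian</name>
        <name>Meg</name>
    </members>
  </family>
</families>
"""


def classify_names(source):
    """Yield (parent_tag, text) for every <name> element in the source."""
    path = []  # tags of the currently open elements, outermost first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        # On "end" the element is complete and path[-1] is its own tag,
        # so path[-2] (if present) is the tag of its direct parent.
        if elem.tag == "name":
            parent = path[-2] if len(path) > 1 else None
            yield parent, (elem.text or "").strip()
        path.pop()
        if elem.tag == "family":
            # Release the children of the finished <family> element.
            elem.clear()


results = list(classify_names(io.BytesIO(XML)))
for parent, text in results:
    print(parent, text)
```

A name whose parent is "family" is a family name, and one whose parent is "members" is a member. Looking at path[-2] instead of testing membership in the whole path avoids false matches when a tag with the same name appears somewhere higher up in a deeper document.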

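Answer 1 (score: 14)

pulldom is a great fit for this. You get a SAX stream. You can iterate over the stream and, when you find a node you are interested in, load that node into a DOM fragment.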

import xml.dom.pulldom as pulldom
import xpath # from http://code.google.com/p/py-dom-xpath/

events = pulldom.parse('families.xml')
for event, node in events:
    if event == 'START_ELEMENT' and node.tagName=='family':
        events.expandNode(node) # node now contains a dom fragment
        family_name = xpath.findvalue('name', node)
        members = xpath.findvalues('members/name', node)
        print('family name: {0}, members: {1}'.format(family_name, members))

Output:

family name: Simpson, members: [u'Homer', u'Marge', u'Bart']
family name: Griffin, members: [u'Peter', u'Brian', u'Meg']
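
If installing the third-party xpath module is not an option, the same pulldom idea can probably be expressed with the standard library alone. The following is a minimal sketch for Python 3 that walks the expanded DOM fragment with getElementsByTagName. The helper names text_of and read_families and the inlined XML constant are hypothetical and not part of the original answer.

```python
import io
import xml.dom.pulldom as pulldom

# Sample document from the question, inlined so the sketch is self-contained.
XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
        <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
        <name>Brian</name>
        <name>Meg</name>
    </members>
  </family>
</families>
"""


def text_of(node):
    """Concatenate the direct text children of a DOM element."""
    parts = [c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE]
    return "".join(parts).strip()


def read_families(source):
    """Yield (family_name, members) for each <family> in the stream."""
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "family":
            events.expandNode(node)  # node now holds the whole <family> subtree
            family_name = None
            members = []
            for name in node.getElementsByTagName("name"):
                # The direct parent tells family names and member names apart.
                if name.parentNode.tagName == "family":
                    family_name = text_of(name)
                else:
                    members.append(text_of(name))
            yield family_name, members


families = list(read_families(io.BytesIO(XML)))
for family_name, members in families:
    print("family name: {0}, members: {1}".format(family_name, members))
```

Only one <family> subtree is expanded into a DOM fragment at a time, so the rest of the document is still consumed as a stream.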