Elegant Python loop for parsing flat XML

Time: 2016-05-23 14:53:03

Tags: python xml python-3.x for-loop lxml

I am using lxml.objectify to parse an XML file in Python 3:

<root>
    <object_header></object_header>
    <object_details></object_details>
    <object_details></object_details>
    <object_header></object_header>
    <object_details></object_details>
    <object_header></object_header>
</root>

Note that some objects have no details.

My current approach to parsing this (it works, but it is not elegant) is as follows:

from lxml import objectify, etree
root = objectify.parse(xmlFile).getroot()
elems = [el for el in root.iterchildren()]
# data is list of objects
data = []
# Have to instantiate outside of the for loop in case the last object has no details.
objectDetails = ''
# Don't store first object right away.
firstObject = True
# Iterate through each XML element.
for elem in elems:
    if elem.tag == 'object_header':
        # Remember object header info.
        object = storeHeaderInfo(objectDetails)
        # Skip saving if first object, need to grab object details.
        if firstObject:
            # Don't skip again, in case object has no details.
            firstObject = False
            continue
        # Save object, already grabbed object details.
        data.append(object)
    else:
        # Process object details in <object_details> tag.
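        # encoding='unicode' makes tostring return str; the default is bytes in Python 3.
        objectDetails += etree.tostring(elem, encoding='unicode')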
# Save last object.
object = storeHeaderInfo(objectDetails)
data.append(object)

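What I don't like is that the code for storing an object appears twice: once inside the for loop for each object, and then again after the loop for the last object.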

Is there a more Pythonic or elegant way?

1 answer:

Answer 0 (score: 2):

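You can make things simpler by using the following-sibling::* XPath expression: for each header, walk its following siblings and stop at the next header.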

from lxml import objectify, etree

root = objectify.parse("input.xml").getroot()
elems = root.xpath("//object_header")

for elem in elems:
    header = elem.text
    objectDetails = ''
    for sibling in elem.xpath("following-sibling::*"):
        if sibling.tag == 'object_header':
            break

        objectDetails += str(etree.tostring(sibling))

    print(header, objectDetails)

Given the following input:

<root>
    <object_header>object1</object_header>
    <object_details>detail1</object_details>
    <object_details>detail2</object_details>
    <object_header>object2</object_header>
    <object_details>detail1</object_details>
    <object_header>object3</object_header>
</root>

the code prints:

object1 b'<object_details>detail1</object_details>'b'<object_details>detail2</object_details>'
object2 b'<object_details>detail1</object_details>'
object3
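
The b'...' prefixes in that output appear because tostring returns bytes by default in Python 3, and str() of a bytes object yields its repr. Passing encoding="unicode" returns a str instead. As a possible alternative, here is a minimal single-pass sketch that groups details under the most recent header, so nothing has to be saved a second time after the loop. To stay dependency-free it uses the standard library's xml.etree.ElementTree rather than lxml; lxml.etree elements support the same child iteration, .tag, and .text, so the function should carry over. The names XML and group_by_header are hypothetical, and skipping details that appear before any header is an assumption about the desired behavior.

```python
import xml.etree.ElementTree as ET

# Same sample input as in the answer above.
XML = """<root>
    <object_header>object1</object_header>
    <object_details>detail1</object_details>
    <object_details>detail2</object_details>
    <object_header>object2</object_header>
    <object_details>detail1</object_details>
    <object_header>object3</object_header>
</root>"""


def group_by_header(root):
    """Return (header_text, details_markup) pairs in document order."""
    groups = []  # each entry is [header_text, [serialized detail elements]]
    for child in root:
        if child.tag == "object_header":
            # A header opens a new group; later details attach to it.
            groups.append([child.text, []])
        elif groups:
            # encoding="unicode" returns str rather than bytes;
            # strip() drops the whitespace tail that follows the element.
            markup = ET.tostring(child, encoding="unicode").strip()
            groups[-1][1].append(markup)
        # Assumption: details that precede the first header are ignored.
    return [(header, "".join(details)) for header, details in groups]


for header, details in group_by_header(ET.fromstring(XML)):
    print(header, details)
```

Because each header appends its own group as soon as it is seen, a header with no details simply ends up with an empty string, and the last object needs no special handling.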