Parsing XML in Python without manually calling attributes, tags, and child numbers

Date: 2018-07-18 12:16:59

Tags: python xml parsing

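I want to write a Python script that starts at the root of an XML tree, walks through every child element, and scans the tags, attributes, and contained text in the order they appear. Ideally, the tag name of each node would be joined with its attribute keys and with the tag names of its child nodes, so the output stays coherent and the text is easier to interpret.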

So for the example below, from the ElementTree documentation,

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

the ideal result would be

country.name Liechtenstein
country.rank 1
country.year 2008
country.gdppc 141100
country.neighbor.name Austria
country.neighbor.direction E
country.neighbor.name Switzerland
country.neighbor.direction W
country.name Singapore
country.rank 4
country.year 2011
country.gdppc 59900
country.neighbor.name Malaysia
country.neighbor.direction N
country.name Panama
country.rank 68
country.year 2011
country.gdppc 13600
country.neighbor.name Costa Rica
country.neighbor.direction W
country.neighbor.name Colombia
country.neighbor.direction E

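The script I have been using clearly lacks automation, because it does not count the objects (tags, attributes, text) at each step. Only the child tags are handled properly, and only as long as you can specify their depth in advance (in this case, 2 loops for a depth of 2). As you can see, the text is printed separately, which it should not be, and empty entries show up even though they need to be excluded.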

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()

for child in root:
    print(child.tag, child.attrib.keys(), child.attrib.get('name'))
    for child1 in child:
        print(child1.tag, child1.attrib.items())

for i in range(0,3):
    for j in range(0,3):
        print(root[i][j].text)

The output is...

country dict_keys(['name']) Liechtenstein
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Austria'), ('direction', 'E')])
neighbor dict_items([('name', 'Switzerland'), ('direction', 'W')])
country dict_keys(['name']) Singapore
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Malaysia'), ('direction', 'N')])
country dict_keys(['name']) Panama
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Costa Rica'), ('direction', 'W')])
neighbor dict_items([('name', 'Colombia'), ('direction', 'E')])
1
2008
141100
4
2011
59900
68
2011
13600

1 answer:

Answer 0 (score: 1)

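I feel like there should be a better library for handling XML files, but I haven't found one yet. There may also be room for improvement here. Anyway, this is a solution I came up with: the idea is to use a recursive function that extracts as much detail as possible from each element and then returns it to the level above.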

import xml.etree.ElementTree as ET

xml = ET.parse('p.xml')

root = xml.getroot()

def getDataRecursive(element):
    data = list()

    # get attributes of element, necessary for all elements
    for key, value in element.attrib.items():
        data.append(element.tag + '.' + key + ' ' + value)

    # only end-of-line elements have important text, at least in this example
    if len(element) == 0:
        if element.text is not None:
            data.append(element.tag + ' ' + element.text)

    # otherwise, go deeper and add to the current tag
    else:
        for el in element:
            within = getDataRecursive(el)

            for data_point in within:
                data.append(element.tag + '.' + data_point)

    return data

# print results
for x in getDataRecursive(root):
    print(x)
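
Because the recursion above starts at the root, every line it prints begins with the root tag (`data.country.name Liechtenstein` and so on), whereas the question asks for paths that start at `country`. The following is a minimal sketch of one way to get that, not a definitive implementation. It carries the path down as a prefix and starts from the root's children. The names `iter_paths`, `flatten`, and `SAMPLE` are made up for this illustration, and the XML is a shortened copy of the question's data, inlined so the block runs on its own. Unlike the function above, it also reports text on elements that have children, as long as that text is more than whitespace.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML, inlined so the sketch is self-contained.
SAMPLE = """<data>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""


def iter_paths(element, prefix=""):
    """Yield 'dotted.path value' lines for an element and its descendants."""
    path = prefix + element.tag
    # Attributes of the current element come first.
    for key, value in element.attrib.items():
        yield f"{path}.{key} {value}"
    # Keep text only when it is more than the whitespace between child tags.
    text = (element.text or "").strip()
    if text:
        yield f"{path} {text}"
    # Recurse into children, extending the dotted path.
    for child in element:
        yield from iter_paths(child, path + ".")


def flatten(root):
    """Flatten every child of root, leaving the root tag out of the paths."""
    lines = []
    for child in root:
        lines.extend(iter_paths(child))
    return lines


lines = flatten(ET.fromstring(SAMPLE))
print("\n".join(lines))
```

To run it against a file instead, `ET.parse('country_data.xml').getroot()` can be passed to `flatten` in place of `ET.fromstring(SAMPLE)`.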