How to parse an XML file that describes a directory structure

Time: 2013-09-04 09:08:07

Tags: python xml

I have an XML file that contains the (flattened) directory structure of the files I want to put into a tar.gz file.

How should I parse the XML to extract the path of each file?

Right now I am using lxml and finding the paths like this:

paths = []
for case in root.iter('case'):
    for language in case.iter('language'):
        for result in language.iter('result'):
            for file in result.iter('file'):
                paths.append('/'.join([node.get('id') for node in [case, language, result, file]]))

But this feels too hard-coded, and it will stop working if the structure changes.

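I can find every file node with root.iter('file'), but how do I get all the parent nodes (directories) of each file node? Or should I do this in a (completely?) different way?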

The XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<files batch="regular">
    <case id="case_10_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
                <file id="screenshot_4.png"/>
                <file id="screenshot_5.png"/>
                <file id="screenshot_6.png"/>
            </result>
        </language>
    </case>
    <case id="case_12_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
            </result>
        </language>
    </case>
</files>

These are the files:

regular/case_10_some_description/english/images/screenshot_1.png
regular/case_10_some_description/english/images/screenshot_2.png
regular/case_10_some_description/english/images/screenshot_3.png
regular/case_10_some_description/english/images/screenshot_4.png
regular/case_10_some_description/english/images/screenshot_5.png
regular/case_10_some_description/english/images/screenshot_6.png
regular/case_12_some_description/english/images/screenshot_1.png
regular/case_12_some_description/english/images/screenshot_2.png
regular/case_12_some_description/english/images/screenshot_3.png

2 answers:

Answer 0 (score: 1)

Did you create this file schema yourself? If you can change it, I definitely would. Try something like this:

<?xml version="1.0" encoding="UTF-8"?>
<Directory id="regular">
    <Directory id="case_10_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
                <file id="screenshot_4.png"/>
                <file id="screenshot_5.png"/>
                <file id="screenshot_6.png"/>
            </Directory>
        </Directory>
    </Directory>
    <Directory id="case_12_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
            </Directory>
        </Directory>
    </Directory>
</Directory>

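If tags have the same meaning, always give them the same name. Distinguishing nodes by their attributes rather than by many different tag names may make your parsing easier.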
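With a uniform schema like the one above, the paths can be collected by a short recursive walk. The following is only a minimal sketch using the standard library's xml.etree.ElementTree; the helper name iter_paths and the inline SAMPLE string are made up for illustration, and it assumes every element carries an id attribute.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the suggested schema, inlined so the sketch is self-contained.
SAMPLE = """<Directory id="regular">
    <Directory id="case_12_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
            </Directory>
        </Directory>
    </Directory>
</Directory>"""


def iter_paths(node, prefix=()):
    """Yield one slash-joined path per <file> element at or below node.

    Hypothetical helper: assumes every element has an "id" attribute.
    """
    parts = prefix + (node.get('id'),)
    if node.tag == 'file':
        yield '/'.join(parts)
        return
    # Any other element is treated as a directory, whatever its nesting depth.
    for child in node:
        yield from iter_paths(child, parts)


paths = list(iter_paths(ET.fromstring(SAMPLE)))
for path in paths:
    print(path)
```

Because the walk never names a specific level, adding or removing directory levels in the XML should not require changing the code.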

Answer 1 (score: 0)

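import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml')
root = tree.getroot()
# Note: the directory prefix is hard-coded, so the printed paths are only
# correct for files under case_10_some_description/english/images.
for file in root.iter('file'):
    print('regular/case_10_some_description/english/images/' + file.attrib['id'])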
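If the schema from the question cannot be changed, the prefix could instead be derived from each file's ancestors. This is a rough sketch, not a definitive solution: ElementTree elements have no parent pointer, so it builds a child-to-parent map first. The names parent_of and path_of and the inline SAMPLE are invented for the example, and it assumes the root stores its directory name in batch while every other element uses id. With lxml, getparent() or iterancestors() on the element should make the map unnecessary.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question, inlined for a self-contained sketch.
SAMPLE = """<files batch="regular">
    <case id="case_10_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
            </result>
        </language>
    </case>
    <case id="case_12_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
            </result>
        </language>
    </case>
</files>"""

root = ET.fromstring(SAMPLE)

# ElementTree elements do not know their parent, so map each child to its parent.
parent_of = {child: parent for parent in root.iter() for child in parent}


def path_of(node):
    """Join the names of node and all its ancestors (hypothetical helper)."""
    parts = []
    while node is not None:
        # Assumption: the root keeps its name in "batch", everything else in "id".
        parts.append(node.get('id') or node.get('batch'))
        node = parent_of.get(node)
    return '/'.join(reversed(parts))


paths = [path_of(f) for f in root.iter('file')]
for path in paths:
    print(path)
```

Only the file tag is named explicitly here, so renaming or adding intermediate levels should leave the code unchanged.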