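I have an XML file that describes the (flattened) directory structure of the files I want to put into a tar.gz archive.
How should I parse the XML to extract the path of each file?
Right now I am using lxml and building the paths like this: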
paths = []
for case in root.iter('case'):
    for language in case.iter('language'):
        for result in language.iter('result'):
            for file in result.iter('file'):
                paths.append('/'.join([node.get('id') for node in [case, language, result, file]]))
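But this feels rather hard-coded, and it will stop working if the structure changes.
I can find every file node with root.iter('file'), but how do I get all the parent nodes (directories) of each file node? Or should I do this in a (completely?) different way?
The XML looks like this: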
<?xml version="1.0" encoding="UTF-8"?>
<files batch="regular">
  <case id="case_10_some_description">
    <language id="english">
      <result id="images">
        <file id="screenshot_1.png"/>
        <file id="screenshot_2.png"/>
        <file id="screenshot_3.png"/>
        <file id="screenshot_4.png"/>
        <file id="screenshot_5.png"/>
        <file id="screenshot_6.png"/>
      </result>
    </language>
  </case>
  <case id="case_12_some_description">
    <language id="english">
      <result id="images">
        <file id="screenshot_1.png"/>
        <file id="screenshot_2.png"/>
        <file id="screenshot_3.png"/>
      </result>
    </language>
  </case>
</files>
And these are the file paths I want to get out of it:
regular/case_10_some_description/english/images/screenshot_1.png
regular/case_10_some_description/english/images/screenshot_2.png
regular/case_10_some_description/english/images/screenshot_3.png
regular/case_10_some_description/english/images/screenshot_4.png
regular/case_10_some_description/english/images/screenshot_5.png
regular/case_10_some_description/english/images/screenshot_6.png
regular/case_12_some_description/english/images/screenshot_1.png
regular/case_12_some_description/english/images/screenshot_2.png
regular/case_12_some_description/english/images/screenshot_3.png
Answer 0 (score: 1)
Did you design this file schema yourself? If you can change it, I definitely would. Try something like this:
<?xml version="1.0" encoding="UTF-8"?>
<Directory id="regular">
  <Directory id="case_10_some_description">
    <Directory id="english">
      <Directory id="images">
        <file id="screenshot_1.png"/>
        <file id="screenshot_2.png"/>
        <file id="screenshot_3.png"/>
        <file id="screenshot_4.png"/>
        <file id="screenshot_5.png"/>
        <file id="screenshot_6.png"/>
      </Directory>
    </Directory>
  </Directory>
  <Directory id="case_12_some_description">
    <Directory id="english">
      <Directory id="images">
        <file id="screenshot_1.png"/>
        <file id="screenshot_2.png"/>
        <file id="screenshot_3.png"/>
      </Directory>
    </Directory>
  </Directory>
</Directory>
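If tags have the same meaning, always give them the same name. Distinguishing elements by their attributes rather than by many different tag names will probably make your parsing easier.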
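To illustrate why the uniform schema helps, here is a minimal sketch of a recursive walk over it. It assumes a shortened copy of the XML above held in a string, and `collect_paths` is a hypothetical helper name chosen for this example. Because every directory level uses the same tag, the function never needs to know how deep the tree is.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <Directory>/<file> layout suggested above (assumed sample data).
XML = """\
<Directory id="regular">
  <Directory id="case_10_some_description">
    <Directory id="english">
      <Directory id="images">
        <file id="screenshot_1.png"/>
        <file id="screenshot_2.png"/>
      </Directory>
    </Directory>
  </Directory>
  <Directory id="case_12_some_description">
    <Directory id="english">
      <Directory id="images">
        <file id="screenshot_1.png"/>
      </Directory>
    </Directory>
  </Directory>
</Directory>
"""


def collect_paths(node, prefix=""):
    """Recursively yield 'dir/dir/file' paths for every <file> below node."""
    here = prefix + node.get("id")
    if node.tag == "file":
        yield here
        return
    # Any other element is treated as a directory, whatever its depth.
    for child in node:
        yield from collect_paths(child, here + "/")


paths = list(collect_paths(ET.fromstring(XML)))
print("\n".join(paths))
```

Adding or removing a directory level in the XML then requires no change to the parsing code.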
Answer 1 (score: 0)
import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml')
root = tree.getroot()
for file in root.iter('file'):
    # The directory prefix is hard-coded, so it only matches files of the first <case>.
    print('regular/case_10_some_description/english/images/' + file.attrib['id'])
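If the schema cannot be changed, the prefix can be derived from each file's ancestors instead of being written out. The following is a rough sketch under a few assumptions: the XML is a shortened copy of the question's sample held in a string, `parent_of` and `path_of` are hypothetical names, and the root element contributes its `batch` attribute because it has no `id`. Elements in `xml.etree.ElementTree` do not store a reference to their parent, so the sketch builds a child-to-parent map first; with lxml, `getparent()` or `iterancestors()` on the element could be used instead.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (assumed sample data).
XML = """\
<files batch="regular">
  <case id="case_10_some_description">
    <language id="english">
      <result id="images">
        <file id="screenshot_1.png"/>
        <file id="screenshot_2.png"/>
      </result>
    </language>
  </case>
  <case id="case_12_some_description">
    <language id="english">
      <result id="images">
        <file id="screenshot_1.png"/>
      </result>
    </language>
  </case>
</files>
"""

root = ET.fromstring(XML)

# Map every child element to its parent, since ElementTree has no parent pointer.
parent_of = {child: parent for parent in root.iter() for child in parent}


def path_of(node):
    """Join the 'id' of node and of all its ancestors, root first."""
    parts = []
    while node is not None:
        # Assumption: the root <files> element has no 'id', so fall back to 'batch'.
        parts.append(node.get("id") or node.get("batch"))
        node = parent_of.get(node)
    return "/".join(reversed(parts))


paths = [path_of(f) for f in root.iter("file")]
print("\n".join(paths))
```

This does not depend on the tag names between the root and the file nodes, so inserting or renaming an intermediate level should not break it, as long as every level carries an `id`.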