How to search for certain strings in an XML response

Time: 2016-05-11 22:36:55

Tags: python xml amazon-s3

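I am using the urllib2 library to access an S3 bucket that I own, and I get back an XML structure. The problem is that I want to find the nodes in that structure whose Key starts with "part-".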

I want to extract those keys and save them in a list/array, then loop over it to read the contents of those files.

Part of the XML response:

<Contents>
<Key>output/part-00000</Key>
<LastModified>2016-05-11T17:01:19.000Z</LastModified>
<ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
<Size>0</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
<Contents>
<Key>output/part-00001</Key>
<LastModified>2016-05-11T17:01:15.000Z</LastModified>
<ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
<Size>0</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>

This is what I am doing now:

import urllib2
import xml.etree.ElementTree as ET

f = urllib2.urlopen("https://s3.amazonaws.com/*******")

tree = ET.parse(f)
root = tree.getroot()

for child in root:
    print child

Output:

<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Name' at 0x103a325d0>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Prefix' at 0x103a32610>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Marker' at 0x103a32690>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}MaxKeys' at 0x103a32710>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}IsTruncated' at 0x103a32750>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a32790>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a32950>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a32b10>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a32cd0>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a32e90>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a3e090>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a3e250>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a3e410>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a3e5d0>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a3e790>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a3e950>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a3eb10>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a3ecd0>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a3ee90>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a47090>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a47250>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a47410>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a475d0>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a47790>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a47950>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a47b10>
<Element '{http://s3.amazonaws.com/doc/2006-03-01/}Contents' at 0x103a47cd0>

I have tried various solutions using minidom and xml.etree.ElementTree, but I can't get it right.

So what I want is to loop over those XML nodes, find all references to part-*****, and save them in an array.

Any help or pointers are welcome.

1 Answer:

Answer 0 (score: 0)

My solution:

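import urllib2
import xml.etree.ElementTree as ET

f = urllib2.urlopen("https://s3.amazonaws.com/******")

tree = ET.parse(f)
root = tree.getroot()

# Tags are namespace-qualified, so the namespace URI must be part of the tag name.
for child in root.findall('{http://s3.amazonaws.com/doc/2006-03-01/}Contents'):
    for key in child.findall("{http://s3.amazonaws.com/doc/2006-03-01/}Key"):
        print key.text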
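
The loop above prints every key. A minimal sketch of the remaining step, keeping only the keys whose file name starts with "part-" and saving them in a list, might look like the following. The sample XML, the bucket name, the `_SUCCESS` entry, and the helper name `find_part_keys` are all made up for illustration; only the namespace URI and the `Contents`/`Key` tags come from the question. It is written so that it also runs on Python 3.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the element tags printed in the question.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

# Hypothetical sample standing in for the real bucket listing.
SAMPLE_XML = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents>
    <Key>output/_SUCCESS</Key>
    <Size>0</Size>
  </Contents>
  <Contents>
    <Key>output/part-00000</Key>
    <Size>0</Size>
  </Contents>
  <Contents>
    <Key>output/part-00001</Key>
    <Size>0</Size>
  </Contents>
</ListBucketResult>
"""


def find_part_keys(root, prefix="part-"):
    """Return every Key whose last path segment starts with `prefix`."""
    keys = []
    # The "s3:" prefix is resolved through the S3_NS mapping.
    for key in root.findall("s3:Contents/s3:Key", S3_NS):
        text = key.text or ""
        # Keys look like "output/part-00000", so test the file name only.
        if text.rsplit("/", 1)[-1].startswith(prefix):
            keys.append(text)
    return keys


root = ET.fromstring(SAMPLE_XML)
part_keys = find_part_keys(root)
print(part_keys)
```

With the real response, `root` would instead come from `ET.parse(f).getroot()` as in the code above, and the resulting list can then be looped over to fetch each file.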