How to get a sorted list of objects from an AWS S3 bucket into an array

Asked: 2015-10-17 22:35:26

Tags: arrays python-2.7 sorting amazon-s3 boto

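I am trying to get a list of objects from an AWS S3 bucket using boto. The list should consist of the elements common to two different listings. I want it sorted by each object's "last_modified" in ascending order, meaning the oldest objects (by date) come first. I want to build such a list of 5 elements, process only the files that belong to it, delete those files, and then pick up the next list of 5 elements in the same way.

Here is the bucket hierarchy: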

//ship-my-data/outputs/444556677788.tar.gz
//ship-my-data/outputs/444556677788.tar.gz
//ship-my-data/outputs/345345345353.tar.gz

//ship-my-data/outputs1/ctrlFiles/ 444556677788.ctrl.tar.gz
//ship-my-data/outputs1/ctrlFiles/ 123222333444.ctrl.tar.gz
//ship-my-data/outputs1/ctrlFiles/ 769797977979.ctrl.tar.gz

I want to build a list of the elements common to the two folders above, i.e. the outputs1 & ctrlFiles folders.

Here is my code:

bucket = LogShip._aws_connection.get_bucket(aws_bucket_to_download) #Connecting to AWS s3 bucket

bucket_list_ctrl = bucket.list(prefix='outputs/ctrlFiles/', delimiter='/') #get the bucket list for control files.
ctrl_list = sorted(bucket_list_ctrl, key=lambda item1: item1.last_modified) # sort the list by last_modified date.

bucket_list_tar = bucket.list(prefix='outputs/', delimiter='/') #get the list for tar files.
tar_list = sorted(bucket_list_tar, key=lambda item2: item2.last_modified) # supposed to sort the tar listing, but raises AttributeError: 'Prefix' object has no attribute 'last_modified'

for item_c in ctrl_list:
    ctrlName = str(item_c.name).split("/")[2].replace(".ctrl.tar.gz","") # control file name: 1444447203130120001
    for item_t in bucket_list_tar:
        tarName = str(item_t.name).split("/")[1].replace(".tar.gz","") #tar file name: 1444447203130120001
    # now, from the two lists above, I want to build a master list of the common elements and pick up only 5 elements at a time to proceed further.
    j = 5
    while j <= 5:
        for elem in ctrlName:
            for elem in tarName:
                master_list.append(elem)
                j=j+1
            print master_list

Output:

['c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F']

Expected output:

[444556677788, 123222333444]

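Can anyone help me understand where I am going wrong?

1 Answer:

Answer 0 (score: 0)

I'm not sure why you want to do things in groups of five, so this code matches all the files at once: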

import re

import boto

# With no arguments, boto reads credentials from the environment or its config file
conn = boto.connect_s3()

bucket = conn.get_bucket('BUCKETNAME')

# Get two lists of files
bucket_list_ctrl = bucket.list(prefix='outputs/ctrlFiles/', delimiter='/')
bucket_list_tar  = bucket.list(prefix='outputs/', delimiter='/')

# Extract file IDs and modified dates; the 'gz' filter skips Prefix entries, which have no last_modified
pattern = re.compile(r'.*?(\d+)')
ctrl_files = [(pattern.match(obj.name).group(1), obj.last_modified) for obj in bucket_list_ctrl if obj.name.endswith('gz')]
list_files = [pattern.match(obj.name).group(1) for obj in bucket_list_tar if obj.name.endswith('gz')]

# Find filenames that match both
both = [obj for obj in ctrl_files if obj[0] in list_files]

# Give sorted result
result = [f[0] for f in sorted(both, key=lambda obj: obj[1])]
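
Two notes on the original code. Listing with delimiter='/' returns Prefix entries for "subfolders" such as outputs/ctrlFiles/, and those have no last_modified, hence the AttributeError. The single letters in the output come from `for elem in tarName`, which iterates over the characters of the string 'ctrlFiles' rather than over file names.

If you do need groups of five, the matching and batching could be sketched roughly as below. This is only a sketch under assumptions: FakeKey is a hypothetical stand-in for a boto key (just name and last_modified), file_id and oldest_common_batches are names made up here, the timestamps are invented, and last_modified is assumed to be an ISO 8601 string so that plain string sorting is chronological.

```python
from collections import namedtuple

# Hypothetical stand-in for a boto key: only the two attributes used here.
FakeKey = namedtuple('FakeKey', ['name', 'last_modified'])


def file_id(key_name, suffix):
    """Strip the folder part and the suffix: 'outputs/123.tar.gz' -> '123'."""
    base = key_name.rsplit('/', 1)[-1]
    return base[:-len(suffix)]


def oldest_common_batches(ctrl_keys, tar_keys, batch_size=5):
    """Yield lists of ids present in both listings, oldest control file first."""
    # The endswith() filters skip Prefix-like entries such as 'outputs/ctrlFiles/'.
    tar_ids = set(file_id(k.name, '.tar.gz')
                  for k in tar_keys if k.name.endswith('.tar.gz'))
    common = [(k.last_modified, file_id(k.name, '.ctrl.tar.gz'))
              for k in ctrl_keys
              if k.name.endswith('.ctrl.tar.gz')
              and file_id(k.name, '.ctrl.tar.gz') in tar_ids]
    ordered = [name for _, name in sorted(common)]
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Invented timestamps, file names taken from the question.
ctrl = [
    FakeKey('outputs/ctrlFiles/444556677788.ctrl.tar.gz', '2015-10-02T00:00:00.000Z'),
    FakeKey('outputs/ctrlFiles/123222333444.ctrl.tar.gz', '2015-10-03T00:00:00.000Z'),
    FakeKey('outputs/ctrlFiles/769797977979.ctrl.tar.gz', '2015-10-01T00:00:00.000Z'),
]
tars = [
    FakeKey('outputs/ctrlFiles/', None),  # Prefix-like entry, filtered out
    FakeKey('outputs/444556677788.tar.gz', '2015-10-02T00:00:00.000Z'),
    FakeKey('outputs/345345345353.tar.gz', '2015-10-02T00:00:00.000Z'),
    FakeKey('outputs/123222333444.tar.gz', '2015-10-03T00:00:00.000Z'),
]
batches = list(oldest_common_batches(ctrl, tars))
# batches == [['444556677788', '123222333444']]
```

With real data you would pass the two bucket.list(...) results instead of the fake lists, process each batch, and delete its files before moving on to the next one.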