I am trying to fetch a list of objects from an AWS S3 bucket using boto. The list consists of the elements common to two different listings, and I want it sorted in ascending order by each object's "last_modified" date, i.e. the oldest objects (by date) should come first in my list. From that sorted list I want to take five elements at a time, process only the files belonging to that batch, finally delete those files, and then pick up the next batch of five in the same way.
Here is the bucket hierarchy:
//ship-my-data/outputs/444556677788.tar.gz
//ship-my-data/outputs/345345345353.tar.gz
//ship-my-data/outputs1/ctrlFiles/444556677788.ctrl.tar.gz
//ship-my-data/outputs1/ctrlFiles/123222333444.ctrl.tar.gz
//ship-my-data/outputs1/ctrlFiles/769797977979.ctrl.tar.gz
I want to build a list of the elements common to the two folders above, i.e. the outputs1 and ctrlFiles folders.
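Conceptually, what I am after is just the intersection of the numeric ids found under the two prefixes. A minimal sketch of that goal, using the ids shown in the hierarchy above:

```python
# Ids taken from the bucket hierarchy above (for illustration only)
tar_ids = {'444556677788', '345345345353'}                    # from outputs/
ctrl_ids = {'444556677788', '123222333444', '769797977979'}   # from outputs1/ctrlFiles/

# The "common elements" are a set intersection, sorted for a stable order
common = sorted(tar_ids & ctrl_ids)
# → ['444556677788']
```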
Here is my code:
bucket = LogShip._aws_connection.get_bucket(aws_bucket_to_download)  # connect to the AWS S3 bucket
bucket_list_ctrl = bucket.list(prefix='outputs/ctrlFiles/', delimiter='/')  # get the bucket listing for control files
ctrl_list = sorted(bucket_list_ctrl, key=lambda item1: item1.last_modified)  # sort the list by last_modified date
bucket_list_tar = bucket.list(prefix='outputs/', delimiter='/')  # get the listing for tar files
tar_list = sorted(bucket_list_tar, key=lambda item2: item2.last_modified)  # supposed to sort the listing, but throws an error: AttributeError: 'Prefix' object has no attribute 'last_modified'
for item_c in ctrl_list:
    ctrlName = str(item_c.name).split("/")[2].replace(".ctrl.tar.gz", "")  # control file name: 1444447203130120001
for item_t in bucket_list_tar:
    tarName = str(item_t.name).split("/")[1].replace(".tar.gz", "")  # tar file name: 1444447203130120001
# now, from the above two lists, I want to prepare a master list of common elements
# and pick up only 5 elements to proceed further
j = 5
while j <= 5:
    for elem in ctrlName:
        for elem in tarName:
            master_list.append(elem)
    j = j + 1
print master_list
Output:
['c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F', 'i', 'l', 'e', 's', 'c', 't', 'r', 'l', 'F']
Expected output:
[444556677788, 123222333444]
Can anyone help me understand where I am going wrong?
Answer 0 (score 0):
I'm not sure why you want to work in groups of five, so this code matches all of the files at once:
import boto.s3
import re

# connect_s3() treats a positional argument as an access key id,
# so connect to the region explicitly instead
conn = boto.s3.connect_to_region('REGION')  # e.g. 'us-east-1'
bucket = conn.get_bucket('BUCKETNAME')

# Get two listings of files
bucket_list_ctrl = bucket.list(prefix='outputs/ctrlFiles/', delimiter='/')
bucket_list_tar = bucket.list(prefix='outputs/', delimiter='/')

# Extract the numeric id from each name (plus the modified date for ctrl files)
pattern = re.compile(r'.*?(\d+).*?')
ctrl_files = [(pattern.match(obj.name).group(1), obj.last_modified)
              for obj in bucket_list_ctrl]
tar_ids = [pattern.match(obj.name).group(1)
           for obj in bucket_list_tar if obj.name.endswith('gz')]

# Keep the ids that appear in both listings
both = [obj for obj in ctrl_files if obj[0] in tar_ids]

# Sort by last_modified (oldest first) and keep just the ids
result = [f[0] for f in sorted(both, key=lambda obj: obj[1])]
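If you really do need to handle the files five at a time (process a batch, delete it, move on), you can slice the sorted result into batches. This is only a sketch: `chunks` is a helper written here, the ids in `result` are hypothetical, and the actual deletion calls are left as comments because they need a live bucket.

```python
def chunks(seq, n):
    """Yield successive n-element slices of seq."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

# Suppose the sorted result came back with six ids (hypothetical values):
result = ['444556677788', '123222333444', '769797977979',
          '345345345353', '111122223333', '999988887777']

for batch in chunks(result, 5):
    # process the files for these ids, then delete both copies, e.g.:
    # bucket.delete_keys(['outputs/%s.tar.gz' % i for i in batch])
    # bucket.delete_keys(['outputs/ctrlFiles/%s.ctrl.tar.gz' % i for i in batch])
    pass
```

`Bucket.delete_keys` accepts a list of key names, so each batch of five can be removed in a single call per prefix.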