Download the latest file from a folder in an S3 bucket

Date: 2018-12-12 15:04:08

Tags: python amazon-s3 boto3 botocore

I am writing a Python script to download the latest file from a folder inside an S3 bucket. I understand how to download the most recently modified object from an S3 bucket, but the file I want is inside a folder within the bucket. I have no idea how to handle that or where to add it in my code. I tried appending the folder path to the end of the bucket name, but that doesn't seem to work.
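
For context, S3 has no real directories; a "folder" is just a key prefix, so the folder path goes into the Prefix parameter of list_objects_v2 rather than onto the bucket name. A minimal sketch of that idea, where the bucket name my-bucket and prefix my-folder/ are placeholders and pagination is ignored:

import boto3

s3 = boto3.client('s3')

# "Folders" in S3 are just key prefixes, so pass the folder path as Prefix.
resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='my-folder/')
objects = [o for o in resp.get('Contents', []) if not o['Key'].endswith('/')]

if objects:
    # Pick the object with the most recent LastModified timestamp and download it.
    latest = max(objects, key=lambda o: o['LastModified'])
    s3.download_file('my-bucket', latest['Key'], latest['Key'].split('/')[-1])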


1 Answer:

Answer 0 (score: 0)

Credit goes to the comments; this is just a small modification of a previous answer.

import os

import boto3


def download_latest_in_dir(prefix, local, bucket, client=boto3.client('s3'), nLatest=2):
    """
    from https://stackoverflow.com/questions/31918960/boto3-to-download-all-files-from-a-s3-bucket/31929277

    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    - nLatest: number of the most recent files to fetch from aws

    Example: download two latest files from aws directory ieee-temp/sst to local directory /home/hu-mka/Downloads/sst
    download_latest_in_dir(prefix='sst', local='/home/hu-mka/Downloads', bucket='ieee-temp', client=boto3.client('s3'), nLatest=2)
    """
    files = []
    times = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket':bucket,
        'Prefix':prefix,
    }
    ipage = 0
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents', [])
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                files.append(k)
                t = i.get('LastModified')
                times.append(t)
            else:
                print(f"Warning: there was a sub direcotory which we omit: {k}")
                #dirs.append(k)
        print(f"A page read {ipage}, last item: {files[-1]}, its time stamp:{times[-1]}")
        next_token = results.get('NextContinuationToken')
        ipage += 1
        #if ipage > 2:
        #    break
    # https://stackoverflow.com/questions/6618515/sorting-list-based-on-values-from-another-list
    time_sorted_filenames = [x for _, x in sorted(zip(times, files))]
    #print(time_sorted_filenames)
    for k in time_sorted_filenames[-nLatest:]:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)
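
A usage call along the lines of the docstring example, here with nLatest=1 to fetch only the single newest file (the bucket, prefix, and local path are the docstring's sample values; substitute your own):

import boto3

# Sample values from the docstring; adjust prefix/local/bucket to your own layout.
download_latest_in_dir(
    prefix='sst',
    local='/home/hu-mka/Downloads',
    bucket='ieee-temp',
    client=boto3.client('s3'),
    nLatest=1,
)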