Boto3: download all files from an S3 bucket

Date: 2015-08-10 11:58:17

Tags: python amazon-web-services amazon-s3 boto3

I am using boto3 to fetch files from an S3 bucket. I need functionality similar to aws s3 sync.

My current code is:

#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

This works fine as long as the bucket contains only files. If the bucket contains folders, it throws an error:

Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this the correct way to download a complete S3 bucket with boto3? And how do I download folders?

18 Answers:

Answer 0 (score: 61)

I had the same need and created the following function to download the files recursively. The directories are only created locally if they contain files.

import boto3
import os

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

Call the function this way:

def _start():
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    download_dir(client, resource, 'clientconf/', '/tmp', bucket='my-bucket')

Answer 1 (score: 35)

Amazon S3 does not have folders/directories. It is a flat file structure.

To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:

  • images/foo.jpg

In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.

I suspect that your problem is that boto is returning a file called my_folder/.8Df54234 and attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.

You could either truncate the filename to only save the .8Df54234 portion, or you would have to create the necessary directories before writing the file. Note that it could be multi-level nested directories.
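
A minimal sketch of the second approach (creating any missing, possibly multi-level, parent directories before writing); the bucket name and key below are placeholders taken from the question, and Python 3 is assumed for os.makedirs(..., exist_ok=True):

import os
import boto3

s3 = boto3.client('s3')

bucket = 'my_bucket_name'
key = 'my_folder/.8Df54234'  # example key similar to the one in the traceback

# Create the nested parent directories locally before downloading the object
parent_dir = os.path.dirname(key)
if parent_dir:
    os.makedirs(parent_dir, exist_ok=True)

s3.download_file(bucket, key, key)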

An easier way would be to use the AWS Command-Line Interface (CLI), which will do all of this work for you, e.g.:

aws s3 cp --recursive s3://my_bucket_name local_folder

There is also a sync option that will only copy new and modified files.
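
For example, to mirror the same bucket into a local folder (bucket and folder names are placeholders):

aws s3 sync s3://my_bucket_name local_folder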

Answer 2 (score: 27)

import os
import boto3

#initiate s3 resource
s3 = boto3.resource('s3')

# select bucket
my_bucket = s3.Bucket('my_bucket_name')

# download file into current directory
for s3_object in my_bucket.objects.all():
    # Need to split s3_object.key into path and file name, else it will give error file not found.
    path, filename = os.path.split(s3_object.key)
    my_bucket.download_file(s3_object.key, filename)
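
Note that this flattens the bucket: every object ends up in the current directory under its base file name only. If you would rather mirror the key paths locally, a minimal sketch of a variant (reusing my_bucket from above; Python 3 is assumed for exist_ok=True):

for s3_object in my_bucket.objects.all():
    # Skip "folder" placeholder keys that end with a slash
    if s3_object.key.endswith('/'):
        continue
    # Recreate the key's directory structure locally before downloading
    directory = os.path.dirname(s3_object.key)
    if directory:
        os.makedirs(directory, exist_ok=True)
    my_bucket.download_file(s3_object.key, s3_object.key)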

Answer 3 (score: 10)

I am currently achieving the task by using the following:

foo

While it does the job, I am not sure it is good to do it this way. I am leaving it here to help other users and further answers achieve this in a better manner.

Answer 4 (score: 8)

Better late than never :) The previous answer with the paginator is really good. However, it is recursive, and you may end up hitting Python's recursion limit. Here is an alternative approach, with a couple of extra checks.

import os
import errno
import boto3


def assert_dir_exists(path):
    """
    Checks if the directory tree in path exists. If not, it creates it.
    :param path: the path to check if it exists
    """
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise


def download_dir(client, bucket, path, target):
    """
    Downloads recursively the given S3 path to the target directory.
    :param client: S3 client to use.
    :param bucket: the name of the bucket to download from
    :param path: The S3 directory to download.
    :param target: the local directory to download the files to.
    """

    # Handle missing / at end of prefix
    if not path.endswith('/'):
        path += '/'

    paginator = client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
        # Download each file individually
        for key in result['Contents']:
            # Calculate relative path
            rel_path = key['Key'][len(path):]
            # Skip paths ending in /
            if not key['Key'].endswith('/'):
                local_file_path = os.path.join(target, rel_path)
                # Make sure directories exist
                local_file_dir = os.path.dirname(local_file_path)
                assert_dir_exists(local_file_dir)
                client.download_file(bucket, key['Key'], local_file_path)


client = boto3.client('s3')

download_dir(client, 'bucket-name', 'path/to/data', 'downloads')

Answer 5 (score: 1)

It is a very bad idea to get all files in one go; you should rather fetch them in batches.

One implementation I use to fetch a particular folder (directory) from S3 is:

from boto3.session import Session

def get_directory(directory_path, download_path, exclude_file_names):
    # prepare session (credentials, region_name and bucket_name are assumed to be defined elsewhere)
    session = Session(aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key,
                      region_name=region_name)

    # get instances for resource and bucket
    resource = session.resource('s3')
    bucket = resource.Bucket(bucket_name)

    # list the keys under the given prefix and download everything not excluded
    for s3_key in resource.meta.client.list_objects(Bucket=bucket_name, Prefix=directory_path)['Contents']:
        s3_object = s3_key['Key']
        if s3_object not in exclude_file_names:
            # keep only the file name locally
            bucket.download_file(s3_object, download_path + str(s3_object.split('/')[-1]))

And if you want the whole bucket, use the CLI as @John Rotenstein mentioned, like below:

aws s3 cp --recursive s3://bucket_name download_path

Answer 6 (score: 1)

I have a workaround for this that runs the AWS CLI in the same process.

Install awscli as a Python library:

pip install awscli

Then define this function:

import os

from awscli.clidriver import create_clidriver

def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:

        # Environment
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process
        exit_code = create_clidriver().main(list(cmd))

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        os.environ.clear()
        os.environ.update(old_env)

To execute:

aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')
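
Since the question here is about downloading, the same helper can be pointed in the other direction as well (the paths are placeholders):

aws_cli('s3', 'sync', 's3://bucket/source', '/path/to/destination')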

Answer 7 (score: 1)

When working with buckets that have 1000+ objects, it is necessary to implement a solution that uses the NextContinuationToken on sequential sets of at most 1000 keys. This solution first compiles a list of objects, then iteratively creates the specified directories and downloads the existing objects.

import os
import boto3

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

def download_dir(prefix, local, bucket,
                 client=s3_client, resource=s3_resource):
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket':bucket,
        'Prefix':prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = s3_client.list_objects_v2(**kwargs)
        contents = results.get('Contents')
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        resource.meta.client.download_file(bucket, k, dest_pathname)

Answer 8 (score: 1)

I have updated Grant's answer to run in parallel; it is much faster, if anyone is interested:

from concurrent import futures
import os
import boto3

def download_dir(prefix, local, bucket):

    client = boto3.client('s3')

    def create_folder_and_download_file(k):
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        print(f'downloading {k} to {dest_pathname}')
        client.download_file(bucket, k, dest_pathname)

    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket': bucket,
        'Prefix': prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents')
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    with futures.ThreadPoolExecutor() as executor:
        futures.wait(
            [executor.submit(create_folder_and_download_file, k) for k in keys],
            return_when=futures.FIRST_EXCEPTION,
        )

Answer 9 (score: 1)

import boto3, os

s3 = boto3.client('s3')

def download_bucket(bucket):
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket)
    for page in pages:
      if 'Contents' in page:
        for obj in page['Contents']:
            os.path.dirname(obj['Key']) and os.makedirs(os.path.dirname(obj['Key']), exist_ok=True) 
            try:
                s3.download_file(bucket, obj['Key'], obj['Key'])
            except NotADirectoryError:
                pass

# Change bucket_name to name of bucket that you want to download
download_bucket(bucket_name)

This should work for any number of objects (also for more than 1000). Each paginator page can contain up to 1000 objects. Note the extra parameter to the os.makedirs function, exist_ok=True, which makes it not throw an error if the path already exists.

Answer 10 (score: 0)

Another parallel downloader, using asyncio/aioboto:

import os, time
import asyncio
from itertools import chain
import json
from typing import List
from json.decoder import WHITESPACE
import logging
from functools import partial
from pprint import pprint as pp

# Third Party
import asyncpool
import aiobotocore.session
import aiobotocore.config

_NUM_WORKERS = 50


bucket_name= 'test-data'
bucket_prefix= 'etl2/test/20210330/f_api'


async def save_to_file(s3_client, bucket: str, key: str):
    
    response = await s3_client.get_object(Bucket=bucket, Key=key)
    async with response['Body'] as stream:
        content = await stream.read()
    
    if 1:
        fn =f'out/downloaded/{bucket_name}/{key}'

        dn= os.path.dirname(fn)
        if not os.path.isdir(dn):
            os.makedirs(dn,exist_ok=True)
        if 1:
            with open(fn, 'wb') as fh:
                fh.write(content)
                print(f'Downloaded to: {fn}')
   
    return [0]


async def go(bucket: str, prefix: str) -> List[dict]:
    """
    Returns list of dicts of object contents

    :param bucket: s3 bucket
    :param prefix: s3 bucket prefix
    :return: list of download statuses
    """
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()

    session = aiobotocore.session.AioSession()
    config = aiobotocore.config.AioConfig(max_pool_connections=_NUM_WORKERS)
    contents = []
    async with session.create_client('s3', config=config) as client:
        worker_co = partial(save_to_file, client, bucket)
        async with asyncpool.AsyncPool(None, _NUM_WORKERS, 's3_work_queue', logger, worker_co,
                                       return_futures=True, raise_on_join=True, log_every_n=10) as work_pool:
            # list s3 objects using paginator
            paginator = client.get_paginator('list_objects')
            async for result in paginator.paginate(Bucket=bucket, Prefix=prefix):
                for c in result.get('Contents', []):
                    contents.append(await work_pool.push(c['Key'], client))

    # retrieve results from futures
    contents = [c.result() for c in contents]
    return list(chain.from_iterable(contents))


def S3_download_bucket_files():
    s = time.perf_counter()
    _loop = asyncio.get_event_loop()
    _result = _loop.run_until_complete(go(bucket_name, bucket_prefix))
    assert sum(_result)==0, _result
    print(_result)
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.2f} seconds.")

It will first get the list of files from S3 and then download them using aioboto, with _NUM_WORKERS=50 reading data from the network in parallel.

Answer 11 (score: 0)

A lot of the solutions here get pretty complicated. If you are looking for something simpler, cloudpathlib wraps things nicely for this use case and will download directories or files.

from cloudpathlib import CloudPath

cp = CloudPath("s3://bucket/product/myproject/2021-02-15/")
cp.download_to("local_folder")

Note: for large folders with lots of files, the AWS CLI at the command line is likely to be faster.

Answer 12 (score: 0)

From AWS S3 Docs (How do I use folders in an S3 bucket?):

In Amazon S3, buckets and objects are the primary resources, and objects are stored in buckets. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using a shared name prefix for objects (that is, objects have names that begin with a common string). Object names are also referred to as key names.

For example, you can create a folder on the console called photos and store an object called myphoto.jpg in it. The object is then stored with the key name photos/myphoto.jpg, where photos/ is the prefix.

To download all files from "mybucket" into the current directory, following the bucket's emulated directory structure (creating the folders from the bucket if they don't already exist locally):

import boto3
import os

bucket_name = "mybucket"
s3 = boto3.client("s3")
objects = s3.list_objects(Bucket = bucket_name)["Contents"]
for s3_object in objects:
    s3_key = s3_object["Key"]
    path, filename = os.path.split(s3_key)
    if len(path) != 0 and not os.path.exists(path):
        os.makedirs(path)
    if not s3_key.endswith("/"):
        download_to = path + '/' + filename if path else filename
        s3.download_file(bucket_name, s3_key, download_to)

Answer 13 (score: 0)

I have been running into this problem for a while, and across all the different forums I have been through, I have not seen a full end-to-end snippet of what works. So I went ahead and took all the pieces (added some things of my own) and created a full end-to-end S3 downloader!

This will not only download files automatically, but if the S3 files are in subdirectories, it will create them on the local storage as well. For my application's instance I needed to set permissions and owners, so I added that too (it can be commented out if not needed).

This has been tested and works in a Docker environment (K8s), but I have added the environment variables in the script in case you want to test/run it locally.

I hope this helps someone in their quest for S3 download automation. I also welcome any advice, info, etc. on how this could be better optimized where needed.

#!/usr/bin/python3
import gc
import logging
import os
import signal
import sys
import time
from datetime import datetime

import boto
from boto.exception import S3ResponseError
from pythonjsonlogger import jsonlogger

formatter = jsonlogger.JsonFormatter('%(message)%(levelname)%(name)%(asctime)%(filename)%(lineno)%(funcName)')

json_handler_out = logging.StreamHandler()
json_handler_out.setFormatter(formatter)

#Manual Testing Variables If Needed
#os.environ["DOWNLOAD_LOCATION_PATH"] = "some_path"
#os.environ["BUCKET_NAME"] = "some_bucket"
#os.environ["AWS_ACCESS_KEY"] = "some_access_key"
#os.environ["AWS_SECRET_KEY"] = "some_secret"
#os.environ["LOG_LEVEL_SELECTOR"] = "DEBUG, INFO, or ERROR"

#Setting Log Level Test
logger = logging.getLogger('json')
logger.addHandler(json_handler_out)
logger_levels = {
    'ERROR' : logging.ERROR,
    'INFO' : logging.INFO,
    'DEBUG' : logging.DEBUG
}
logger_level_selector = os.environ["LOG_LEVEL_SELECTOR"]
logger.setLevel(logger_level_selector)

#Getting Date/Time
now = datetime.now()
logger.info("Current date and time : ")
logger.info(now.strftime("%Y-%m-%d %H:%M:%S"))

#Establishing S3 Variables and Download Location
download_location_path = os.environ["DOWNLOAD_LOCATION_PATH"]
bucket_name = os.environ["BUCKET_NAME"]
aws_access_key_id = os.environ["AWS_ACCESS_KEY"]
aws_access_secret_key = os.environ["AWS_SECRET_KEY"]
logger.debug("Bucket: %s" % bucket_name)
logger.debug("Key: %s" % aws_access_key_id)
logger.debug("Secret: %s" % aws_access_secret_key)
logger.debug("Download location path: %s" % download_location_path)

#Creating Download Directory
if not os.path.exists(download_location_path):
    logger.info("Making download directory")
    os.makedirs(download_location_path)

#Signal Hooks are fun
class GracefulKiller:
    kill_now = False
    def __init__(self):
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)
    def exit_gracefully(self, signum, frame):
        self.kill_now = True

#Downloading from S3 Bucket
def download_s3_bucket():
    conn = boto.connect_s3(aws_access_key_id, aws_access_secret_key)
    logger.debug("Connection established: ")
    bucket = conn.get_bucket(bucket_name)
    logger.debug("Bucket: %s" % str(bucket))
    bucket_list = bucket.list()
#    logger.info("Number of items to download: {0}".format(len(bucket_list)))

    for s3_item in bucket_list:
        key_string = str(s3_item.key)
        logger.debug("S3 Bucket Item to download: %s" % key_string)
        s3_path = download_location_path + "/" + key_string
        logger.debug("Downloading to: %s" % s3_path)
        local_dir = os.path.dirname(s3_path)

        if not os.path.exists(local_dir):
            logger.info("Local directory doesn't exist, creating it... %s" % local_dir)
            os.makedirs(local_dir)
            logger.info("Updating local directory permissions to %s" % local_dir)
#Comment or Uncomment Permissions based on Local Usage
            os.chmod(local_dir, 0o775)
            os.chown(local_dir, 60001, 60001)
        logger.debug("Local directory for download: %s" % local_dir)
        try:
            logger.info("Downloading File: %s" % key_string)
            s3_item.get_contents_to_filename(s3_path)
            logger.info("Successfully downloaded File: %s" % s3_path)
            #Updating Permissions
            logger.info("Updating Permissions for %s" % str(s3_path))
#Comment or Uncomment Permissions based on Local Usage
            os.chmod(s3_path, 0o664)
            os.chown(s3_path, 60001, 60001)
        except (OSError, S3ResponseError) as e:
            logger.error("Fatal error in s3_item.get_contents_to_filename", exc_info=True)
            # logger.error("Exception in file download from S3: {}".format(e))
            continue
        logger.info("Deleting %s from S3 Bucket" % str(s3_item.key))
        s3_item.delete()

def main():
    killer = GracefulKiller()
    while not killer.kill_now:
        logger.info("Checking for new files on S3 to download...")
        download_s3_bucket()
        logger.info("Done checking for new files, will check in 120s...")
        gc.collect()
        sys.stdout.flush()
        time.sleep(120)
if __name__ == '__main__':
    main()

Answer 14 (score: 0)

I added an if condition at the end of @glefait's answer to avoid OS error 20. The first key it gets is the folder name itself, which cannot be written to the destination path.

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            print("Content: ",result)
            dest_pathname = os.path.join(local, file.get('Key'))
            print("Dest path: ",dest_pathname)
            if not os.path.exists(os.path.dirname(dest_pathname)):
                print("here last if")
                os.makedirs(os.path.dirname(dest_pathname))
            print("else file key: ", file.get('Key'))
            if not file.get('Key') == dist:
                print("Key not equal? ",file.get('Key'))
                resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

Answer 15 (score: 0)

I had a similar requirement and got help from reading a few of the above solutions and other websites; I came up with the script below and just wanted to share it in case it helps anyone.

from boto3.session import Session
import os

def sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path):    
    session = Session(aws_access_key_id=access_key_id,aws_secret_access_key=secret_access_key)
    s3 = session.resource('s3')
    your_bucket = s3.Bucket(bucket_name)
    for s3_file in your_bucket.objects.all():
        if folder in s3_file.key:
            file=os.path.join(destination_path,s3_file.key.replace('/','\\'))
            if not os.path.exists(os.path.dirname(file)):
                os.makedirs(os.path.dirname(file))
            your_bucket.download_file(s3_file.key,file)
sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path)

Answer 16 (score: 0)

If you want to call a bash script using Python, here is a simple method to load files from a folder in an S3 bucket into a local folder (on a Linux machine):

import boto3
import subprocess
import os

###TOEDIT###
my_bucket_name = "your_my_bucket_name"
bucket_folder_name = "your_bucket_folder_name"
local_folder_path = "your_local_folder_path"
###TOEDIT###

# 1. Load the list of files existing in the bucket folder
FILES_NAMES = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('{}'.format(my_bucket_name))
for object_summary in my_bucket.objects.filter(Prefix="{}/".format(bucket_folder_name)):
#     print(object_summary.key)
    FILES_NAMES.append(object_summary.key)

# 2.List only new files that do not exist in local folder (to not copy everything!)
new_filenames = list(set(FILES_NAMES )-set(os.listdir(local_folder_path)))

# 3.Time to load files in your destination folder 
for new_filename in new_filenames:
    upload_S3files_CMD = """aws s3 cp s3://{}/{}/{} {}""".format(my_bucket_name,bucket_folder_name,new_filename ,local_folder_path)

    subprocess_call = subprocess.call([upload_S3files_CMD], shell=True)
    if subprocess_call != 0:
        print("ALERT: loading files not working correctly, please re-check new loaded files")

Answer 17 (score: 0)

for objs in my_bucket.objects.all():
    print(objs.key)
    path='/tmp/'+os.sep.join(objs.key.split(os.sep)[:-1])
    try:
        if not os.path.exists(path):
            os.makedirs(path)
        my_bucket.download_file(objs.key, '/tmp/'+objs.key)
    except FileExistsError as fe:                          
        print(objs.key+' exists')

This code will download the contents into the /tmp/ directory. You can change the directory if you want.