Downloading a folder from S3 with Boto3

Date: 2018-04-11 10:02:56

Tags: python amazon-s3 boto3 botocore

I'm using boto3, the AWS SDK for Python. I can download a single file with the bucket.download_file method. Is there a way to download an entire folder from S3 using Python and the boto3 library?
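For reference, the single-file case mentioned above looks roughly like this (a minimal sketch; the bucket and key names are placeholders):

import boto3

# download one object to a local file; all names here are hypothetical
s3 = boto3.resource('s3')
s3.Bucket('my-bucket').download_file('path/to/key.txt', 'local_file.txt')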

7 answers:

Answer 0 (score: 8):

A quick-and-dirty modification of the accepted answer from Konstantinos Katsantonis:

import boto3
import os

s3 = boto3.resource('s3')  # assumes credentials & configuration are handled outside python in .aws directory or environment variables

def download_s3_folder(bucket_name, s3_folder, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        bucket_name: the name of the s3 bucket
        s3_folder: the folder path in the s3 bucket
        local_dir: a relative or absolute directory path in the local file system
    """
    bucket = s3.Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=s3_folder):
        target = obj.key if local_dir is None \
            else os.path.join(local_dir, os.path.relpath(obj.key, s3_folder))
        if not os.path.exists(os.path.dirname(target)):
            os.makedirs(os.path.dirname(target))
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, target)

This also downloads nested subdirectories; I was able to download a directory with more than 3,000 files in it. You can find other solutions at script block, but I don't know whether they are any better.
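A hypothetical call (the bucket and folder names below are placeholders):

download_s3_folder('my-bucket', 'path/to/folder', local_dir='local_folder')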

Answer 1 (score: 3):

With boto3 you can set AWS credentials and download a dataset from S3:

import boto3
import os 

# set aws credentials 
s3r = boto3.resource('s3', aws_access_key_id='xxxxxxxxxxxxxxxxx',
    aws_secret_access_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
bucket = s3r.Bucket('bucket_name')

# downloading folder
prefix = 'dirname'
for obj in bucket.objects.filter(Prefix=prefix):
    # skip folder placeholder keys; create local directories as needed
    if obj.key.endswith('/'):
        os.makedirs(obj.key, exist_ok=True)
        continue
    if os.path.dirname(obj.key):
        os.makedirs(os.path.dirname(obj.key), exist_ok=True)
    bucket.download_file(obj.key, obj.key)

If you can't find your access_key and secret_access_key, refer to this page.
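As an aside, hardcoding keys can be avoided entirely; a sketch that lets boto3 resolve credentials on its own:

import boto3

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
# environment or from the ~/.aws/credentials file
s3r = boto3.resource('s3')
bucket = s3r.Bucket('bucket_name')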
I hope this helps.
Thanks.

Answer 2 (score: 2):

Quick and dirty, but it works:

import os
import boto3

def downloadDirectoryFroms3(bucketName, remoteDirectoryName):
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucketName)
    for obj in bucket.objects.filter(Prefix=remoteDirectoryName):
        if obj.key.endswith('/'):
            continue  # skip folder placeholder keys
        dirname = os.path.dirname(obj.key)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        bucket.download_file(obj.key, obj.key)

Assuming you want to download the directory foo/bar from s3, the for loop will iterate over all the files whose path starts with Prefix=foo/bar.
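For example, a hypothetical call (the bucket name is a placeholder):

downloadDirectoryFroms3('my-bucket', 'foo/bar')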

Answer 3 (score: 2):

You can also use cloudpathlib, which wraps boto3 for S3. For your use case, it's quite simple:

from cloudpathlib import CloudPath

cp = CloudPath("s3://bucket/folder/folder2/")
cp.download_to("local_folder")
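If the package isn't installed yet, it can be added with its S3 extra (a sketch, assuming pip):

pip install cloudpathlib[s3]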

Answer 4 (score: 1):

Here is another approach, building on @bjc's answer, that uses the built-in Path library and parses the s3 uri for you:

import boto3
from pathlib import Path
from urllib.parse import urlparse

def download_s3_folder(s3_uri, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        s3_uri: the s3 uri to the top level of the files you wish to download
        local_dir: a relative or absolute directory path in the local file system
    """
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(urlparse(s3_uri).hostname)
    s3_path = urlparse(s3_uri).path.lstrip('/')
    if local_dir is not None:
        local_dir = Path(local_dir)
    for obj in bucket.objects.filter(Prefix=s3_path):
        target = Path(obj.key) if local_dir is None else local_dir / Path(obj.key).relative_to(s3_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, str(target))
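A hypothetical call (the uri and local path are placeholders):

download_s3_folder('s3://my-bucket/path/to/folder', local_dir='local_folder')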

Answer 5 (score: 1):

The solutions above are all good and rely on the S3 resource.
The following solution achieves the same goal, but it uses an s3_client.
You may find it useful (I have tested it, and it works well).

import boto3
from os import path, makedirs
from botocore.exceptions import ClientError
from boto3.exceptions import S3TransferFailedError

def download_s3_folder(s3_folder, local_dir, aws_access_key_id, aws_secret_access_key, aws_bucket, debug_en):
    """ Download the contents of a folder directory into a local area """

    success = True

    print('[INFO] Downloading %s from bucket %s...' % (s3_folder, aws_bucket))

    def get_all_s3_objects(s3, **base_kwargs):
        continuation_token = None
        while True:
            list_kwargs = dict(MaxKeys=1000, **base_kwargs)
            if continuation_token:
                list_kwargs['ContinuationToken'] = continuation_token
            response = s3.list_objects_v2(**list_kwargs)
            yield from response.get('Contents', [])
            if not response.get('IsTruncated'):
                break
            continuation_token = response.get('NextContinuationToken')

    s3_client = boto3.client('s3',
                             aws_access_key_id=aws_access_key_id,
                             aws_secret_access_key=aws_secret_access_key)

    all_s3_objects_gen = get_all_s3_objects(s3_client, Bucket=aws_bucket)

    for obj in all_s3_objects_gen:
        source = obj['Key']
        if source.startswith(s3_folder):
            destination = path.join(local_dir, source)
            if not path.exists(path.dirname(destination)):
                makedirs(path.dirname(destination))
            try:
                s3_client.download_file(aws_bucket, source, destination)
            except (ClientError, S3TransferFailedError) as e:
                print('[ERROR] Could not download file "%s": %s' % (source, e))
                success = False
            if debug_en:
                print('[DEBUG] Downloading: %s --> %s' % (source, destination))

    return success
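A hypothetical call (every argument value below is a placeholder):

success = download_s3_folder('path/in/bucket/', 'local_dir',
                             aws_access_key_id='xxxxxxxxxxxxxxxxx',
                             aws_secret_access_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
                             aws_bucket='my-bucket',
                             debug_en=True)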

Answer 6 (score: 0):

You can call the awscli cp command from python to download an entire folder:

import os
import subprocess

remote_folder_name = 's3://my-bucket/my-dir'
local_path = '.'
if not os.path.exists(local_path):
    os.makedirs(local_path)
subprocess.run(['aws', 's3', 'cp', remote_folder_name, local_path, '--recursive'])

Some notes about this solution:

  1. You should install awscli (pip install awscli) and configure it. More info here.
  2. If you don't want to overwrite existing files that haven't changed, you can use sync instead of cp: subprocess.run(['aws', 's3', 'sync', remote_folder_name, local_path])
  3. Tested on python 3.6. On earlier versions of python you may need to replace subprocess.run with subprocess.call or os.system.
  4. The cli command executed by this code is aws s3 cp s3://my-bucket/my-dir . --recursive
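If you want a failed cli command to raise in python instead of passing silently, one option (a sketch using the standard library) is subprocess.run's check flag:

import subprocess

# raises subprocess.CalledProcessError if the aws cli exits non-zero
subprocess.run(['aws', 's3', 'cp', 's3://my-bucket/my-dir', '.', '--recursive'],
               check=True)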