I'm trying to download 12,000 files from an S3 bucket with a Jupyter notebook, and it estimates the download will take about 21 hours to complete. This is because each file is downloaded one at a time. Can we run multiple downloads in parallel so I can speed this up?
Currently I use the following code to download all the files:
### Get unique full-resolution image basenames
images = df['full_resolution_image_basename'].unique()
print(f'No. of unique full-resolution images: {len(images)}')
### Create a folder for full-resolution images
images_dir = './images/'
os.makedirs(images_dir, exist_ok=True)
### Download images
images_str = "','".join(images)
limiting_clause = (f"CONTAINS(ARRAY['{images_str}'], "
                   "full_resolution_image_basename)")
_ = download_full_resolution_images(images_dir,
                                    limiting_clause=limiting_clause)
Answer 0 (score: 3)
See the code below. This only works on Python 3.6+ because of f-strings (PEP 498); use a different string formatting method on older versions of Python.
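For example, the path construction inside fetch below could be written with str.format on older interpreters (a minimal sketch, not part of the original answer):

# Python < 3.6 equivalent of: file = f'{abs_path}/{key}'
file = '{}/{}'.format(abs_path, key)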
Provide relative_path, bucket_name, and s3_object_keys. In addition, max_workers is optional; if it is not provided, the ThreadPoolExecutor defaults to 5 times the number of processors on the machine.
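If you would rather set max_workers explicitly than rely on that default, a minimal sketch (the factor of 5 is only an assumption that mirrors the executor's own I/O-oriented default):

import os

# 5 threads per CPU mirrors ThreadPoolExecutor's default for I/O-bound work
max_workers = 5 * (os.cpu_count() or 1)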
Most of the code in this answer comes from an answer to How to create an async generator in Python?, which in turn is based on this example documented in the library.
import boto3
import os
from concurrent import futures
relative_path = './images'
bucket_name = 'bucket_name'
s3_object_keys = [] # List of S3 object keys
max_workers = 5
abs_path = os.path.abspath(relative_path)
s3 = boto3.client('s3')
def fetch(key):
    file = f'{abs_path}/{key}'
    # Create the parent directory of the file, not a directory named after the file itself
    os.makedirs(os.path.dirname(file), exist_ok=True)
    with open(file, 'wb') as data:
        s3.download_fileobj(bucket_name, key, data)
    return file

def fetch_all(keys):
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one download task per key and remember which future maps to which key
        future_to_key = {executor.submit(fetch, key): key for key in keys}

        print("All URLs submitted.")

        # Yield results (or exceptions) as the downloads finish, in completion order
        for future in futures.as_completed(future_to_key):
            key = future_to_key[future]
            exception = future.exception()

            if not exception:
                yield key, future.result()
            else:
                yield key, exception

for key, result in fetch_all(s3_object_keys):
    print(f'key: {key} result: {result}')
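The code above assumes s3_object_keys is already populated. If you need to build it from the bucket itself, a minimal sketch using boto3's list_objects_v2 paginator (the Prefix value is a hypothetical placeholder for wherever the images live):

# Collect object keys from the bucket; adjust Prefix to match your layout (hypothetical)
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix='full_resolution/'):
    for obj in page.get('Contents', []):
        s3_object_keys.append(obj['Key'])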