Question

I'm using Python's requests with smart_open to download files (.tif files, in case that matters) and upload them to an S3 bucket without saving any temporary files. For each request I iterate over thousands of URLs. This is the function I wrote:

import os

import boto3
import requests
import smart_open
from bs4 import BeautifulSoup


def stream_download_s3(url,
                       aws_key,
                       aws_secret,
                       aws_bucket_name,
                       path,
                       auth):
    """
    Stream files from request to S3
    """

    headers = {'Authorization': f'Bearer {auth}',
               'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15'}
    session = boto3.Session(
        aws_access_key_id=aws_key,
        aws_secret_access_key=aws_secret
    )

    # Drop the leading 's3://' (5 characters) from the path and bucket name.
    bucket_path_strip = path[5:]
    bucket_name_strip = aws_bucket_name[5:]

    with requests.Session() as s:
        s.headers.update(headers)
        try:
            with s.get(url) as r:
                if r.status_code == requests.codes.ok:
                    soup = BeautifulSoup(r.content)
                    download_files = [link.contents[0] for link in
                                      soup.find_all('a') if '.tif' in
                                      link.contents[0]]

                    for file_name in download_files:
                        save_file = os.path.join(path, file_name)
                        # check_s3_exists is my own helper (not shown here).
                        if check_s3_exists(session,
                                           bucket_name_strip,
                                           os.path.join(bucket_path_strip, file_name)):
                            print(f'S3: {os.path.join(path, file_name)} already exists. Skipping download')
                        else:
                            with s.get(url + file_name) as file_request:
                                if file_request.status_code == requests.codes.ok:
                                    with smart_open.open(save_file, 'wb',
                                                         transport_params=dict(session=session)) as so:
                                        so.write(file_request.content)
                else:
                    print(f'Request GET failed with {r.content} [{r.url}]')
        except requests.exceptions.HTTPError as err:
            print(f'{err}')

The function first makes a request to scrape all the available image URLs (the bs4 part), then iterates over the retrieved URLs and downloads the content of each one. The content of each response is the binary data I pass to smart_open's open function to upload it to S3.

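The whole process takes about 150 minutes for 510 images (less than 2 GB), while a combination of wget and aws s3 cp does the same job in about 86 minutes (wget took 1h 26m 46s and s3 cp took a few seconds).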

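Some considerations:

I'm running this on an AWS machine. Some APIs block AWS-like IPs, but that is not the case here. I don't know whether it makes the downloads slower. Also, S3 and EC2 are in the same region.
I know that stream=True in requests.get() is an alternative, but as far as I know it is mainly meant for streaming large files without filling up memory. Would it change anything here?
A similar implementation using io.BytesIO gives similar results. Am I doing something wrong there?
I'm using requests because I like the API (a lot!), but if there are other options I'm happy to try them :-)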
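
For reference, the stream=True variant mentioned above could look roughly like the sketch below. It is only an illustration: copy_stream, CHUNK_SIZE and the in-memory stand-ins are names made up for this example, and the commented-out wiring assumes the same s, url, file_name, save_file and session objects as the function in the question. What it shows is that the body is copied in fixed-size chunks, so memory use stays bounded; it does not by itself make a slow server respond faster.

```python
import io
import shutil

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def copy_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Copy a readable binary file-like object to a writable one in chunks.

    Only about chunk_size bytes are held in memory at any time.
    """
    shutil.copyfileobj(src, dst, chunk_size)


# In-memory stand-ins for the HTTP body and the S3 object, so the sketch
# runs without network access or AWS credentials.
fake_response_body = io.BytesIO(b"II*\x00" + b"\x00" * (3 * CHUNK_SIZE))
fake_s3_object = io.BytesIO()
copy_stream(fake_response_body, fake_s3_object)

# Hypothetical wiring with the objects from the question (not executed here):
#
# with s.get(url + file_name, stream=True) as file_request:
#     if file_request.status_code == requests.codes.ok:
#         with smart_open.open(save_file, 'wb',
#                              transport_params=dict(session=session)) as so:
#             copy_stream(file_request.raw, so)
```

One caveat: file_request.raw exposes the bytes as they arrive on the wire, so if the server compresses its responses, iterating over file_request.iter_content(chunk_size=...) and writing each chunk is the safer choice.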

Answer 1

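I found a way!

I realized the problem was not in the upload step. The server I was requesting from was slowing down my requests, so running the downloads in parallel made more sense. I didn't reach for multiprocessing right away: this function is one moving part of a Luigi pipeline, and I wasn't sure how to add it to code that already runs in parallel for each Task.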
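
To illustrate the reasoning (most of the time is spent waiting on the remote server, not uploading), here is a small self-contained sketch. The network wait is simulated with time.sleep, and fake_download, SIMULATED_LATENCY and the file names are invented for the example, so the timings say nothing about the real server.

```python
import concurrent.futures
import time

SIMULATED_LATENCY = 0.1  # seconds; a made-up stand-in for a slow server response


def fake_download(file_name):
    """Pretend to fetch one file: sleep instead of waiting on the network."""
    time.sleep(SIMULATED_LATENCY)
    return file_name


def run_sequential(file_names):
    """Fetch the files one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_download(name) for name in file_names]
    return results, time.perf_counter() - start


def run_threaded(file_names, max_workers=8):
    """Fetch the files in a thread pool and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        results = list(executor.map(fake_download, file_names))
    return results, time.perf_counter() - start


names = [f"image_{i}.tif" for i in range(8)]
seq_results, seq_time = run_sequential(names)
thr_results, thr_time = run_threaded(names)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Because the threads spend their time blocked on I/O, the waits overlap and the threaded run finishes in roughly the time of a single wait.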

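I gave concurrent.futures a try (it has been in the standard library since Python 3.2) and I'm happy with the results. This is the same function as above, but parallelized: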

import concurrent.futures

import boto3
import requests
from bs4 import BeautifulSoup


def stream_download_s3_parallel(url,
                                aws_key,
                                aws_secret,
                                aws_bucket_name,
                                path,
                                auth,
                                max_workers=10):
    """
    Stream files from request to S3
    """

    headers = {'Authorization': f'Bearer {auth}',
               'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15'}
    session = boto3.Session(
        aws_access_key_id=aws_key,
        aws_secret_access_key=aws_secret
    )

    with requests.Session() as s:
        s.headers.update(headers)
        try:
            with s.get(url) as r:
                if r.status_code == requests.codes.ok:
                    # Scrape the names of all .tif files linked from the page.
                    soup = BeautifulSoup(r.content)
                    download_files = [link.contents[0] for link in
                                      soup.find_all('a') if '.tif' in
                                      link.contents[0]]

                    # Submit one requests_to_s3 call per file to the thread pool.
                    # Leaving the with block waits for all of them to finish.
                    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                        future_to_url = {executor.submit(requests_to_s3,
                                                         url,
                                                         file_name,
                                                         aws_bucket_name,
                                                         path,
                                                         auth,
                                                         session): file_name for file_name in download_files}
                        return future_to_url
        except requests.exceptions.HTTPError as err:
            # Same error handling as in the question's version.
            print(f'{err}')

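Here requests_to_s3 is a simple function that takes the few parameters needed to make the request and upload to S3 with smart_open; it is basically the same as the code in the question. Each executor.submit call returns a Future, and concurrent.futures.as_completed yields those futures as they finish. Since I'm saving straight to S3 I have no use for the return values, but if you need them you can do something like this: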

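# Collect each task's return value as it completes
# (result() re-raises any exception raised in the worker).
results_process = []
for future in concurrent.futures.as_completed(future_to_url):
    results_process.append(future.result())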

This takes whatever your function returns and appends it to the results_process list.
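
If some transfers may fail, the same loop can also record which file each failure belongs to, since the dictionary maps every future back to its file name. The sketch below is an assumption-laden illustration rather than part of my pipeline: fake_transfer is a hypothetical stand-in for requests_to_s3, and the bucket and file names are invented.

```python
import concurrent.futures


def fake_transfer(file_name):
    """Hypothetical stand-in for requests_to_s3 that fails for one file."""
    if file_name == "broken.tif":
        raise IOError(f"could not fetch {file_name}")
    return f"s3://example-bucket/{file_name}"


def collect_results(future_to_name):
    """Split finished futures into successes and failures, keyed by file name."""
    succeeded, failed = {}, {}
    for future in concurrent.futures.as_completed(future_to_name):
        name = future_to_name[future]
        try:
            succeeded[name] = future.result()
        except Exception as exc:  # result() re-raises the worker's exception
            failed[name] = exc
    return succeeded, failed


file_names = ["a.tif", "broken.tif", "b.tif"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    future_to_name = {executor.submit(fake_transfer, name): name
                      for name in file_names}
    succeeded, failed = collect_results(future_to_name)

print(f"{len(succeeded)} transferred, {len(failed)} failed")
```

Catching the exception per future means one bad file does not abort the rest, and the failed dictionary can be used to retry or log the problem files afterwards.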

I'm still not sure whether I prefer this approach to the older multiprocessing library, but it does seem cleaner.
