I have a use case where I need to download a large remote file in parts, using multiple threads. Each thread must run simultaneously (in parallel), fetching a specific part of the file. Once all parts have been downloaded successfully, they should be combined back into the single (original) file.
Perhaps the requests library could do the job, but I'm not sure how to turn this into a multithreaded solution that also combines the chunks afterwards.
from requests import get

url = 'https://url.com/file.iso'
headers = {"Range": "bytes=0-1000000"}  # first megabyte
r = get(url, headers=headers)
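One thing I do know how to check is whether the server supports range requests at all, which this whole approach assumes; a minimal sketch with a HEAD request (same placeholder URL as above):

import requests

url = 'https://url.com/file.iso'  # placeholder URL as above
head = requests.head(url, allow_redirects=True)
print(head.headers.get('Accept-Ranges'))   # 'bytes' means range requests are supported
print(head.headers.get('Content-Length'))  # total size, needed to split into parts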
I have also considered using curl and driving the downloads from Python, but I'm not sure that's the right approach. It seems too complex and strays from a pure Python solution. Something like this:
curl --range 200000000-399999999 -o file.iso.part2
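If I went that route, I'd presumably shell out via subprocess, roughly like this sketch (the range and part name are just illustrative):

import subprocess

# hypothetical sketch: fetch one 200 MB slice of the file as part 2
subprocess.run([
    'curl', '--range', '200000000-399999999',
    '-o', 'file.iso.part2',
    'https://url.com/file.iso',
], check=True)

Afterwards the parts could be concatenated with something like cat file.iso.part* > file.iso.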
Can someone explain how you would approach something like this, or post a code example that works in Python 3? I can usually find Python-related answers easily, but this one has me stuck.
Answer 0 (score: 1)
You can use grequests to download in parallel.
import grequests

URL = 'https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-10.1.0-amd64-netinst.iso'
CHUNK_SIZE = 104857600  # 100 MB

HEADERS = []
for x in range(4):  # file size is > 300 MB, so we download in 4 parts
    start = x * CHUNK_SIZE
    stop = start + CHUNK_SIZE - 1  # Range is inclusive on both ends
    HEADERS.append({"Range": "bytes=%s-%s" % (start, stop)})

# issue the four range requests concurrently
rs = (grequests.get(URL, headers=h) for h in HEADERS)
downloads = grequests.map(rs)  # grequests.map preserves request order

with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'wb') as f:
    for download in downloads:
        print(download.status_code)  # expect 206 Partial Content
        f.write(download.content)
PS: I haven't checked that the ranges are determined correctly, or that the md5sum of the download matches! But overall this should show how it can work.
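To cover that last point: a minimal sketch of verifying the checksum after the merge, assuming the expected md5 is taken from the mirror (the value below is a placeholder):

import hashlib

EXPECTED_MD5 = '...'  # placeholder; take the real value from the mirror's MD5SUMS file

md5 = hashlib.md5()
with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'rb') as f:
    for block in iter(lambda: f.read(1 << 20), b''):  # read in 1 MB blocks
        md5.update(block)
print(md5.hexdigest() == EXPECTED_MD5)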
Answer 1 (score: 1)
Here is a version using Python 3 and asyncio. It's just an example and can be improved, but it should give you everything you need.
get_size: send a HEAD request to get the file size
download_range: download a single chunk
download: download all chunks and merge them

import asyncio
import concurrent.futures
import requests
import os

URL = 'https://file-examples.com/wp-content/uploads/2017/04/file_example_MP4_1920_18MG.mp4'
OUTPUT = 'video.mp4'

async def get_size(url):
    # a single blocking HEAD request to read the total file size
    response = requests.head(url)
    size = int(response.headers['Content-Length'])
    return size

def download_range(url, start, end, output):
    # download one inclusive byte range into its own part file
    headers = {'Range': f'bytes={start}-{end}'}
    response = requests.get(url, headers=headers)
    with open(output, 'wb') as f:
        for part in response.iter_content(1024):
            f.write(part)

async def download(executor, url, output, chunk_size=1000000):
    loop = asyncio.get_event_loop()
    file_size = await get_size(url)
    # one starting offset per chunk
    chunks = range(0, file_size, chunk_size)
    # run the blocking downloads in the thread pool, one part file per chunk
    tasks = [
        loop.run_in_executor(
            executor,
            download_range,
            url,
            start,
            start + chunk_size - 1,
            f'{output}.part{i}',
        )
        for i, start in enumerate(chunks)
    ]
    await asyncio.wait(tasks)
    # merge the part files in order, then clean them up
    with open(output, 'wb') as o:
        for i in range(len(chunks)):
            chunk_path = f'{output}.part{i}'
            with open(chunk_path, 'rb') as s:
                o.write(s.read())
            os.remove(chunk_path)

if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(
            download(executor, URL, OUTPUT)
        )
    finally:
        loop.close()
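As a side note, on Python 3.7+ the event-loop boilerplate at the bottom could likely be shortened with asyncio.run, along these lines:

if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    # asyncio.run creates the loop, runs the coroutine, and closes the loop
    asyncio.run(download(executor, URL, OUTPUT))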