How to parallelize file downloads?

Asked: 2015-08-03 09:58:58

Tags: python python-3.x download subprocess wget

I can download the files one at a time with urlretrieve.

Alternatively, I can "cheat" and shell out to wget through subprocess, like this:

import subprocess
import os

def parallelized_commandline(command, files, max_processes=2):
    processes = set()
    for name in files:
        processes.add(subprocess.Popen([command, name]))
        if len(processes) >= max_processes:
            os.wait()
            processes.difference_update(
                [p for p in processes if p.poll() is not None])
    # Check if all the child processes were closed
    for p in processes:
        if p.poll() is None:
            p.wait()

urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
        'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz',
        'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']

parallelized_commandline('wget', urls)

Is there any way to parallelize the urlretrieve calls without "cheating" through os.system or subprocess?

Given that I have to resort to the "cheat" for now, is subprocess.Popen (as in parallelized_commandline() above) the right way to download the data?

When running parallelized_commandline() with wget as above, it uses multiple threads but not multiple cores; is that normal? Is there a way to make it multi-core rather than multi-threaded?
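As an aside, the manual os.wait()/poll() throttling in parallelized_commandline() can also be expressed with the standard library's concurrent.futures; this is a sketch of that idea (not the asker's code, and it needs Python 3.5+ for subprocess.run()):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def parallelized_commandline(command, files, max_processes=2):
    # Each worker thread blocks in subprocess.run() until its child exits,
    # so at most max_processes commands run at the same time.
    with ThreadPoolExecutor(max_workers=max_processes) as pool:
        run = lambda name: subprocess.run([command, name]).returncode
        return list(pool.map(run, files))
```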

1 Answer:

Answer 0 (score: 33):

You could use a thread pool to download the files in parallel:

#!/usr/bin/env python3
from multiprocessing.dummy import Pool # use threads for I/O bound tasks
from urllib.request import urlretrieve

urls = [...]
result = Pool(4).map(urlretrieve, urls) # download 4 files at a time
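One caveat: calling urlretrieve(url) with a single argument saves each file to a temporary location rather than choosing the output name. A sketch (the local filenames here are chosen for illustration) that uses Pool.starmap to pass an explicit filename for each URL:

```python
from multiprocessing.dummy import Pool  # thread pool, as above
from urllib.request import urlretrieve

# (url, local_filename) pairs; the local names are illustrative
downloads = [
    ('http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
     'news-commentary-v10.en.gz'),
    ('http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz',
     'news-commentary-v10.cs.gz'),
]

def fetch_all(pairs, workers=4):
    # starmap unpacks each (url, filename) pair into urlretrieve(url, filename)
    with Pool(workers) as pool:
        return pool.starmap(urlretrieve, pairs)
```

Swapping multiprocessing.dummy for multiprocessing would give a process pool (one process per download) if true multi-core execution is wanted, though downloads are I/O-bound and threads are normally enough.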

You could also use asyncio to download several files at once in a single thread:

#!/usr/bin/env python3
import asyncio
import logging
from contextlib import closing
import aiohttp # $ pip install aiohttp

@asyncio.coroutine
def download(url, session, semaphore, chunk_size=1<<15):
    with (yield from semaphore): # limit number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        response = yield from session.get(url)
        with closing(response), open(filename, 'wb') as file:
            while True: # save file
                chunk = yield from response.content.read(chunk_size)
                if not chunk:
                    break
                file.write(chunk)
        logging.info('done %s', filename)
    return filename, (response.status, tuple(response.headers.items()))

urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
with closing(asyncio.get_event_loop()) as loop, \
     closing(aiohttp.ClientSession()) as session:
    semaphore = asyncio.Semaphore(4)
    download_tasks = (download(url, session, semaphore) for url in urls)
    result = loop.run_until_complete(asyncio.gather(*download_tasks))

where url2filename() is defined here.
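The link target for url2filename() did not survive above; a minimal version consistent with how the snippet uses it (an assumption: take the basename of the URL's path) could be:

```python
import posixpath
from urllib.parse import urlsplit

def url2filename(url):
    # 'http://host/dir/file.gz' -> 'file.gz' (minimal sketch, no sanitizing)
    return posixpath.basename(urlsplit(url).path)
```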