Python 3 - How do I set up this multiprocessing job correctly?

Asked: 2017-06-05 21:12:23

Tags: python python-3.x parallel-processing multiprocessing

I have a file with 10,000 lines, each line representing the parameters of a download job. I have 5 custom downloaders. Each job can take anywhere from 5 seconds to 2 minutes. How can I create something that iterates over the 10,000 lines and assigns each job to a downloader whenever that downloader is not currently working?

Edit:

The hard part for me is that each Downloader is an instance of a class, and the difference between the instances is the port_number I specify when instantiating each of the 5 Downloader objects. So I have a = Downloader(port_number=7751) ... e = Downloader(port_number=7755). Then, to use a Downloader, I call a.run(row).

How can I define the workers as a, b, c, d, e rather than as a downloader function?

2 Answers:

Answer 0 (score: 2)

There are many ways to do this - the simplest is to use multiprocessing.Pool and let it organize the workers for you. 10k lines is not that much; even if the average URL were a full kilobyte long, the list would still take only about 10MB of memory, and memory is cheap.

So, just read the file into memory and map it over a multiprocessing.Pool to do your bidding:

from multiprocessing import Pool

def downloader(param):  # our downloader process
    # download code here
    # param will hold a line from your file (including newline at the end, strip before use)
    # e.g. res = requests.get(param.strip())
    return True  # let's provide some response back

if __name__ == "__main__":  # important protection for cross-platform use

    with open("your_file.dat", "r") as f:  # open your file
        download_jobs = f.readlines()  # store each line in a list

    download_pool = Pool(processes=5)  # make our pool use 5 processes
    responses = download_pool.map(downloader, download_jobs)  # map our data, line by line
    download_pool.close()  # let's exit cleanly
    download_pool.join()  # wait for the worker processes to terminate
    # you can check the response for each line in the `responses` list

If you need shared memory, you can also use threading instead of multiprocessing (or multiprocessing.pool.ThreadPool as a drop-in replacement) to do everything in a single process. Unless you're doing other processing as well, a single process is enough for downloading.
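As a minimal sketch of that ThreadPool variant (with a stub downloader and placeholder URLs standing in for real download code):

```python
from multiprocessing.pool import ThreadPool

def downloader(param):  # stub: a real version would fetch the URL
    return param.strip()

jobs = ["http://example.com/a\n", "http://example.com/b\n"]  # placeholder lines
with ThreadPool(processes=5) as pool:  # threads share memory, no pickling needed
    results = pool.map(downloader, jobs)
print(results)
```

Because the workers are threads, the downloader function can read and write shared state directly, with the usual locking caveats.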

Update

If you want your downloaders to run as class instances, you can transform the downloader function into a factory for Downloader instances, and then just pass what you need to instantiate them together with the URL. Here is a simple round-robin approach:

from itertools import cycle
from multiprocessing import Pool

class Downloader(object):

    def __init__(self, port_number=8080):
        self.port_number = port_number

    def run(self, url):
        print("Downloading {} on port {}".format(url, self.port_number))

def init_downloader(params):  # our downloader initializer
    downloader = Downloader(**params[0])  # instantiate our downloader
    downloader.run(params[1])  # run our downloader
    return True  # you can provide your own response here

if __name__ == "__main__":  # important protection for cross-platform use

    downloader_params = [  # Downloaders will be initialized using these params
        {"port_number": 7751},
        {"port_number": 7851},
        {"port_number": 7951}
    ]

    downloader_cycle = cycle(downloader_params)  # use cycle for round-robin distribution
    with open("your_file.dat", "r") as f:  # open your file
        # read our file line by line and attach downloader params to it
        download_jobs = [[next(downloader_cycle), row.strip()] for row in f]

    download_pool = Pool(processes=5)  # make our pool use 5 processes
    responses = download_pool.map(init_downloader, download_jobs)  # map our data
    download_pool.close()  # let's exit cleanly
    download_pool.join()  # wait for the worker processes to terminate
    # you can check the response for each line in the `responses` list

Keep in mind that this is not the most balanced solution, as two Downloader instances using the same port may happen to run at the same time, but it will average out over large enough data.

If you want to make sure that no two Downloader instances ever run on the same port, you'll either need to build your own pool, or you'll need to create a central process that issues ports to your Downloader instances when they need them.

Answer 1 (score: 1)

Read the 10,000 lines into a list of strings.

with open('foo.dat') as f:
    data = f.readlines()

Assuming the data doesn't contain the port numbers (the edited question mentions 5 ports), you should add them to the data.

import itertools

data = [(p, d) for p, d in zip(itertools.cycle([7751, 7752, 7753, 7754, 7755]), data)]

Write a function that takes one of those tuples as an argument, splits it apart, creates a Downloader object, and runs it.

def worker(target):
    port, params = target
    d = Downloader(port_number=port)
    d.run(params)
    return params # for lack of more information.

Use the imap_unordered method of multiprocessing.Pool, with the function defined above and the list of tuples as arguments.

The iterator returned by imap_unordered starts yielding results as soon as they become available. You could print them to show progress.

import multiprocessing

p = multiprocessing.Pool()
for params in p.imap_unordered(worker, data):
    print('Finished downloading', params)

Edit

P.S.: If run() is the only method of the Downloader object that you use, it shouldn't be an object. It's a function in disguise! Search YouTube for the "Stop Writing Classes" video and watch it.
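To illustrate that point (a hypothetical rewrite with a stub body, using a port from the question): the single-method class collapses into a plain function, and functools.partial can stand in for the per-port instances:

```python
from functools import partial

def download(port_number, url):  # stub: a real version would do the download
    return "Downloading {} on port {}".format(url, port_number)

# bind a port up front to get something resembling the old `a = Downloader(port_number=7751)`
a = partial(download, 7751)
print(a("http://example.com"))  # → Downloading http://example.com on port 7751
```

The partials are also picklable under the fork start method's inheritance rules, and plain functions map more naturally onto Pool workers than bound methods do.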