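Python stdout during multiprocessing

Time: 2015-03-26 07:39:54

Tags: python multiprocessing stdout

I'm scraping a website and I want to print a counter that shows progress. I had this working when the scrape ran serially. (It's a two-step scrape.)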

from multiprocessing import Pool
from sys import stdout
from bs4 import BeautifulSoup

global searched_counter,processed_counter
searched_counter = 0
processed_counter = 0

def run_scrape(var_input):
    global searched_counter,processed_counter
    # get search results
    parsed = None  # placeholder: parse the search results using bs4

    searched_counter += 1
    stdout.write("\rTotal Searched/Processed: %d/%d" % (searched_counter,processed_counter))
    stdout.flush()

    if parsed:       #only go to next page if result is what I want
        #get the page I want using parsed data
        #parse some more and write out to file

        processed_counter += 1
        stdout.write("\rTotal Searched/Processed: %d/%d" % (searched_counter,processed_counter))
        stdout.flush()    


list_to_scrape = ["data%05d" % (x,) for x in range(1,10000)]
pool = Pool(8)
pool.map(run_scrape,list_to_scrape)

stdout.write('\n')

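When I run it with multiprocessing, the output gets garbled: it prints lots of seemingly random numbers that have nothing to do with what is actually being written to the files...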

2 Answers:

Answer 0: (score: 2)

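Ordinary Python variables can't be shared between processes, so each worker process in the pool gets its own copy of searched_counter and processed_counter, and incrementing them in one process won't have any effect on the others. The multiprocessing library has a few ways to share state between processes, but the simplest one for your use case is multiprocessing.Value: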

from multiprocessing import Pool, Value
from sys import stdout

def init(s, p):
    global searched_counter, processed_counter
    searched_counter = s
    processed_counter = p

def run_scrape(var_input):
    global searched_counter, processed_counter
    # get search results
    parsed = None  # placeholder: parse the search results using bs4

    with searched_counter.get_lock():
        searched_counter.value += 1
    stdout.write("\rTotal Searched/Processed: %d/%d" % 
                    (searched_counter.value, processed_counter.value))
    stdout.flush()

    if parsed:
        with processed_counter.get_lock():
            processed_counter.value += 1
        stdout.write("\rTotal Searched/Processed: %d/%d" % 
                        (searched_counter.value, processed_counter.value))
        stdout.flush()    


if __name__ == "__main__":
    searched_counter = Value('i', 0)
    processed_counter = Value('i', 0)

    list_to_scrape = ["data%05d" % (x,) for x in range(1,10000)]
    pool = Pool(8, initializer=init, initargs=(searched_counter, processed_counter))
    pool.map(run_scrape, list_to_scrape)

    stdout.write('\n')

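Note that I explicitly pass the counters from the parent process to the children using the initializer / initargs keyword arguments. This is a best practice and helps ensure Windows compatibility.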
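To see the difference between a plain global and a shared Value in isolation, here is a minimal sketch with the scraping removed. The names plain_counter, shared_counter, init_worker, bump and work are made up for this illustration and are not part of the original code.

```python
from multiprocessing import Pool, Value

plain_counter = 0        # ordinary global: every worker process has its own copy
shared_counter = None    # set in each worker by init_worker()


def init_worker(counter):
    """Runs once in every worker; stores the shared Value in a global."""
    global shared_counter
    shared_counter = counter


def bump(counter):
    """Add 1 to a multiprocessing.Value under its lock; return the new value."""
    with counter.get_lock():
        counter.value += 1
        return counter.value


def work(item):
    """Hypothetical stand-in for run_scrape(); does no real work."""
    global plain_counter
    plain_counter += 1       # only changes this worker's private copy
    bump(shared_counter)     # visible to the parent and to all workers
    return item


if __name__ == "__main__":
    counter = Value('i', 0)
    with Pool(4, initializer=init_worker, initargs=(counter,)) as pool:
        pool.map(work, range(100))
    print("plain global in parent:", plain_counter)   # still 0
    print("shared Value in parent:", counter.value)   # 100
```

Returning the new value from inside the lock means a worker can print the number it actually produced, rather than re-reading .value afterwards when another worker may already have changed it.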

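Answer 1: (score: 0)

Split the list into groups of some size n (probably a multiple of the number of processes in the pool), then iterate over that super-list, creating a new pool for each sub-list. You can do the counting as you traverse the super-list.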
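A rough sketch of this idea, under a few assumptions: chunked and scrape_one are hypothetical helpers, scrape_one only pretends to scrape, and a single pool is reused for every chunk instead of creating a new one per chunk (a deliberate deviation, to avoid starting the worker processes over and over). Because only the parent counts and prints, no shared state is needed.

```python
from multiprocessing import Pool
from sys import stdout


def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def scrape_one(name):
    """Hypothetical stand-in for run_scrape(); True means 'processed'."""
    return name.endswith("0")


if __name__ == "__main__":
    list_to_scrape = ["data%05d" % x for x in range(1, 201)]
    searched = processed = 0
    with Pool(8) as pool:
        for chunk in chunked(list_to_scrape, 16):   # 16 = 2 * pool size
            results = pool.map(scrape_one, chunk)   # blocks until the chunk is done
            searched += len(results)
            processed += sum(results)               # True counts as 1
            stdout.write("\rTotal Searched/Processed: %d/%d" % (searched, processed))
            stdout.flush()
    stdout.write("\n")
```

The trade-off is that the counter only advances once per chunk, and some workers sit idle while the slowest item in each chunk finishes.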