Why do multiprocessing.Pool and multiprocessing.Process perform so differently on Linux

Date: 2017-05-26 18:12:54

Tags: python linux multiprocessing

I ran some test code to compare the performance of Pool and Process on Linux, using Python 2.7. The source code of multiprocessing.Pool suggests that it uses multiprocessing.Process internally, yet multiprocessing.Pool costs far more time and memory than the same number of multiprocessing.Process workers, and I don't understand why.

Here is what I did:

  1. Create a large dict, then create the subprocesses.

  2. Pass the dict to every subprocess for read-only use.

  3. Each subprocess does some computation and returns a small result.

  4. Here is my test code:

    from multiprocessing import Pool, Process, Queue
    import time, psutil, os, gc
    
    gct = time.time
    costTime = lambda ET: time.strftime('%H:%M:%S', time.gmtime(int(ET)))
    
    def getMemConsumption():
        procId = os.getpid()
        proc = psutil.Process(procId)
        mem = proc.memory_info().rss
        return "process ID %d.\nMemory usage: %.6f GB" % (procId, mem*1.0/1024**3)
    
    def f_pool(l, n, jobID):
        try:
            result = {}
            # example of subprocess work
            for i in xrange(n):
                result[i] = l[i]
            # work done
            # gc.collect()
            print getMemConsumption()
            return 1, result, jobID
        except:
            return 0, {}, jobID
    
    def f_proc(q, l, n, jobID):
        try:
            result = {}
            # example of subprocess work
            for i in xrange(n):
                result[i] = l[i]
            # work done
            print getMemConsumption()
            q.put([1, result, jobID])
        except:
            q.put([0, {}, jobID])
    
    def initialSubProc(targetFunc, procArgs, jobID):
        outQueue = Queue()
        args = [outQueue]
        args.extend(procArgs)
        args.append(jobID)
        p = Process(target = targetFunc, args = tuple(args))
        p.start()
        return p, outQueue
    
    
    # keep at most maxProcN workers alive: when a worker finishes, join it,
    # collect its result from its Queue, and start the next job if any remain
    def track_add_Proc(procList, outQueueList, maxProcN, jobCount,
                       maxJobs, targetFunc, procArgs, joinFlag, all_results):
        if len(procList) < maxProcN:
            p, q = initialSubProc(targetFunc, procArgs, jobCount)
            outQueueList.append(q)
            procList.append(p)
            jobCount += 1
            joinFlag.append(0)
        else:
            for i in xrange(len(procList)):
                if not procList[i].is_alive() and joinFlag[i] == 0:
                    procList[i].join()
                    all_results.append(outQueueList[i].get())
                    joinFlag[i] = 1 # in case of duplicating result of joined subprocess
                    if jobCount < maxJobs:
                        p, q = initialSubProc(targetFunc, procArgs, jobCount)
                        procList[i] = p
                        outQueueList[i] = q
                        jobCount += 1
                        joinFlag[i] = 0
        return jobCount
    
    if __name__ == '__main__':
        st = gct()
        d = {i:i**2 for i in xrange(10000000)}
        print "MainProcess create data dict\n%s" % getMemConsumption()
        print 'Time to create dict: %s\n\n' % costTime(gct()-st)
    
        nproc = 2
        jobs = 8
        subProcReturnDictLen = 1000
        procArgs = [d, subProcReturnDictLen]
    
        print "Use multiprocessing.Pool, max subprocess = %d, jobs = %d\n" % (nproc, jobs)
        st = gct()
        pool = Pool(processes = nproc)
        for i in xrange(jobs):
            procArgs.append(i)
            sp = pool.apply_async(f_pool, tuple(procArgs))
            procArgs.pop(2)
            res = sp.get()
            if res[0] == 1:
                # do something with the result
                pass
            else:
                # do something with subprocess exception handle
                pass
        pool.close()
        pool.join()
        print "Total time used to finish all jobs: %s" % costTime(gct()-st)
        print "Main Process\n", getMemConsumption(), '\n'
    
        print "Use multiprocessing.Process, max subprocess = %d, jobs = %d\n" % (nproc, jobs)
        st = gct()
        procList = []
        outQueueList = []
        all_results = []
        jobCount = 0
        joinFlag = []
        while (jobCount < jobs):
            jobCount = track_add_Proc(procList, outQueueList, nproc, jobCount, 
                                      jobs, f_proc, procArgs, joinFlag, all_results)
        for i in xrange(nproc):
            if joinFlag[i] == 0:
                procList[i].join()
                all_results.append(outQueueList[i].get())
                joinFlag[i] = 1
        for i in xrange(jobs):
            res = all_results[i]
            if res[0] == 1:
                # do something with the result
                pass
            else:
                # do something with subprocess exception handle
                pass
        print "Total time used to finish all jobs: %s" % costTime(gct()-st)
        print "Main Process\n", getMemConsumption()
    

The results are:

    MainProcess create data dict
    process ID 21256.
    Memory usage: 0.841743 GB
    Time to create dict: 00:00:02
    
    
    Use multiprocessing.Pool, max subprocess = 2, jobs = 8
    
    process ID 21266.
    Memory usage: 1.673084 GB
    process ID 21267.
    Memory usage: 1.673088 GB
    process ID 21266.
    Memory usage: 2.131172 GB
    process ID 21267.
    Memory usage: 2.131172 GB
    process ID 21266.
    Memory usage: 2.176079 GB
    process ID 21267.
    Memory usage: 2.176083 GB
    process ID 21266.
    Memory usage: 2.176079 GB
    process ID 21267.
    Memory usage: 2.176083 GB
    
    Total time used to finish all jobs: 00:00:49
    Main Process
    process ID 21256.
    Memory usage: 0.843079 GB 
    
    
    Use multiprocessing.Process, max subprocess = 2, jobs = 8
    
    process ID 23405.
    Memory usage: 0.840614 GB
    process ID 23408.
    Memory usage: 0.840618 GB
    process ID 23410.
    Memory usage: 0.840706 GB
    process ID 23412.
    Memory usage: 0.840805 GB
    process ID 23415.
    Memory usage: 0.840900 GB
    process ID 23417.
    Memory usage: 0.840973 GB
    process ID 23419.
    Memory usage: 0.841061 GB
    process ID 23421.
    Memory usage: 0.841152 GB
    
    Total time used to finish all jobs: 00:00:00
    Main Process
    process ID 21256.
    Memory usage: 0.843781 GB
    

I don't know why the subprocesses from multiprocessing.Pool need about 1.6 GB from the start, while the subprocesses from multiprocessing.Process need only 0.84 GB, which equals the memory cost of the main process. It looks to me as if only multiprocessing.Process gets Linux's copy-on-write benefit, since the time needed for all its jobs is less than 1 s. I don't understand why multiprocessing.Pool does not get it; from the source code, multiprocessing.Pool seems to be just a wrapper around multiprocessing.Process.
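
A side note on how the memory is measured: rss also counts pages that a forked child still shares copy-on-write with its parent, so it cannot by itself show which process actually copied the dict. A small variant of getMemConsumption (a hypothetical helper, not part of the test code above) that also reports uss, the memory a process owns exclusively, would make the comparison sharper; this sketch assumes psutil >= 4.0 on Linux:

    import os, psutil

    def getMemConsumptionFull():
        # memory_full_info() exposes uss on Linux: pages unique to this process,
        # i.e. excluding copy-on-write pages still shared with the parent
        proc = psutil.Process(os.getpid())
        mem = proc.memory_full_info()
        return "process ID %d. rss: %.3f GB, uss: %.3f GB" % (
            os.getpid(), mem.rss / 1024.0 ** 3, mem.uss / 1024.0 ** 3)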

1 Answer:

Answer 0 (score: 0)

Question: "I don't know why the subprocesses from multiprocessing.Pool need about 1.6 GB at the beginning, ... Pool seems like a wrapper of multiprocessing.Process"

First, this is the memory that Pool reserves for the results of all the jobs.
Second, Pool uses two SimpleQueue() objects and three Threads.
Third, Pool copies all of the argv data passed in before handing it to the worker process; a rough sketch of that cost follows below.
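
The copying in the third point is what defeats copy-on-write here: apply_async pickles its arguments, including the whole dict, and ships them to the worker through a queue, so each job pays for a serialized copy of the data. A standalone way to get a feel for just the serialization part of that cost (the exact pickle protocol Pool uses internally may differ):

    import pickle, time

    # build the same dict as in the question and time only the pickling step
    d = {i: i**2 for i in xrange(10000000)}
    t0 = time.time()
    payload = pickle.dumps(d, protocol=2)   # roughly what one apply_async call must transfer
    print "pickled size: %.2f GB, pickling time: %.1f s" % (
        len(payload) / 1024.0 ** 3, time.time() - t0)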

Your Process example uses only one Queue() for everything and passes the argv as they are, so the forked children simply reuse the dict the parent already built (copy-on-write).

Pool is far from being only a wrapper around Process.
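
Given that, one way (on Linux, where Pool workers are created by fork) to keep the copy-on-write benefit while still using Pool would be to leave the big dict in a module-level global that exists before the Pool is created, and pass only the small per-job arguments. This is only an illustrative sketch; names such as shared_d and f_pool_shared are made up for the example:

    from multiprocessing import Pool

    shared_d = None   # filled in by the parent before the workers are forked

    def f_pool_shared(n, jobID):
        # the worker reads the dict it inherited via fork; no big argument is pickled
        result = {}
        for i in xrange(n):
            result[i] = shared_d[i]
        return 1, result, jobID

    if __name__ == '__main__':
        shared_d = {i: i**2 for i in xrange(10000000)}
        pool = Pool(processes=2)              # workers are forked after shared_d exists
        handles = [pool.apply_async(f_pool_shared, (1000, j)) for j in xrange(8)]
        results = [h.get() for h in handles]
        pool.close()
        pool.join()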