Python multiprocessing with a single worker is faster than sequential operation

Time: 2018-04-01 12:17:06

Tags: python multithreading multiprocessing runtime pool

Brief overview: to compare the performance of Python multiprocessing against sequential processing, I write a number of files full of random numbers to disk and then read them back.

  

Function descriptions

putfiles: writes the test files to the drive

readFile: reads the file at the given path and writes the result (the sum of the numbers in the file) to an answer file

getSequential: reads the files one by one in a for loop

getParallel: reads the files using a pool of multiple processes

  

Performance results (reading and processing 100 files, sequential vs. process pool):

timeit getSequential(numFiles=100) - best around 2.85 s

timeit getParallel(numFiles=100, numProcesses=4) - best around 960 ms

timeit getParallel(numFiles=100, numProcesses=1) - best around 980 ms

Surprisingly, a pool with a single process outperforms the sequential version and performs about the same as a pool with 4 processes. Is this behavior expected, or am I doing something wrong here?
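One thing worth ruling out first: `timeit getSequential(...)` is the IPython/Spyder `%timeit` magic, which reruns the same statement many times over the *same* files, so later repetitions can be served from the OS disk cache. A plain-script alternative (a sketch, not the original benchmark; `work` is a hypothetical stand-in for the file-reading functions) is the standard-library `timeit` module with `number=1`:

```python
import timeit

def work(n=100):
    # stand-in for getSequential/getParallel; the real code reads files instead
    return sum(i * i for i in range(n))

# number=1 times a single fresh run instead of averaging many repeats
# over the same (by then OS-cached) files
elapsed = timeit.timeit(lambda: work(100), number=1)
print(f"work(100) took {elapsed:.6f} s")
```

Timing a callable this way avoids the setup-string import dance and works identically inside and outside IPython.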

import os
import random
from multiprocessing import Pool

os.chdir('/Users/test/Desktop/filewritetest')

def putfiles(numFiles=5, numCount=100):
    #numFiles = int(input("how many files?: "))
    #numCount = int(input('How many random numbers?: '))
    for num in range(numFiles):
        with open('r' + str(num) + '.txt', 'w') as f:
            f.write("\n".join([str(random.randint(1, 100)) for i in range(numCount)]))

def readFile(fileurl):
    with open(fileurl, 'r') as f, open("ans_" + fileurl, 'w') as fw:
        fw.write(str((sum([int(i) for i in f.read().split()]))))

def getSequential(numFiles=5):
    #in1 = int(input("how many files?: "))
    for num in range(numFiles):
        (readFile('r' + str(num) + '.txt'))


def getParallel(numFiles=5, numProcesses=2):
    #numFiles = int(input("how many files?: ")) 
    #numProcesses = int(input('How many processes?: '))
    with Pool(numProcesses) as p:
        p.map(readFile, ['r' + str(num) + '.txt' for num in range(numFiles)])


#putfiles()

putfiles(numFiles=1000, numCount=100000)

# `timeit` below is the IPython/Spyder %timeit magic, not plain Python
timeit getSequential(numFiles=100)
##around 2.85s best

timeit getParallel(numFiles=100, numProcesses=1)
##around 980ms best
timeit getParallel(numFiles=100, numProcesses=4)
##around 960ms best
  

Update: in a fresh Spyder session I no longer see this issue. Updated runtimes below:
##100 files
#around 2.97s best
timeit getSequential(numFiles=100)

#around 2.99s best
timeit getParallel(numFiles=100, numProcesses=1)

#around 1.57s best
timeit getParallel(numFiles=100, numProcesses=2)

#around 942ms best
timeit getParallel(numFiles=100, numProcesses=4)

##1000 files
#around 29.3s best
timeit getSequential(numFiles=1000)

#around 11.8s best
timeit getParallel(numFiles=1000, numProcesses=4)

#around 9.6s best
timeit getParallel(numFiles=1000, numProcesses=16)

#around 9.65s best  #let pool choose best default value
timeit getParallel(numFiles=1000)

1 Answer:

Answer 0 (score: 0)

Please don't treat this as an answer; it is just to show the code I used when running this on Python 3.x (your timeit usage did not work for me at all - I assume it is 2.x). Sorry, I don't have time to dig into it properly right now.

[EDIT] On a spinning drive, take disk caching into account: don't access the same files across the different tests, or simply switch the order of the tests to see whether the disk cache is involved.
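One concrete way to act on that advice (a sketch of my own, not code from the answer) is to write a fresh batch of files into a new temporary directory before each timed test. Writing still populates the page cache, so this does not eliminate caching, but it puts every test in the same just-written cache state instead of letting later tests reread files cached by earlier ones:

```python
import os
import random
import tempfile

def put_files(dirpath, num_files=100, num_count=1000):
    # write a batch of random-number files (r0.txt, r1.txt, ...) into dirpath
    for num in range(num_files):
        path = os.path.join(dirpath, f"r{num}.txt")
        with open(path, "w") as f:
            f.write("\n".join(str(random.randint(1, 100)) for _ in range(num_count)))

def fresh_batch(num_files=100, num_count=1000):
    # a new temp directory per timing run, so no run rereads
    # files that a previous run already pulled into the cache
    dirpath = tempfile.mkdtemp(prefix="mpbench_")
    put_files(dirpath, num_files, num_count)
    return dirpath

batch = fresh_batch(num_files=5, num_count=10)
print(sorted(os.listdir(batch)))
```

Each timed call would then be pointed at its own `fresh_batch()` directory rather than a shared one.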

With the code below, manually changing the numProcesses=X parameter, I got the following results:

On an SSD: 0.31 s for 1000 files sequentially, 0.37 s for 1000 files in parallel with 1 process, and 0.23 s for 1000 files in parallel with 4 processes.

import os
import random
import timeit
from multiprocessing import Pool
from contextlib import closing

os.chdir('c:\\temp\\')

def putfiles(numFiles=5, numCount=1):
    #numFiles = int(input("how many files?: "))
    #numCount = int(input('How many random numbers?: '))
    for num in range(numFiles):
        #print("num: " + str(num))
        with open('r' + str(num) + '.txt', 'w') as f:
            f.write("\n".join([str(random.randint(1, 100)) for i in range( numCount )]))
    #print ("pufiles done")

def readFile(fileurl):
    with open(fileurl, 'r') as f, open("ans_" + fileurl, 'w') as fw:
        fw.write(str((sum([int(i) for i in f.read().split()]))))


def getSequential(numFiles=10000):
   # print ("getSequential, nufile: " + str (numFiles))
    #in1 = int(input("how many files?: "))
    for num in range(numFiles): 
        #print ("getseq for")
        (readFile('r' + str(num) + '.txt'))
    #print ("getSequential done")


def getParallel(numFiles=10000, numProcesses=1):
    #numFiles = int(input("how many files?: ")) 
    #numProcesses = int(input('How many processes?: '))
    #readFile, ['r' + str(num) + '.txt' for num in range(numFiles)]
    #with Pool(10) as p:
    # use the numProcesses parameter instead of a hard-coded pool size
    with closing(Pool(processes=numProcesses)) as p:
       p.map(readFile, ['r' + str(num) + '.txt' for num in range(numFiles)])

if __name__ == '__main__':
    #putfiles(numFiles=10000, numCount=1)

    print (timeit.timeit ("getSequential()","from __main__ import getSequential",number=1))

    print (timeit.timeit ("getParallel()","from __main__ import getParallel",number=1)) 

#timeit (getParallel(numFiles=100, numProcesses=4)) #-around 960ms best

#timeit (getParallel(numFiles=100, numProcesses=1)) #-around 980ms best