Multiprocessing file output noticeably slower

Date: 2015-02-13 16:52:30

Tags: python

I'm trying to improve the performance of a script that parses data out of large text files (1-100 GB). I thought I'd give multiprocessing a try to see if it speeds things up. As far as I can tell, the processes start fine, but the run is about 3x slower than without multiprocessing.

Multiprocessing version:

from multiprocessing import Process
from datetime import datetime

def worker(mylist, count):
    outFile = str(count) + '.txt'
    # context manager ensures the output file is flushed and closed
    with open(outFile, 'w') as out:
        for i in mylist:
            out.write(i)

def main():
    startTime = datetime.now()
    jobs = []
    tempList = []
    count = 0
    with open('batch1.kscsv', 'r') as inFile:
        for line in inFile:
            if 'Traversal' in line and len(tempList) == 0:
                traversalString = line
            if 'Traversal' not in line and 'Spot' not in line and 'XValue' not in line:
                line = line.replace(',', ' ')
                tempList.append(line)
            if 'Traversal' in line and len(tempList) > 0:
                # spot name parsed from the traversal header (not used below)
                spotFromFile = (traversalString.split(',')[1]).strip()
                count += 1
                # spawn one process per block of lines
                p = Process(target=worker, args=(tempList, count))
                p.start()
                jobs.append(p)
                tempList = []
                traversalString = line

    # wait for every worker to finish so the timing is meaningful
    for p in jobs:
        p.join()

    print('Run took: ' + str(datetime.now() - startTime))


if __name__ == '__main__':
    main()

Regular (single-process) script:

from datetime import datetime      

def main():
    startTime = datetime.now()
    tempList = []
    count = 0
    with open('batch1.kscsv', 'r') as inFile:
        for line in inFile:
            if 'Traversal' in line and len(tempList) == 0:
                traversalString = line
            if 'Traversal' not in line and 'Spot' not in line and 'XValue' not in line:
                line = line.replace(',', ' ')
                tempList.append(line)
            if 'Traversal' in line and len(tempList) > 0:
                # spot name parsed from the traversal header (not used below)
                spotFromFile = (traversalString.split(',')[1]).strip()
                count += 1

                # write the block out inline, one file per block
                outFile = str(count) + '.txt'
                with open(outFile, 'w') as out:
                    for i in tempList:
                        out.write(i)

                tempList = []
                traversalString = line

    print('Run took: ' + str(datetime.now() - startTime))


if __name__ == '__main__':
    main()

Is this a problem that just isn't well suited to multiprocessing, or is there a way to improve the multiprocessing approach?

1 Answer:

Answer 0 (score: 0):

Using multiple processes helps with CPU-bound tasks. Here, the function you're delegating to just opens a file and writes to it. That is mostly IO-bound, so all you gain is more processes competing for the disk.

On top of that, forking is fairly expensive, which slows things down further. It would be more efficient to use a process pool rather than creating a new process for every call, since each process in the pool pays its setup cost only once (a sketch of this follows at the end of this answer).

def worker(mylist, count):
    outFile = str(count) + '.txt'    # a few string operations, CPU, fast
    with open(outFile, 'w') as out:  # opening a file is a system call, slow, won't benefit from multiprocessing
        for i in mylist:             # iterating the list, CPU, fast
            out.write(i)             # writing to the file, IO, slow

And here:

# forking, fixed cost, expensive, adds work for the kernel in tracking COW or consumes lots of memory depending on OS.
p = Process(target=worker, args=(tempList, count,))
p.start()

Multiprocessing starts to pay off when the worker function actually has to do significant work on its input: tree searches, computations, and so on. Even then, it is better to keep a fixed number of processes pulling work from a queue than to spawn a new process for each analysis (unless there are very few of them).
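
A minimal sketch of that pool-based version, assuming Python 3 and reusing the question's worker and block-splitting loop (the unused traversalString/spotFromFile bookkeeping is dropped, and the pool size of 4 is an arbitrary choice):

from multiprocessing import Pool
from datetime import datetime

def worker(mylist, count):
    # same worker as before, but each pool process is forked only once
    with open(str(count) + '.txt', 'w') as out:
        for i in mylist:
            out.write(i)

def main():
    startTime = datetime.now()
    pool = Pool(processes=4)  # fixed number of workers
    tempList = []
    count = 0
    with open('batch1.kscsv', 'r') as inFile:
        for line in inFile:
            if 'Traversal' in line and len(tempList) > 0:
                count += 1
                # hand the completed block to an idle pool worker
                pool.apply_async(worker, (tempList, count))
                tempList = []
            if 'Traversal' not in line and 'Spot' not in line and 'XValue' not in line:
                tempList.append(line.replace(',', ' '))
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for all queued blocks to be written
    print('Run took: ' + str(datetime.now() - startTime))

if __name__ == '__main__':
    main()

Even so, with work this IO-bound the pool version may not beat the plain loop; the point is only that it removes the per-block fork cost.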