Question

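I'm processing large CSV files (several GB, on the order of 10 million lines) using a Python script.

The files have varying row lengths and cannot be loaded fully into memory for analysis.

Each line is handled separately by a function in my script. Analyzing one file takes about 20 minutes, and disk access speed does not appear to be the issue; the time goes into processing and function calls.

The code looks something like this (very simple). The actual code uses a class structure, but this is similar: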

csvReader = csv.reader(open("file", "r"))
for row in csvReader:
   handleRow(row, dataStructure)

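Given that the calculation requires a shared data structure, what would be the best way to run the analysis in parallel in Python using multiple cores?

In general, how do I read multiple lines at once from a .csv in Python to hand off to a thread or process? Looping over the rows with for doesn't sound very efficient.

Thank you!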

Answer 1

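This might be too late, but I'll post it anyway for future users. Another poster mentioned using multiprocessing. I can vouch for it and go into more detail. We use Python every day to process files ranging from hundreds of MB to several GB, so it's definitely up to the task. Some of the files we deal with aren't CSVs, so the parsing can be fairly complex and take longer than the disk access. However, the approach is the same no matter what the file type.

You can process pieces of a large file concurrently. Here's pseudocode of how we do it: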

import multiprocessing as mp
import os

# process file function
def processfile(filename, start=0, stop=0):
    results = []  # placeholder for whatever the processing produces
    if start == 0 and stop == 0:
        pass  # ... process entire file ...
    else:
        # binary mode, so that offsets are plain byte counts
        with open(filename, 'rb') as fh:
            fh.seek(start)
            # read exactly the bytes of this chunk; the caller aligns
            # chunk ends to line breaks
            lines = fh.read(stop - start).splitlines()
            # ... process these lines ...

    return results

if __name__ == "__main__":

    filename = "file"  # path of the large input file (placeholder)

    # get file size and set chunk size
    filesize = os.path.getsize(filename)
    split_size = 100 * 1024 * 1024

    # determine if it needs to be split
    if filesize > split_size:

        # create pool, initialize chunk start location (cursor)
        pool = mp.Pool(mp.cpu_count())
        cursor = 0
        results = []
        with open(filename, 'rb') as fh:

            # keep creating chunks until the whole file is covered
            while cursor < filesize:

                # determine where the chunk ends, is it the last one?
                end = min(cursor + split_size, filesize)

                # seek to end of chunk and read next line to ensure you
                # pass entire lines to the processfile function
                fh.seek(end)
                fh.readline()

                # get current file location
                end = fh.tell()

                # add chunk to process pool, save reference to get results
                proc = pool.apply_async(processfile, args=[filename, cursor, end])
                results.append(proc)

                # setup next chunk
                cursor = end

        # close and wait for pool to finish
        pool.close()
        pool.join()

        # iterate through results
        for proc in results:
            processfile_result = proc.get()

    else:
        pass  # ... process normally ...

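Like I said, that's only pseudocode. It should get anyone started who needs to do something similar. I don't have the code in front of me; I'm just writing it from memory.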
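The chunk-boundary step is the part most likely to go wrong, so it can help to check it in isolation. The following is a minimal sketch, not the code we actually run: `chunk_boundaries` is a hypothetical helper name, and opening the file in binary mode (so that offsets are plain byte counts) is an assumption of this sketch. It also assumes no quoted CSV field contains an embedded line break.

```python
import os

def chunk_boundaries(filename, split_size):
    """Return (start, end) byte offsets that cover the whole file.

    Every chunk except possibly the last ends right after a line break,
    so each worker receives only complete lines.
    """
    filesize = os.path.getsize(filename)
    boundaries = []
    cursor = 0
    with open(filename, "rb") as fh:  # binary mode: offsets are byte counts
        while cursor < filesize:
            # jump roughly split_size bytes ahead, but never past the end
            fh.seek(min(cursor + split_size, filesize))
            # finish the line we landed in
            fh.readline()
            end = fh.tell()
            boundaries.append((cursor, end))
            cursor = end
    return boundaries
```

Each pair can then be passed as `start` and `stop` to a worker, as in the pseudocode above.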

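Still, we got more than a 2x speedup on the first run, without any fine-tuning. Depending on your setup, you can tune the number of processes in the pool and the size of the chunks to get an even bigger speedup. If you have multiple files, as we do, create a pool to read several files in parallel. Just be careful not to overload the box with too many processes.

Note: you need to put this inside an "if main" block to make sure processes aren't spawned without end.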

Answer 2

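Because of the GIL, Python's threading won't speed up processor-bound computations the way it can speed up IO-bound ones.

Instead, take a look at the multiprocessing module, which can run your code in parallel on multiple processors.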

Answer 3

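Try benchmarking reading your file and parsing each CSV row while doing nothing with it. You've ruled out disk access, but you still need to find out whether it's the CSV parsing that is slow or your own code.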
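A minimal sketch of such a benchmark, assuming the file can be read with the standard `csv` module and default dialect settings (`time_csv_parsing` is a hypothetical helper name, not something from the question):

```python
import csv
import time

def time_csv_parsing(path):
    """Read and parse every row of the CSV without doing any work on it.

    Returns (row_count, elapsed_seconds).
    """
    start = time.perf_counter()
    row_count = 0
    with open(path, newline="") as fh:
        for _row in csv.reader(fh):
            row_count += 1
    elapsed = time.perf_counter() - start
    return row_count, elapsed
```

If this alone takes close to the full 20 minutes, the parsing is the bottleneck; if it finishes much sooner, the time is going into the per-row handling.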

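If it's the CSV parsing that is slow, you might be stuck, because I don't think there's a way to jump into the middle of a CSV file without scanning up to that point.

If it's your own code, you can have one thread read the CSV file and drop rows into a queue, and then have multiple threads process rows from that queue. But don't bother with this solution if the CSV parsing itself is what makes it slow.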
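A rough sketch of that reader-plus-queue layout, using only the standard library. The names `handle_row` and `process_with_queue`, the worker count, and the queue size are all assumptions made for illustration; `handle_row` stands in for the real per-row work.

```python
import csv
import queue
import threading

NUM_WORKERS = 4      # assumed worker count; tune for your machine
_SENTINEL = None     # marks the end of the input (csv rows are lists, never None)

def handle_row(row):
    # Hypothetical stand-in for the real per-row work.
    return len(row)

def read_rows(path, row_queue, num_workers):
    """Producer: parse the CSV and put each row on the queue."""
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            row_queue.put(row)
    for _ in range(num_workers):
        row_queue.put(_SENTINEL)  # one stop signal per worker

def work(row_queue, results, lock):
    """Consumer: process rows from the queue until the sentinel arrives."""
    while True:
        row = row_queue.get()
        if row is _SENTINEL:
            break
        value = handle_row(row)
        with lock:
            results.append(value)

def process_with_queue(path, num_workers=NUM_WORKERS):
    # Bounded queue, so the reader cannot run far ahead of the workers.
    row_queue = queue.Queue(maxsize=10000)
    results, lock = [], threading.Lock()
    workers = [threading.Thread(target=work, args=(row_queue, results, lock))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    read_rows(path, row_queue, num_workers)  # the calling thread is the reader
    for w in workers:
        w.join()
    return results
```

Results come back in arbitrary order. Keep in mind that threads share one interpreter, so for processor-bound row handling the GIL may limit how much this helps.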

Answer 4

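If the rows are completely independent, just split the input file into as many files as you have CPUs. After that, you can run as many instances of the process as you now have input files. Since these instances are completely separate processes, they won't be bound by GIL problems.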
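One way the split could look, as a sketch under a few assumptions: the file has no header row, no quoted field contains an embedded line break, and the part files are written next to the input with a `.partN` suffix (`split_file` and the naming scheme are made up for this example).

```python
import os

def split_file(path, parts):
    """Split `path` into at most `parts` files, cutting only at line ends.

    Streams line by line, so the input never has to fit in memory.
    Returns the list of part file names.
    """
    size = os.path.getsize(path)
    target = max(1, -(-size // parts))  # ceiling division: bytes per part
    names = []
    dst = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            if dst is None:
                # start a new part file
                name = f"{path}.part{len(names)}"
                dst = open(name, "wb")
                names.append(name)
                written = 0
            dst.write(line)
            written += len(line)
            if written >= target:
                # this part is full; the next line opens a new one
                dst.close()
                dst = None
    if dst is not None:
        dst.close()
    return names
```

Calling `split_file("file", os.cpu_count())` would give one part per CPU, and each part can then be fed to its own instance of the script.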

Answer 5

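If you use zmq and a DEALER middleman, you'd be able to scale the row processing not only across the CPUs on your machine but also across a network, to as many processes as necessary. This would essentially guarantee that you hit an IO limit rather than a CPU limit :)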

Answer 6

Just found a solution to this old problem. I tried Pool.imap, and it seems to simplify processing large files significantly. imap has one significant benefit when it comes to processing large files: it returns results as soon as they are ready, rather than waiting for all of the results to be available. This saves a lot of memory.

(Here is an untested snippet of code that reads a csv file row by row, processes each row, and writes it back to a different csv file. Everything is done in parallel.)

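import multiprocessing as mp
import csv

CHUNKSIZE = 10000   # Set this to whatever you feel reasonable

def process(row):
    # Placeholder for the real per-row work; must return the row to write
    return row

def _run_parallel(csvfname, csvoutfname):
    with open(csvfname, newline='') as csvf, \
         open(csvoutfname, 'w', newline='') as csvout, \
         mp.Pool() as p:
        reader = csv.reader(csvf)
        writer = csv.writer(csvout)
        # imap yields results in input order as they become available
        writer.writerows(p.imap(process, reader, chunksize=CHUNKSIZE))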

Processing large .csv files in parallel in Python
