Question

I have a question about Python multiprocessing.

I have a large CSV file, test.csv, with 2 million rows and 2 columns: firm_id and product_id. The last one, product_id, is the input to another function, as in func1(product_id).

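That is the basic setup. Because the file is very large and each product_id can be processed independently, I would like to use Python's multiprocessing, which I have never worked with before. After searching on Google for a while I found some useful information (for example this and this), but none of it got me to a working solution. I tried the last one and edited it as shown below, but it did not work: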

import itertools as IT
import multiprocessing as mp
import csv
import funcitons as fdfunc # a self defined module with function func1 in it

def worker(chunk):
    return len(chunk)  


def main():
    # num_procs is the number of workers in the pool
    num_procs = 2
    # chunksize is the number of lines in a chunk
    chunksize = 10**5

    pool = mp.Pool(num_procs)
    largefile = 'test.csv'
    results = []
    with open(largefile, 'r') as f,\
    open('file_to_store_resutl.csv','a+') as res_file:
        reader = csv.reader(f)       

        for chunk in iter(lambda: list(IT.islice(reader, chunksize*num_procs)), []):
            chunk = iter(chunk)
            pieces = list(iter(lambda: list(IT.islice(chunk, chunksize)), []))

            result = pool.imap(fdfunc.func1, pieces['product_id']) #pieces['product_id'] this definitely is wrong, just to show what I want to do
            writer = csv.writer(res_file)
            for item in result:
                writer.writerow(item)

            results.append(result)
if __name__ == '__main__':
    main()

Does anyone know how I should do this?
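
For reference, here is a minimal sketch of the kind of thing I am trying to achieve; it is not a verified solution. It assumes test.csv has a header row naming the firm_id and product_id columns. The func1 below is only a hypothetical stand-in for fdfunc.func1, and the helper names (worker, iter_product_ids, write_results, run) are made up for illustration. The idea is to stream only the product_id column into pool.imap and write each result as it arrives.

```python
import csv
import multiprocessing as mp


def func1(product_id):
    # Hypothetical stand-in for fdfunc.func1; replace with the real function.
    return len(product_id)


def worker(product_id):
    # Return the input alongside the result so each output row is self-describing.
    return product_id, func1(product_id)


def iter_product_ids(path):
    # Lazily yield the product_id column (assumes a header row that names it).
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row["product_id"]


def write_results(pairs, out_path):
    # Write (product_id, result) pairs to a CSV file with a header row.
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["product_id", "result"])
        writer.writerows(pairs)


def run(in_path, out_path, num_procs=2, chunksize=10**4):
    # imap yields results in input order; chunksize controls how many ids
    # are sent to a worker process in one batch.
    with mp.Pool(num_procs) as pool:
        pairs = pool.imap(worker, iter_product_ids(in_path), chunksize=chunksize)
        write_results(pairs, out_path)
```

With this layout the script would call run("test.csv", "results.csv") under an `if __name__ == "__main__":` guard, where results.csv is just an assumed output name. The guard matters because on platforms that start workers by re-importing the main module, unguarded top-level code would run again in every worker.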

Large CSV file where only one column is used as the argument to a function run with multiprocessing

0 answers: