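I have a question about Python multiprocessing. I have a large CSV file, test.csv, with 2 million rows and 2 columns: firm_id and product_id. The last column, product_id, is the input to another function, called like func1(product_id).
That is the basic setup. Because the file is very large and each product_id can be processed independently, I would like to use Python's multiprocessing module, which I have never worked with before. After googling for a while I found some useful information (for example, this and this), but none of it got me to a working solution. I tried the last one and edited it as shown below, but it did not work: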
import itertools as IT
import multiprocessing as mp
import csv
import funcitons as fdfunc  # a self-defined module with function func1 in it

def worker(chunk):
    return len(chunk)

def main():
    # num_procs is the number of workers in the pool
    num_procs = 2
    # chunksize is the number of lines in a chunk
    chunksize = 10**5
    pool = mp.Pool(num_procs)
    largefile = 'test.csv'
    results = []
    with open(largefile, 'r') as f, \
            open('file_to_store_resutl.csv', 'a+') as res_file:
        reader = csv.reader(f)
        writer = csv.writer(res_file)
        # read chunksize*num_procs rows at a time, then split them into pieces
        for chunk in iter(lambda: list(IT.islice(reader, chunksize*num_procs)), []):
            chunk = iter(chunk)
            pieces = list(iter(lambda: list(IT.islice(chunk, chunksize)), []))
            result = pool.imap(fdfunc.func1, pieces['product_id'])  # pieces['product_id'] is definitely wrong, it is just to show what I want to do
            for item in result:
                writer.writerow(item)
            results.append(result)

if __name__ == '__main__':
    main()
Does anyone know how I should do this?
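To make the goal concrete, here is a minimal sketch of the kind of thing I am after. It is not a confirmed solution and rests on several assumptions: test.csv starts with a header row naming the columns firm_id and product_id, func1 here is a hypothetical placeholder for the real fdfunc.func1, and the helper names process_one, iter_product_ids and process_file are made up for illustration. Instead of slicing the file into chunks by hand, it streams the product_id column into pool.imap and lets the chunksize argument do the batching. It also writes the output file fresh with a header instead of appending.

```python
import csv
import multiprocessing as mp


def func1(product_id):
    # Hypothetical placeholder for fdfunc.func1; replace with the real function.
    return len(product_id)


def process_one(product_id):
    # Return the input alongside the result so each output row is self-describing.
    return product_id, func1(product_id)


def iter_product_ids(path):
    # Stream the product_id column one row at a time.
    # Assumption: the file has a header row with a product_id column.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row["product_id"]


def process_file(in_path, out_path, pool, chunksize=1000):
    # pool is any object exposing imap(), for example multiprocessing.Pool.
    # imap yields results in input order; chunksize batches the IDs sent to
    # each worker so the inter-process overhead is paid per batch, not per row.
    count = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["product_id", "result"])
        ids = iter_product_ids(in_path)
        for product_id, result in pool.imap(process_one, ids, chunksize):
            writer.writerow([product_id, result])
            count += 1
    return count  # number of data rows written


def main():
    # File names taken from the question; two workers as in the original attempt.
    with mp.Pool(processes=2) as pool:
        process_file("test.csv", "file_to_store_resutl.csv", pool)
```

main() should be called under an if __name__ == '__main__': guard, since platforms that spawn worker processes re-import the script. Two caveats: imap reads the input ahead of the workers, so the pending product_id values sit in memory (probably fine for 2 million short strings), and if func1 is very cheap the cost of shipping data between processes may outweigh any speed-up.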