I am reading hundreds of HDF files and processing each HDF's data separately. However, this takes a lot of time, since it works on only one HDF file at a time. I just stumbled upon http://docs.python.org/library/multiprocessing.html and now I'm wondering how to speed things up using multiprocessing.
Here is what I've come up with so far:
import numpy as np
from multiprocessing import Pool

def myhdf(date):
    ii = dates.index(date)
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track'+year+month+day
    records = read_my_hdf(rootdir,filename)
    if records.size:
        results[ii] = np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']
results = np.zeros(len(dates))

pool = Pool(len(dates))
pool.map(myhdf,dates)
However, this is obviously not correct. Can you think through with me what I'm trying to do here? What do I need to change?
Answer 0 (score: 4)
Try joblib for a friendlier wrapper around multiprocessing:
from joblib import Parallel, delayed

def myhdf(date):
    # do work: read the HDF file for this date into `records`
    return np.mean(records)

results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)
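For the concrete case in the question, a minimal sketch might look like the following; it assumes read_my_hdf and the file-naming scheme from the question, and returns None for empty files so they can be filtered out afterwards:

import numpy as np
from joblib import Parallel, delayed

dates = ['20080105', '20080106', '20080107', '20080108', '20080109']

def myhdf(date):
    # assumes read_my_hdf and the 'no2track' + yyyymmdd naming from the question
    rootdir = 'data/mydata/'
    filename = 'no2track' + date[0:4] + date[4:6] + date[6:8]
    records = read_my_hdf(rootdir, filename)
    # None marks an empty file so it can be dropped later
    return np.mean(records) if records.size else None

# n_jobs=-1 uses one worker per CPU core; results come back in input order
results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)
results = np.array([r for r in results if r is not None])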
Answer 1 (score: 2)
The Pool class's map function works like the standard Python library's map function: you are guaranteed to get your results back in the same order as the inputs you passed in. Knowing that, the only other trick is that you need to return results in a consistent way, and then filter them afterwards.
import numpy as np
from multiprocessing import Pool

def myhdf(date):
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track'+year+month+day
    records = read_my_hdf(rootdir,filename)
    if records.size:
        return np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']

pool = Pool(len(dates))
results = pool.map(myhdf,dates)
# drop the None entries from empty files (a bare "if result" would also drop a mean of exactly 0.0)
results = [result for result in results if result is not None]
results = np.array(results)
If you actually want each result as soon as it is ready, rather than in input order, you can use imap_unordered.
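A hedged sketch of that variant: myhdf_with_date is a hypothetical helper that pairs each date with its result, so the arrival order no longer matters; read_my_hdf and the file-name scheme are again taken from the question.

import numpy as np
from multiprocessing import Pool

def myhdf_with_date(date):
    # return the date together with the result so unordered results stay identifiable
    records = read_my_hdf('data/mydata/', 'no2track' + date)
    return date, (np.mean(records) if records.size else None)

dates = ['20080105', '20080106', '20080107', '20080108', '20080109']
pool = Pool(len(dates))
# imap_unordered yields each (date, mean) pair as soon as it has been computed
results = {date: mean for date, mean in pool.imap_unordered(myhdf_with_date, dates)}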