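I am trying to extract a stratified random sample from a large tif file using numpy. Because the file is large, reading it into memory all at once requires a lot of memory, so I am using a generator to read it in chunks. I also found a (pseudo) way to use the generator with multiprocessing by limiting the maximum size of the queue (as suggested in this post: Python multiprocessing with generator).
The problem is that the worker function returns a dictionary of values. I can't figure out how to get the values returned by the worker function back to the main process as a list of dictionaries.
Here is my relevant code: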
import multiprocessing as mp


def process_worker(args):
    tile_id, levels, tie_pt, tile_arr, pixel_size, nsamp = args
    level_dict = get_coords(levels, tie_pt, tile_arr,  ## function separately written
                            pixel_size, nsamp,
                            pixel_center=True)
    if len(level_dict) > 0:
        q_out.put(level_dict)


if __name__ == '__main__':
    infile = "large_file.tif"
    nsamp = 100
    nprocs = 4
    levels = range(0, 100, 1)
    q_in = mp.Queue(maxsize=nprocs)
    q_out = mp.Queue()
    pool = mp.Pool(nprocs, initializer=process_worker, initargs=(q_in,))
    raster = Raster(infile)  ## class separately written
    pixel_size = raster.pixel_size
    tile_count = 1
    for tie_pt, tile_arr in raster.get_next_tile():  ## generator to get file chunks as top left coordinate and numpy array
        tile_coords = get_bounds(tie_pt, tile_arr)  ## separately written function
        q_in.put((tile_count, levels, tie_pt, tile_arr, pixel_size, nsamp))
        tile_count += 1
    pool.close()
    results = q_out.get()  ## this should be a list of dictionaries
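
For reference, here is a minimal sketch of the behaviour I am after, not a working version of my code. It assumes each worker loops over the bounded input queue until it sees a sentinel, puts each non-empty dictionary on the output queue, and the main process keeps calling `get()` until every worker has reported that it is done. `sample_tile` is a hypothetical stand-in for `get_coords` (it takes the first `nsamp` positions per level instead of a random draw), plain lists stand in for the tile arrays, and `collect_samples` and its `start_method` parameter are names made up for this illustration.

```python
import multiprocessing as mp

_DONE = None  # sentinel: "no more tiles" on q_in, "worker finished" on q_out


def sample_tile(values, nsamp):
    """Hypothetical stand-in for get_coords: map each level to the
    first nsamp positions where it occurs (not a random draw)."""
    level_dict = {}
    for pos, level in enumerate(values):
        picked = level_dict.setdefault(level, [])
        if len(picked) < nsamp:
            picked.append(pos)
    return level_dict


def worker(q_in, q_out):
    # Keep pulling tiles until the sentinel arrives.
    while True:
        task = q_in.get()
        if task is _DONE:
            break
        values, nsamp = task
        level_dict = sample_tile(values, nsamp)
        if level_dict:
            q_out.put(level_dict)
    q_out.put(_DONE)  # tell the main process this worker is finished


def collect_samples(tiles, nsamp=100, nprocs=4, start_method=None):
    # start_method=None uses the platform default ("fork", "spawn", ...).
    ctx = mp.get_context(start_method)
    q_in = ctx.Queue(maxsize=nprocs)  # bounded: at most nprocs tiles waiting
    q_out = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(q_in, q_out))
             for _ in range(nprocs)]
    for p in procs:
        p.start()

    for values in tiles:              # tiles may be a generator
        q_in.put((values, nsamp))     # blocks while the queue is full
    for _ in procs:
        q_in.put(_DONE)               # one sentinel per worker

    results, finished = [], 0
    while finished < nprocs:          # drain q_out before joining
        item = q_out.get()
        if item is _DONE:
            finished += 1
        else:
            results.append(item)
    for p in procs:
        p.join()
    return results                    # a list of dictionaries


if __name__ == "__main__":
    print(collect_samples([[0, 0, 1], [1, 2], []], nsamp=1, nprocs=2))
```

The order of the dictionaries in the returned list depends on which worker finishes first, so it should not be relied on. The output queue is drained before `join()` because a process that still has items buffered for a queue may not exit until they are consumed.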