Question

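Task: generate plots (usually a combination of filled and unfilled contours) of some variables contained in datasets produced by numerical weather prediction models. This needs to be automated, since new data are downloaded roughly every 6 hours and new plots have to be produced and uploaded to an FTP server so they can be displayed as soon as possible.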

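My solution and the problem: for speed reasons, the driver script that downloads and prepares the files is written in bash. Since on the server the data are split into individual files, each containing one time step and one level, this stage is parallelized with (GNU) parallel. It is fast and already optimized. The next stage, however, which uses Python (mostly matplotlib.pyplot) to plot the data, is still the longest one and needs a lot of memory.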

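Anatomy of the plotting script: it has two parts, a main function, where all the preprocessing of the input files is done only once, and a plot_files function, which only plots the data.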

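from functools import partial
from glob import glob
from multiprocessing import Pool

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

def main():
    # input_file is a glob pattern; only the first match is opened
    file = glob(input_file)
    dset = xr.open_dataset(file[0])

    # Do some pre-processing and get the variables/coordinates
    # from the file
    # ('var' is a placeholder name; dset.var would be the variance method)
    var = dset['var']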

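Next, the main function uses multiprocessing to spawn multiple instances of the plotting function. The matplotlib figure instance and the basemap projection are created here so that this is done only once: the plotting function takes care of adding elements to the plot and removing them at every time step, before the next iteration.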

    fig = plt.figure()
    ax  = plt.gca()
    m, x, y = get_projection(lon2d, lat2d, projection)

    # All the arguments that need to be passed to the plotting function
    args = dict(m=m, x=x, y=y, ax=ax,
                  var=var, time=time, projection=projection)

    # Parallelize the plotting by dividing into chunks and processes 
    dates = chunks(time, chunks_size)
    plot_files_param = partial(plot_files, **args)
    p = Pool(processes)
    p.map(plot_files_param, dates)

Parallelizing over time comes naturally, since the same plot has to be produced for many time steps and the data are already available.
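For reference, the chunks call in main splits the time axis into pieces that are handed to the workers. A helper like that could look roughly as follows; this is a minimal sketch with assumed argument names, not necessarily the version in utils.py.

```python
def chunks(sequence, chunk_size):
    """Yield successive slices of `sequence`, each at most `chunk_size` long.

    Works on anything that supports len() and slicing, such as a list
    or a 1-D array of time steps.
    """
    for start in range(0, len(sequence), chunk_size):
        yield sequence[start:start + chunk_size]
```

With 7 time steps and a chunk size of 3, this produces three chunks, the last one holding the single leftover step.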

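The plotting function does as little as possible: it gets the variable, plots it by overlaying different levels (if necessary), and removes the elements once the figure has been exported, to get ready for the next time step.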

def plot_files(dates, **args):
    for date in dates:
        # Find index in the original array to subset when plotting
        i = np.argmin(np.abs(date - args['time'])) 
        cs = args['ax'].contourf(args['x'], args['y'], args['var'][i])

        plt.savefig(filename, **options_savefig)        

        remove_collections([cs])

remove_collections removes all the elements (these could be contour lines, labels, annotations...).
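To make the cleanup step concrete, here is a rough sketch of what a helper like remove_collections might do. It is an illustrative version, not the exact code from utils.py, and it relies only on duck typing: matplotlib artists expose remove(), while contour sets in older matplotlib releases keep their artists in a collections list instead.

```python
def remove_collections(elements):
    """Detach plotted elements from their axes so the figure can be reused."""
    for element in elements:
        if isinstance(element, (list, tuple)):
            # A plain list of artists, e.g. contour labels: handle each item.
            remove_collections(element)
        elif hasattr(element, "remove"):
            # Regular artists remove themselves from the axes.
            element.remove()
        elif hasattr(element, "collections"):
            # Older contour sets: remove every artist they hold.
            for coll in list(element.collections):
                coll.remove()
```

Lists are checked first because a Python list also has a remove method, which takes an argument and would fail if called like an artist's remove().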

At the moment I have one script for each type of plot I produce, and all the common functions are imported from a utils.py module.

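I'm sure this is not the smartest implementation of parallel plotting. The drawback I see right now is the memory usage, which could be reduced by passing only the array at a given time step to the plot_files function, but I'm not sure whether an array of variable dimensions can be parallelized with multiprocessing.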
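The idea of passing only one time step to each worker could be sketched like this. It is an untested outline under some assumptions: time is the leading axis of the array, and timestep_tasks, plot_one and run_parallel are hypothetical names. The figure and projection would presumably have to be set up inside each worker rather than passed in.

```python
from multiprocessing import Pool


def timestep_tasks(var, time):
    """Yield one (index, timestamp, slice) tuple per time step.

    Assumes time is the leading axis of `var`; the remaining
    dimensions can have any shape.
    """
    for i, stamp in enumerate(time):
        yield i, stamp, var[i]


def plot_one(task):
    """Hypothetical worker: gets a single time step, not the whole array."""
    i, stamp, field = task
    # The real contourf / savefig / cleanup calls would go here.
    return i


def run_parallel(var, time, processes=2):
    """Map the worker over per-time-step tasks and return the finished indices."""
    with Pool(processes) as pool:
        return sorted(pool.imap_unordered(plot_one, timestep_tasks(var, time)))
```

run_parallel would be called from under an `if __name__ == "__main__":` guard. Because each task carries only var[i], the shape of the remaining dimensions does not matter to multiprocessing, as long as the slice can be pickled.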

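Does anyone have tips for improving my code? I tried to reduce the example to the core, but of course there are more details. One of the projects that uses this script can be seen here: https://github.com/guidocioni/icon_forecasts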

Optimization tips for fully automated, time-parallel plotting of data from N-dimensional datasets
