Question

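So, I have some working code (posted below) for a forecast validation project I'm working on. The code loops over a list of NetCDF files (one for each start date of my forecasts) and loads as much data into memory as I choose to allocate (depending on the computer I'm currently using).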
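
To make the chunking idea concrete, here is a rough sketch of how the memory budget could be turned into chunks of river IDs. The function name `plan_chunks` and its parameters are hypothetical and not part of my actual code. It assumes each river ID needs 3060 bytes per file (15 forecast days x 51 members, the figure used in the code below) and ignores the much smaller initialization array.

```python
def plan_chunks(num_rivids, num_files, memory_gb, bytes_per_series=3060):
    """Split river IDs into (start, end) chunks that fit a memory budget.

    Hypothetical helper: bytes_per_series is the assumed size of one
    15 x 51 forecast array for a single river ID in a single file.
    """
    budget_bytes = int(memory_gb * 1e9)
    # Every river ID held in memory needs one series per forecast file.
    bytes_per_rivid = bytes_per_series * num_files
    # At least one river ID per chunk, and never more than exist.
    chunk_size = max(1, min(num_rivids, budget_bytes // bytes_per_rivid))
    return [
        (start, min(start + chunk_size, num_rivids))
        for start in range(0, num_rivids, chunk_size)
    ]


if __name__ == "__main__":
    # Example: 1,000,000 river IDs, 100 forecast files, 2 GB budget.
    chunks = plan_chunks(1_000_000, 100, 2.0)
    print(len(chunks), "chunks; first chunk:", chunks[0])
```

Each (start, end) pair would then play the role of start_chunk and end_chunk in the loop shown below.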

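Once the data I need for the forecast validation is in memory, I run a routine I wrote with Numba (which uses the LLVM compiler infrastructure) that is very fast, especially for Python. The main bottleneck in the code is the time it takes to read each NetCDF file individually and extract the data from it.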

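I found this feature here that looks like it could help me speed up this process. My question is: is it possible to implement something like this in Python to reduce the time it takes just to read a bunch of files with Xarray?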

Code

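Note: For brevity, I have removed some of the code; this is only meant to give a rough idea of what I'm doing. I have marked with a comment the slowest part of my code, which is where I loop over each file to extract data from it.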

import xarray as xr
import numpy as np
import os
import numba as nb


def compute_all(work_dir, memory_to_allocate_gb):
    array_size_bytes = 3060  # Based on 15 x 51 member array
    memory_to_allocate_bytes = memory_to_allocate_gb * 1e9

    # Compute a bunch of stuff regarding memory allocation here

    for chunk_number in range(num_chunk_iterations):

        ### >>>>>>> This is where the code is slowest <<<<<<<<<
        for file_number, file in enumerate(files):

            print("\tFile Number: ", file_number)

            tmp_dataset = xr.open_dataset(file)

            tmp_forecast_array = tmp_dataset["Qout"].data[start_chunk:end_chunk, :, :]
            tmp_initialization_array = tmp_dataset["initialization_values"].data[start_chunk:end_chunk]
            tmp_dataset.close()

            for forecast_day in range(15):
                big_forecast_data_array[forecast_day, :, file_number, :] = tmp_forecast_array[:, forecast_day, :]

            big_initialization_array[file_number, :] = tmp_initialization_array

        rivids_chunk = rivids[start_chunk:end_chunk]

        # >>>> This is the fast LLVM compiled part <<<<<
        results_array = numba_calculate_metrics(
            big_forecast_data_array, big_initialization_array, len(files), chunk_size, 15
        )


if __name__ == "__main__":

    workspace = r'/directory/with/all/files'
    MEMORY_TO_ALLOCATE = 2.0  # GB
    compute_all(workspace, MEMORY_TO_ALLOCATE)
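
For reference, this is roughly the kind of thing I have in mind: a minimal sketch that reads the per-file slices concurrently with a thread pool instead of one at a time. The names `load_chunk` and `read_all_parallel` are hypothetical and not part of my code. I am assuming the reads can actually overlap; since the underlying NetCDF/HDF5 library may serialize access, threads might not help much, and a process pool or dask might be needed instead.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def load_chunk(path, start_chunk, end_chunk):
    """Read one file's slice of Qout and initialization_values."""
    # Imported here so read_all_parallel can be tried with a stand-in loader.
    import xarray as xr

    with xr.open_dataset(path) as ds:
        # Slice the lazy variables first so only the requested rows are read.
        forecast = ds["Qout"][start_chunk:end_chunk, :, :].values
        init = ds["initialization_values"][start_chunk:end_chunk].values
    return forecast, init


def read_all_parallel(files, start_chunk, end_chunk, loader=load_chunk, max_workers=8):
    """Load every file's slice concurrently and assemble the big arrays."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `files`.
        results = list(pool.map(lambda p: loader(p, start_chunk, end_chunk), files))

    forecasts = np.stack([r[0] for r in results])  # (file, rivid, day, member)
    inits = np.stack([r[1] for r in results])      # (file, rivid)

    # Reorder to the layout of big_forecast_data_array: (day, rivid, file, member).
    return forecasts.transpose(2, 1, 0, 3), inits
```

The two returned arrays would stand in for big_forecast_data_array and big_initialization_array inside the chunk loop, replacing the inner loop over files.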

Xarray parallel I/O for NetCDF
