Dask - Rechunk or array slicing causing excessive memory use?

Date: 2016-10-25 13:37:57

Tags: python windows memory fft dask

Good afternoon,

I'm looking for some help in understanding some excessive (or perhaps not) memory usage in my Dask processing chain.

The problem comes from executing the following function:

def create_fft_arrays(master_array, fft_size, overlap):

    input_shape = master_array.shape[0]
    # Determine zero pad length; cast to int so the array shape below is integral
    zero_len = int(fft_size - ((input_shape - fft_size) % ((1 - overlap) * fft_size)))

    zeros = da.zeros((zero_len, master_array.shape[1]),
                     dtype = master_array.dtype,
                     chunks = (zero_len, master_array.shape[1]))
    # Create the reshaped array
    reshape_array = da.concatenate((master_array, zeros), axis = 0)
    # Create an integer index series to use to index the reshaped array for re-blocking
    fft_index = np.arange(0, reshape_array.shape[0] - (fft_size - 1),
                          int(fft_size * overlap))
    # Break reshape_array into fft size chunks
    fft_arrays = [reshape_array[x:x + fft_size] for x in fft_index]

    # Returns list of dask arrays
    return [array.rechunk(array.shape) for array in fft_arrays]

The master_array Dask array is too large to hold in memory (703 × 57,600,001 points in this case).
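For scale, assuming float64 samples (the dtype the fabricated example below produces), that is roughly 324 GB:

# Back-of-envelope size in GB, assuming 8-byte float64 samples
703 * 57600001 * 8 / 1e9    # ~= 323.9, far beyond 20 GB of RAM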

As a minimal example, the following causes the same memory usage as the full code below:

import dask.array as da
import numpy as np

def create_fft_arrays(master_array, fft_size, overlap):

    input_shape = master_array.shape[0]
    # Determine zero pad length; cast to int so the array shape below is integral
    zero_len = int(fft_size - ((input_shape - fft_size) % ((1 - overlap) * fft_size)))

    zeros = da.zeros((zero_len, master_array.shape[1]),
                     dtype = master_array.dtype,
                     chunks = (zero_len, master_array.shape[1]))
    # Create the reshaped array
    reshape_array = da.concatenate((master_array, zeros), axis = 0)
    # Create an integer index series to use to index the reshaped array for re-blocking
    fft_index = np.arange(0, reshape_array.shape[0] - (fft_size - 1),
                          int(fft_size * overlap))
    # Break reshape_array into fft size chunks
    fft_arrays = [reshape_array[x:x + fft_size] for x in fft_index]

    # Returns list of dask arrays
    return [array.rechunk(array.shape) for array in fft_arrays]

# Fabricate an input array of the same shape and size as the problematic dataset
master_array = da.random.normal(10, 0.1, size = (703, 57600001), chunks = (703, 372))

# Execute the create_fft_arrays function
fft_arrays = create_fft_arrays(master_array.T, 2**15, 0.5)

To put the code in context, executing the following causes my RAM (20 GB) to max out while the last line, fft_arrays = create_fft_arrays(master_array.T, FFT_SIZE, 0.5), executes:

import dask.array as da

import h5py as h5
import numpy as np

import os

FORMAT = '.h5'
DSET_PATH = '/DAS/Data'
TSET_PATH = '/DAS/Time'

FFT_SIZE = 2**15
OVERLAP = 0.5

input_dir = 'D:\\'
file_paths = []

# Get list of all valid files in directory
for dir_name, sub_dir, f_name in os.walk(input_dir):
    for f in f_name:
        if f.endswith(FORMAT):
            file_paths.append(os.path.join(dir_name, f))

#H5 object for each file
file_handles = [h5.File(f_path, 'r') for f_path in file_paths]

# Handle for dataset and timestamps from each file
dset_handles = [f[DSET_PATH] for f in file_handles]
tset_handles = [f[TSET_PATH] for f in file_handles]

# Create a Dask Array object for each dataset and timestamp set
dset_arrays = [da.from_array(dset, chunks = dset.chunks) for dset in dset_handles]
tset_arrays = [da.from_array(tset, chunks = tset.chunks) for tset in tset_handles]

# Concatenate all datasets along the time axis
master_array = da.concatenate(dset_arrays, axis = 1)

def create_fft_arrays(master_array, fft_size, overlap):

    input_shape = master_array.shape[0]
    # Determine zero pad length; cast to int so the array shape below is integral
    zero_len = int(fft_size - ((input_shape - fft_size) % ((1 - overlap) * fft_size)))

    zeros = da.zeros((zero_len, master_array.shape[1]),
                     dtype = master_array.dtype,
                     chunks = (zero_len, master_array.shape[1]))
    # Create the reshaped array
    reshape_array = da.concatenate((master_array, zeros), axis = 0)
    # Create an integer index series to use to index the reshaped array for re-blocking
    fft_index = np.arange(0, reshape_array.shape[0] - (fft_size - 1),
                          int(fft_size * overlap))
    # Break reshape_array into fft size chunks
    fft_arrays = [reshape_array[x:x + fft_size] for x in fft_index]

    # Returns list of dask arrays
    return [array.rechunk(array.shape) for array in fft_arrays]

# Break master_array into FFT sized arrays with a single chunk in each
fft_arrays = create_fft_arrays(master_array.T, FFT_SIZE, 0.5)

After this, I would go on to compute the frequency response of each FFT array using the da.fft.fft method.
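For reference, a minimal sketch of that next step (the axis argument and the lazy stacking of results are my assumptions, not part of the original code) might look like:

# Hypothetical sketch: FFT each single-chunk array along the time axis,
# then stack the still-lazy results into one dask array
responses = [da.fft.fft(arr, axis = 0) for arr in fft_arrays]
spectra = da.stack(responses)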

Any help or advice would be greatly appreciated,

George

1 Answer:

Answer 0 (score: 0)

Your master array has a very large number of chunks:

>>> master_array = da.random.normal(10, 0.1, size = (703, 57600001), chunks = (703, 372))
>>> master_array.npartitions
154839
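That figure follows directly from the chunk shape: 57,600,001 columns split into blocks of 372 columns each gives

>>> import math
>>> math.ceil(57600001 / 372)
154839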

Each chunk carries some administrative overhead, so it is good to keep the total number of chunks reasonably small. This comes up in the section on chunks of the dask.array documentation.

The bottleneck arises when you try to slice into this array many thousands of times.

Increasing the chunk size should resolve your problem to some extent. The documentation linked above gives some recommendations.
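As an illustration only (the chunk shape here is an assumption, chosen to match the FFT window), rechunking to FFT-sized blocks before calling create_fft_arrays cuts the chunk count by two orders of magnitude:

>>> # One FFT window's worth of columns per chunk instead of 372
>>> master_array = master_array.rechunk((703, 2**15))
>>> master_array.npartitions
1758

Each fft_size slice then touches only one or two chunks instead of roughly ninety, and the overall task graph stays small.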