Question

I am trying to load .h5 files with h5py using multiprocessing. I have 100 .h5 files, each containing two numpy arrays with shapes (100, 40000, 4) and (100, 50, 3).

I am using joblib for the multiprocessing, as shown in the code below:

import h5py
import numpy as np
import pandas as pd  # needed by concatenate_2 below
from multiprocessing import cpu_count
from joblib import Parallel, delayed

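def read(i):
    print('processing %d' % i)
    # '%.h5' is not a valid format string; the files are assumed to be named 0.h5 ... 99.h5
    with h5py.File('%d.h5' % i, 'r') as hf:
        # copy both datasets into memory before the file is closed
        x = np.array(hf.get('x'))
        y = np.array(hf.get('y'))
    #print(x.shape) # (100, 40000, 4)
    #print(y.shape) # (100, 50, 3)
    return x, y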

def batch_read():
    num_cores = cpu_count()
    print('loading arrays using %d cores...' % num_cores)
    # data is a list of (x, y) tuples, one per file, in file order
    data = Parallel(n_jobs=num_cores)(delayed(read)(i) for i in range(100))
    #x, y = concatenate_1(data)
    x, y = concatenate_2(data)
    return x, y

# approach 1
def concatenate_1(data):
    for index, value in enumerate(data):
        if index == 0:
            x = value[0].copy()
            y = value[1].copy()
        else:
            x = np.concatenate((x, value[0]), axis=0)
            y = np.concatenate((y, value[1]), axis=0)
    return x, y

# approach 2 (pd.Panel was deprecated in pandas 0.20 and removed in pandas 1.0)
def concatenate_2(data):
    for index, value in enumerate(data):
        if index == 0:
            x = pd.Panel(value[0])
            y = pd.Panel(value[1])
        else:
            x = pd.concat([x, pd.Panel(value[0])])
            y = pd.concat([y, pd.Panel(value[1])])
    return x, y

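Reading the files this way is much faster. However, once the files have been read, the concatenation seems to take a lot of time. The second approach seems faster than the first, but it raises a deprecation warning when creating pd.Panel.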

So the concatenation is what takes most of the time. I think the problem arises because I am looping over the list returned by joblib.
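Looping over a list is cheap in itself. A more likely cost in approach 1 is that np.concatenate is called inside the loop, so the growing result is copied again on every iteration. A minimal sketch of the usual alternative, assuming data is the list of (x, y) tuples returned by Parallel, is to call np.concatenate once per output. The helper name concatenate_once and the small stand-in arrays are hypothetical and only there for illustration.

```python
import numpy as np

def concatenate_once(data):
    # data: list of (x, y) tuples, one tuple per file
    xs, ys = zip(*data)             # split into a tuple of x arrays and a tuple of y arrays
    x = np.concatenate(xs, axis=0)  # one allocation and one copy for all x arrays
    y = np.concatenate(ys, axis=0)  # same for the y arrays
    return x, y

# Small stand-in arrays; the real shapes are (100, 40000, 4) and (100, 50, 3).
demo = [(np.full((2, 5, 4), i), np.full((2, 3, 3), i)) for i in range(3)]
x, y = concatenate_once(demo)
print(x.shape, y.shape)
```

With this form each input array is copied once, instead of the accumulated result being recopied for every file.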

Is there a better way to do the concatenation?
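One option that may help, sketched here under the assumption that every file holds arrays of the same shape and dtype, is to allocate the final arrays once and copy each block into its slot. The helper name stack_preallocated is hypothetical, and the small demo arrays stand in for the real data.

```python
import numpy as np

def stack_preallocated(data):
    # Hypothetical helper; assumes all x arrays share one shape and dtype, and likewise all y arrays.
    n = len(data)
    x0, y0 = data[0]
    rows_x, rows_y = x0.shape[0], y0.shape[0]
    # Allocate the full outputs up front: n blocks stacked along axis 0.
    x = np.empty((n * rows_x,) + x0.shape[1:], dtype=x0.dtype)
    y = np.empty((n * rows_y,) + y0.shape[1:], dtype=y0.dtype)
    for k, (xk, yk) in enumerate(data):
        # Copy block k into its slice of the output.
        x[k * rows_x:(k + 1) * rows_x] = xk
        y[k * rows_y:(k + 1) * rows_y] = yk
    return x, y

# Small stand-in arrays; the real shapes are (100, 40000, 4) and (100, 50, 3).
demo = [(np.full((2, 5, 4), i, dtype=np.float32), np.full((2, 3, 3), i, dtype=np.float32))
        for i in range(3)]
x_all, y_all = stack_preallocated(demo)
print(x_all.shape, y_all.shape)
```

For scale: 100 files of shape (100, 40000, 4) give 1.6e9 elements, which is about 12.8 GB if the data is float64, so the combined x array may not fit comfortably in memory alongside the per-file copies.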

Reading and concatenating numpy arrays using multiprocessing

0 answers: