Question

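I have thousands of binary files that I must read and store in memory in order to work on the data. I already have a function that reads these data, but I would like to improve it because it is rather slow.

The data are organized this way:

1000 cubes.
Each cube is written in 10 binary files.

Currently I have a read function that reads one cube and returns it as a numpy array (read_1_cube). I then loop over all the files to extract all the cubes and concatenate them.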

import numpy as np

# N (the side length of one sub-block) is assumed to be defined elsewhere.
def read_1_cube( dataNum ):
    ### read the 10 subfiles and concatenate arrays
    N_subfiles = 10
    fnames_subfiles = ( '%d_%d'%(dataNum,k) for k in range(N_subfiles) )
    # np.fromfile accepts a file name directly, so no file handle is left open
    return np.concatenate( [np.fromfile( fn, dtype=float, count=N*N*N ).reshape((N,N,N)) for fn in fnames_subfiles], axis=2 )

TotDataNum = 1000
my_full_data = np.concatenate( [read_1_cube( d ) for d in range( TotDataNum )], axis=0 )

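I tried to use generators to limit the amount of memory used. With these functions it takes around 2.5 s per file, so about 45 minutes for the 1000 files. In the end I will have 10000 files, so this is not feasible (of course, I won't read all 10000 files, but I still can't work if 1000 files take 1 hour).

My questions:

Do you know a way to optimise read_1_cube and the generation of my_full_data?
Do you see a better way (without read_1_cube)?
Another way to optimise: do you know whether there is a concatenate function that works on generators (like sum(), min(), max(), list()...)?

Edit: Following @liborm's comment about np.concatenate, I found other equivalent functions (stack concatenate question): np.r_, np.stack, np.hstack. The good thing is that stack can take a generator as input. So I pushed the generator approach as far as possible, creating the actual data array only at the very end.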

import numpy as np

N_subfiles = 10   # module level so that both functions can see it
# N (the side length of one sub-block) is assumed to be defined elsewhere.

def read_1_cube( dataNum ):
    ### read the 10 subfiles and return a generator of sub-blocks
    fnames_subfiles = ( '%d_%d'%(dataNum,k) for k in range(N_subfiles) )
    return (np.fromfile( fn, dtype=float, count=N*N*N ).reshape((N,N,N)) for fn in fnames_subfiles)

def read_N_cube( datanum ):
    ### make a generator of cubes, each built from a 'cube generator'
    ### (if your NumPy version rejects generators in np.stack, wrap them in list(...))
    C = ( np.stack( read_1_cube( d ), axis=2 ).reshape((N,N,N*N_subfiles)) for d in range(datanum) )
    return np.stack( C ).reshape( (datanum*N,N,N*N_subfiles) )

### The full allocation is done here, just once
TotDataNum = 1000
my_full_data = read_N_cube( TotDataNum )

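It is faster than the first version: the first version needs 2.4 s to read 1 file, while the second takes 6.2 s to read 10 files!

I think there is not much left to optimise, but I am sure there is still a better algorithm out there!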

Answer 1

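To get good performance (in general), you want to allocate as little as possible. The code below should allocate the big array once, up front, and then only each small array as it is read. Using stack or concatenate may (re)allocate memory and copy data around...

I don't have the data to test it, so consider this 'pseudocode':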

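import numpy as np

# N, TotDataNum and N_subfiles are assumed to be defined as in the question.
def read_one(d, i):
    # read one (N, N, N) block from the file '<d>_<i>'
    fn = '%d_%d' % (d, i)
    return np.fromfile(fn, dtype=float, count=N*N*N).reshape((N,N,N))

# allocate the full result once, then fill it slice by slice
res = np.zeros((N * TotDataNum, N, N * N_subfiles))
for dat in range(TotDataNum):
    ax0 = N * dat
    for idx in range(N_subfiles):
        ax2 = N * idx
        res[ax0:ax0+N, :, ax2:ax2+N] = read_one(dat, idx)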
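The preallocation idea can be tried without the real data. The following is only a sketch: the helper names (`write_fake_cubes`, `read_all`), the tiny sizes and the temporary directory are assumptions made for illustration; only the `'%d_%d'` file naming and the slice assignment come from the thread. It writes a few random blocks to disk, fills a preallocated array, and checks the result against the nested-concatenate approach.

```python
import os
import tempfile

import numpy as np


def write_fake_cubes(folder, n_cubes, n_subfiles, n):
    """Write random (n, n, n) float64 blocks as '<cube>_<subfile>' files (test data only)."""
    rng = np.random.default_rng(0)
    for d in range(n_cubes):
        for k in range(n_subfiles):
            rng.random((n, n, n)).tofile(os.path.join(folder, '%d_%d' % (d, k)))


def read_all(folder, n_cubes, n_subfiles, n):
    """Allocate the full array once, then copy each subfile into its slot."""
    res = np.empty((n * n_cubes, n, n * n_subfiles))
    for d in range(n_cubes):
        for k in range(n_subfiles):
            fn = os.path.join(folder, '%d_%d' % (d, k))
            block = np.fromfile(fn, dtype=float, count=n * n * n).reshape((n, n, n))
            res[n * d:n * (d + 1), :, n * k:n * (k + 1)] = block
    return res


with tempfile.TemporaryDirectory() as tmp:
    write_fake_cubes(tmp, n_cubes=3, n_subfiles=4, n=5)
    full = read_all(tmp, n_cubes=3, n_subfiles=4, n=5)
    # Reference result built with nested concatenations, as in the question.
    ref = np.concatenate(
        [np.concatenate(
            [np.fromfile(os.path.join(tmp, '%d_%d' % (d, k)), dtype=float).reshape((5, 5, 5))
             for k in range(4)], axis=2)
         for d in range(3)], axis=0)

print(full.shape)                 # (15, 5, 20)
print(np.array_equal(full, ref))  # True
```

`np.empty` is used here instead of `np.zeros` because every element is overwritten anyway; whether that makes a measurable difference on the real data is untested.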

Answer 2

I haven't fully understood why your change speeds things up, but I doubt that it is the use of stack.

To test this:

In [30]: N=50
In [31]: arr = np.arange(N*N).reshape(N,N)
In [32]: np.stack((arr for _ in range(N*10))).shape
Out[32]: (500, 50, 50)
In [33]: np.concatenate([arr for _ in range(N*10)]).shape
Out[33]: (25000, 50)

Timings:

In [34]: timeit np.stack((arr for _ in range(N*10))).shape
100 loops, best of 3: 2.45 ms per loop
In [35]: timeit np.stack([arr for _ in range(N*10)]).shape
100 loops, best of 3: 2.43 ms per loop
In [36]: timeit np.concatenate([arr for _ in range(N*10)]).shape
1000 loops, best of 3: 1.56 ms per loop

There is no advantage to using a generator expression in stack. concatenate does not accept a generator, but it is faster.
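Outside IPython, roughly the same comparison can be made with the standard `timeit` module. This is a sketch rather than a benchmark: the repeat counts are arbitrary choices, lists are used for both calls, and the numbers will vary with the machine and the NumPy version.

```python
import timeit

import numpy as np

N = 50
arr = np.arange(N * N).reshape(N, N)
arrays = [arr for _ in range(N * 10)]

# Same input list in both cases; only the joining function differs.
t_stack = min(timeit.repeat(lambda: np.stack(arrays), number=100, repeat=3)) / 100
t_concat = min(timeit.repeat(lambda: np.concatenate(arrays), number=100, repeat=3)) / 100

print('stack:       %.2f ms per call' % (t_stack * 1e3))
print('concatenate: %.2f ms per call' % (t_concat * 1e3))
```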

np.stack is just a convenience wrapper around concatenate. Its code is:

arrays = [asanyarray(arr) for arr in arrays]
...
expanded_arrays = [arr[sl] for arr in arrays]
return _nx.concatenate(expanded_arrays, axis=axis)

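It runs 2 list comprehensions over the input argument, once to make sure the items are arrays, and again to add a dimension. That explains why it accepts a generator, and why it is slower.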
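The two steps described above can be reproduced by hand to see that they give the same result as `np.stack`. This is a small illustration with made-up arrays, fixed to `axis=0`; it is not the library source.

```python
import numpy as np

arrays = [np.full((2, 3), i) for i in range(4)]

# Step 1: make sure every input is an array.
as_arrays = [np.asanyarray(a) for a in arrays]
# Step 2: insert a length-1 axis at the stacking position (axis 0 here).
expanded = [a[np.newaxis, ...] for a in as_arrays]
# Step 3: hand the expanded arrays to concatenate.
manual = np.concatenate(expanded, axis=0)

print(manual.shape)                                      # (4, 2, 3)
print(np.array_equal(manual, np.stack(arrays, axis=0)))  # True
```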

concatenate is compiled numpy code, and requires a 'sequence':

TypeError: The first input argument needs to be a sequence
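Since concatenate wants a sequence, the simplest way to feed it from a generator is to materialise the generator first. The helper below is hypothetical (`concatenate_iter` is not a NumPy function) and is only meant to show the pattern.

```python
import numpy as np


def concatenate_iter(blocks, axis=0):
    """Join arrays coming from any iterable (hypothetical helper, not part of NumPy)."""
    # list() consumes the generator, so concatenate receives a real sequence.
    return np.concatenate(list(blocks), axis=axis)


gen = (np.full((2, 3), i) for i in range(4))
joined = concatenate_iter(gen, axis=0)

print(joined.shape)  # (8, 3)
```

Note that the list still holds every block in memory before the final copy, so this saves nothing over a list comprehension. To avoid the extra copy, the blocks have to be written into a preallocated array instead.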

Generator and reading file optimisation
