Question

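I'm training a neural network on about five gigabytes of data stored as numpy arrays. The data are split into chunks of 100000 rows, and I've done six cycles of training over all the chunks in random order. Unfortunately, the network has begun to overfit. I think it still has the capacity to fit the data more closely; my suspicion is that internal regularities within each chunk are starting to contradict one another, and that I need to shuffle the data more thoroughly so the network can train on different combinations. I'd like to try this before going to the trouble of getting more training data.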

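Does anyone know a good way to generate a new permutation of 3.6 million (very long) rows of numpy data? I thought about using one of these techniques, but writing these arrays with numpy.savetxt produces unbelievably huge files, and I can't tell how to manipulate a standard npy file in a way that helps solve this problem.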

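Right now, my best idea is to create a permutation of paired indices (c, r) into the data, where c selects a chunk and r selects a row from that chunk. I could store each row in a new preallocated array and then save it. But I wonder whether there's a less horribly I/O-bound solution. Is there some principled way to shuffle random pairs of chunks together until you get a permutation that's statistically independent of the starting permutation?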
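To make the paired-index idea concrete, here is a minimal sketch. It assumes the chunks are already loaded (or memory-mapped) as 2-D arrays with the same number of columns; the name `shuffle_paired_indices` and the toy chunk sizes are made up for illustration.

```python
import numpy as np

def shuffle_paired_indices(chunks, rng):
    """Return one array holding every row of `chunks` in random order."""
    # One (c, r) pair per row: c picks a chunk, r picks a row inside it.
    pairs = np.array([(c, r) for c, chunk in enumerate(chunks)
                      for r in range(chunk.shape[0])])
    rng.shuffle(pairs)  # shuffles the pairs along axis 0, in place

    # Preallocate the output and copy rows one at a time (the I/O-bound part).
    out = np.empty((len(pairs), chunks[0].shape[1]), dtype=chunks[0].dtype)
    for i, (c, r) in enumerate(pairs):
        out[i] = chunks[c][r]
    return out

# Toy stand-in for the real data: 3 chunks of 4 rows, 2 columns each.
chunks = [np.arange(8).reshape(4, 2) + 100 * c for c in range(3)]
shuffled = shuffle_paired_indices(chunks, np.random.default_rng(0))
```

Every row is copied individually, which is exactly the random-access cost I'd like to avoid at full scale.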

Answer 1

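Of the things I've tried so far, a PyTables solution is currently the best, followed by a solution that uses numpy's support for memmapped arrays. The PyTables solution is not straightforward, though. If you use a shuffled array of integers to index a PyTables array directly, it's very slow. The following two-step process is much faster: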

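1. Select a random subset of the array using a boolean index array. This must be done chunkwise; if you pass the index array directly to the PyTables array, it's slow.
   - Preallocate a numpy array and create a list of slices that split the PyTables array into chunks.
   - Read each chunk entirely into memory, then use the corresponding chunk of the index array to select the correct values for that chunk.
   - Store the selected values in the preallocated array.
2. Then shuffle the preallocated array.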

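This process produces a permutation just as random as a normal shuffle would. If that doesn't seem obvious, consider this: (n choose x) * x! = x! * n! / (x! * (n - x)!) = n! / (n - x)!. That is, choosing an unordered subset of x rows and then ordering it yields as many outcomes as drawing x rows in order. The method is fast enough to do a shuffle-on-load for every training cycle. It was also able to compress the data down to about 650 MB, a reduction of almost 90%.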
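The counting identity can be checked numerically. The values n = 10 and x = 4 below are arbitrary, chosen only to keep the numbers small.

```python
from math import comb, factorial, perm

n, x = 10, 4
# Ways to pick an unordered subset of x rows, times the orderings of that subset...
subset_then_shuffle = comb(n, x) * factorial(x)
# ...equals the number of ordered selections of x rows out of n.
ordered_selections = factorial(n) // factorial(n - x)
assert subset_then_shuffle == ordered_selections == perm(n, x)
```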

Here's my current implementation; it is called once for every training chunk in the corpus. (The returned arrays are shuffled elsewhere.)

def _h5_fast_bool_ix(self, h5_array, ix, read_chunksize=100000):
    '''Iterate over an h5 array chunkwise to select a random subset
    of the array. `h5_array` should be the array itself; `ix` should
    be a boolean index array with as many values as `h5_array` has
    rows; and you can optionally set the number of rows to read per
    chunk with `read_chunksize` (default is 100000). For some reason
    this is much faster than using `ix` to index the array directly.'''

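    # Cover every row, including a final partial chunk; stepping with
    # range() also avoids the float that `/` produces on Python 3.
    n_rows = h5_array.shape[0]
    slices = [slice(start, min(start + read_chunksize, n_rows))
              for start in range(0, n_rows, read_chunksize)]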

    a = numpy.empty((ix.sum(), h5_array.shape[1]), dtype=float)
    a_start = 0
    for sl in slices:
        chunk = h5_array[sl][ix[sl]]
        a_end = a_start + chunk.shape[0]
        a[a_start:a_end] = chunk
        a_start = a_end

    return a

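For me, an O(n^2) approach (iterating over the entire PyTables array for every chunk) is faster in this case than an O(n) approach (randomly selecting each row in one pass). But hey, it works. With a bit more indirection, this could be adapted to load arbitrary non-random permutations, but that adds more complexity than it's worth here.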
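The implementation above expects one boolean index array per training chunk, but the code that builds those arrays is not shown. One plausible way to do it, sketched here against a small in-memory array instead of a real PyTables array, is to give every row a random chunk label so that the masks are disjoint and together cover all rows. The names `make_disjoint_masks` and `select_chunkwise` are illustrative assumptions, not part of the original code.

```python
import numpy as np

def make_disjoint_masks(n_rows, n_parts, rng):
    """Split range(n_rows) into n_parts random, non-overlapping boolean masks."""
    # Balanced labels 0..n_parts-1, then permuted so membership is random.
    labels = rng.permutation(np.arange(n_rows) % n_parts)
    return [labels == k for k in range(n_parts)]

def select_chunkwise(array, ix, read_chunksize):
    """Apply boolean mask `ix` to `array` one contiguous slice at a time."""
    out = np.empty((int(ix.sum()), array.shape[1]), dtype=array.dtype)
    start_out = 0
    for start in range(0, array.shape[0], read_chunksize):
        sl = slice(start, start + read_chunksize)
        picked = array[sl][ix[sl]]
        out[start_out:start_out + picked.shape[0]] = picked
        start_out += picked.shape[0]
    return out

rng = np.random.default_rng(0)
data = np.arange(40, dtype=float).reshape(20, 2)  # stands in for the PyTables array
masks = make_disjoint_masks(data.shape[0], n_parts=4, rng=rng)

parts = []
for ix in masks:
    part = select_chunkwise(data, ix, read_chunksize=6)  # step 1: random subset
    rng.shuffle(part)                                    # step 2: shuffle its rows
    parts.append(part)
```

Because the masks partition the rows, each row lands in exactly one training chunk per cycle, and new masks can be drawn for the next cycle.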

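The mmap solution is included here for reference, for anyone who needs a pure numpy solution for whatever reason. It shuffles all the data in about 25 minutes, while the solution above manages the same in less than half that time. It should also scale linearly, because mmap allows (relatively) efficient random access.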

import numpy
import os
import random

X = []
Y = []

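# os.listdir returns names in arbitrary order, so sort them to keep the
# input and output chunks aligned. This assumes matching files sort the
# same way in both directories.
for filename in sorted(os.listdir('input')):
    X.append(numpy.load(os.path.join('input', filename), mmap_mode='r'))

for filename in sorted(os.listdir('output')):
    Y.append(numpy.load(os.path.join('output', filename), mmap_mode='r'))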

indices = [(chunk, row) for chunk, rows in enumerate(X) 
                        for row in range(rows.shape[0])]
random.shuffle(indices)

newchunks = 50
newchunksize = len(indices) // newchunks  # integer division (Python 3)

for i in range(0, len(indices), newchunksize):
    print(i)
    rows = [X[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('X_shuffled_' + str(i), numpy.array(rows))
    rows = [Y[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('Y_shuffled_' + str(i), numpy.array(rows))

Answer 2

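The following assumes your data is already divided into easily retrievable records of some kind. (I don't know whether there's a standard file format for numpy data.)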

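1. Create an index of the data in the form of a dict, mapping each unique record ID (0 through n - 1) to some means of finding the data again. For example, if it's all in one binary file, you'd store a tuple of the form (file_offset, record_length). There is no need to hold on to the data itself.
2. Create a list of n elements containing the keys of the index dict (again, 0 through n - 1).
3. Shuffle the list of record IDs. (Provide your own random number generator, if needed.)
4. Open a new file (or whatever) to contain the shuffled data.
5. Read record IDs out of the list from beginning to end. For each record ID, look up that record's location in the index. Grab the data at that location and append it to the output file.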

Pseudo-code:

# This assumes a binary file of unequal-length
# records.  It also assumes that the file won't
# be changed while we're doing this.

import random

# Create index.
index = {}
rec_offset = 0
for rec_id, record in original_data.iterate_records():
    # This bit depends greatly on how your data
    # is stored...
    rec_length = len(record)
    index[rec_id] = (rec_offset, rec_length)
    rec_offset += rec_length

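# Shuffle.
num_records_indexed = rec_id + 1  # rec_id is still in scope.
records_order = list(range(num_records_indexed))
# random.shuffle works in place and returns None, so don't assign its
# result. For your own RNG, call shuffle on a random.Random(seed) instance.
random.shuffle(records_order)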

# Create new shuffled-data file.
with open("output_file.bin", "wb") as output:
    for rec_id in records_order:
        rec_offset, rec_length = index[rec_id]
        record = original_data.get_rec_at(rec_offset, rec_length)
        output.write(record)

Indexing, shuffling, and de-indexing are all O(n), so the worst part should be the I/O: reading the data and then copying it (a second read, plus a write).
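For a runnable version of the same idea, here is a minimal sketch. It assumes the record lengths are known up front and passed in as a list; real data needs some way to find record boundaries, and how you get them depends on the storage format. The function name `shuffle_records` and the file names are hypothetical.

```python
import os
import random
import tempfile

def shuffle_records(src_path, dst_path, lengths, seed=None):
    """Copy the records of src_path to dst_path in random order.

    `lengths` lists each record's size in bytes, in file order
    (an assumption made for this sketch).
    """
    # Build the index: record ID -> (file_offset, record_length).
    index = {}
    offset = 0
    for rec_id, length in enumerate(lengths):
        index[rec_id] = (offset, length)
        offset += length

    # Shuffle the record IDs in place.
    order = list(index)
    random.Random(seed).shuffle(order)

    # Seek to each record in shuffled order and append it to the output.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for rec_id in order:
            rec_offset, rec_length = index[rec_id]
            src.seek(rec_offset)
            dst.write(src.read(rec_length))
    return order

# Tiny demonstration with four unequal-length records.
records = [b"a", b"bb", b"ccc", b"dddd"]
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.bin")
dst_path = os.path.join(workdir, "output.bin")
with open(src_path, "wb") as f:
    f.write(b"".join(records))

order = shuffle_records(src_path, dst_path, [len(r) for r in records], seed=0)
```

Only the index and the ID list live in memory; the records themselves are streamed from one file to the other.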

Shuffling 5 gigabytes of numpy data uniformly