Question

I am looking for advice on how to handle feature engineering for a "large" dataset

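I am using an LSTM network for a time series forecasting task. I can load the entire dataset (a .csv file) into memory at first and do the initial preprocessing.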

After preprocessing, my dataset has shape (88038, 100).

I now use the function below to generate the datasets for training and validation.

The network is fed 672 past observations, shape (672, 100), to predict the next 96 values, shape (96,).

lstm_data should return x with shape (87270, 672, 100) and y with shape (87270, 96).
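For scale, here is a quick back-of-the-envelope check of those shapes. It is only a sketch: it assumes start_index = 0 and dense float arrays, and `array_gigabytes` is a helper name made up for this note.

```python
import numpy as np

# Shapes taken from the question; start_index = 0 is an assumption.
n_rows, n_features = 88038, 100
history_size, target_size = 672, 96

# Number of windows produced by range(history_size, n_rows - target_size)
n_windows = n_rows - history_size - target_size

def array_gigabytes(shape, dtype):
    """Size in GB (1e9 bytes) of a dense array with this shape and dtype."""
    n_elements = int(np.prod(shape, dtype=np.int64))
    return n_elements * np.dtype(dtype).itemsize / 1e9

x_shape = (n_windows, history_size, n_features)
y_shape = (n_windows, target_size)

for dtype in (np.float64, np.float32):
    name = np.dtype(dtype).name
    print(name, "x:", round(array_gigabytes(x_shape, dtype), 1), "GB",
          "y:", round(array_gigabytes(y_shape, dtype), 3), "GB")
```

With these numbers x alone comes to roughly 47 GB in float64, or about 23.5 GB in float32.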

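import numpy as np

def lstm_data(dataset, target, start_index, history_size,
              target_size):
    data = []
    labels = []

    # The first usable position needs a full history window behind it,
    # the last one needs a full target window ahead of it.
    start_index = start_index + history_size
    end_index = len(dataset) - target_size

    for i in range(start_index, end_index):
        # History window: rows i-history_size .. i-1
        data.append(dataset[i - history_size:i])
        # Target window: the next target_size values
        labels.append(target[i:i + target_size])

    # Stacking copies every window into one dense array
    return np.array(data), np.array(labels)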

The problem is that the resulting dataset is too large for memory, and the program crashes.
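One way to avoid building every window at once is to generate them batch by batch from the in-memory (88038, 100) array. The following is a minimal sketch, not a tested pipeline: `window_batches` and `batch_size` are names invented here, and it assumes dataset is a 2-D NumPy array and target a 1-D NumPy array.

```python
import numpy as np

def window_batches(dataset, target, history_size, target_size,
                   batch_size=256, start_index=0):
    """Yield (x, y) batches of sliding windows, one batch at a time."""
    first = start_index + history_size
    last = len(dataset) - target_size  # exclusive, same bounds as lstm_data

    for batch_start in range(first, last, batch_size):
        # Positions i for which a window pair is built in this batch
        ends = np.arange(batch_start, min(batch_start + batch_size, last))
        # Row indices i-history_size .. i-1, shape (batch, history_size)
        hist_idx = ends[:, None] - history_size + np.arange(history_size)
        # Row indices i .. i+target_size-1, shape (batch, target_size)
        targ_idx = ends[:, None] + np.arange(target_size)
        # Fancy indexing copies only the rows needed for this batch
        yield dataset[hist_idx], target[targ_idx]
```

Each batch has shapes (batch, history_size, n_features) and (batch, target_size), so only one batch of windows exists in memory at any time. A generator like this could, for instance, be wrapped with tf.data.Dataset.from_generator.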

I think I should offload the data to a TFRecord file on disk, but I cannot find an example of how to do that. I have something like this in mind:

import tensorflow as tf

def lstm_data(dataset, target, start_index, history_size,
              target_size):

    start_index = start_index + history_size
    end_index = len(dataset) - target_size

    # tf.io.TFRecordWriter is the TF 2.x name of the writer;
    # the with-block closes the file once the loop is done.
    with tf.io.TFRecordWriter('my_data.tfrecords') as writer:
        for i in range(start_index, end_index):
            x = dataset[i - history_size:i]
            # x has shape (history_size, n_features) and has to be flattened
            # original shape will be recovered when data is loaded back?
            x = x.flatten()

            y = target[i:i + target_size]

            # Float values go in float_list (FloatList), not int64_list
            x_example = tf.train.Example(features=tf.train.Features(feature={
                'x': tf.train.Feature(float_list=tf.train.FloatList(value=x)),
                'label': tf.train.Feature(float_list=tf.train.FloatList(value=y)),
            }))

            writer.write(x_example.SerializeToString())
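On the question in the code comment about recovering the shape: a TFRecord stores only the flat list of floats, so the shape has to be supplied again when parsing. Below is a minimal sketch of that round trip, assuming TensorFlow 2.x. The helper names `serialize_window`, `parse_window` and `load_dataset` are made up for illustration, and the default sizes are the ones stated in the question.

```python
import numpy as np
import tensorflow as tf

# Default sizes from the question: 672 steps of 100 features, 96 targets
HISTORY_SIZE, N_FEATURES, TARGET_SIZE = 672, 100, 96

def serialize_window(x, y):
    """Pack one (x, y) window pair into a serialized tf.train.Example."""
    feature = {
        'x': tf.train.Feature(float_list=tf.train.FloatList(value=x.ravel())),
        'label': tf.train.Feature(float_list=tf.train.FloatList(value=y.ravel())),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

def parse_window(serialized, history_size=HISTORY_SIZE,
                 n_features=N_FEATURES, target_size=TARGET_SIZE):
    """Decode one record and restore x to (history_size, n_features)."""
    spec = {
        'x': tf.io.FixedLenFeature([history_size * n_features], tf.float32),
        'label': tf.io.FixedLenFeature([target_size], tf.float32),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    # The record holds a flat vector; the shape is reapplied here
    x = tf.reshape(parsed['x'], (history_size, n_features))
    return x, parsed['label']

def load_dataset(path, batch_size=64, **shape_kwargs):
    """Stream (x, y) batches from a TFRecord file without loading it whole."""
    ds = tf.data.TFRecordDataset(path)
    ds = ds.map(lambda s: parse_window(s, **shape_kwargs))
    return ds.batch(batch_size).prefetch(1)
```

Note that FloatList stores 32-bit floats, so float64 input is narrowed on write. The dataset returned by load_dataset yields batches of shape (batch, 672, 100) and (batch, 96), which is the layout a Keras LSTM expects.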


0 Answers: