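I am implementing a tf.data input pipeline for a time-series classification problem. The dataset is a time-ordered sequence of more than 2M records with 50 features each; for any given time t, a window of size k is used to make a single class prediction for t. Training runs for 256 full passes over the dataset.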

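For sets of fewer than 500,000 records and a window size of 128, the obvious solution is to prepare a redundant embedded matrix of size 500000x128x50 and randomly sample along its first dimension for every batch; however, that does not help at the actual data size.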
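For reference, the "redundant matrix" approach described above can be sketched in NumPy as follows. This is only an illustration on a small array: the helper names `make_windows` and `sample_batch` are hypothetical, and taking the label of the last record in each window as the target is an assumption based on the many-to-one setup. At float32, the full 500000x128x50 array would hold 3.2e9 values (roughly 12.8 GB), which is why this stops being practical for more than 2M records.

```python
import numpy as np


def make_windows(data, labels, window):
    """Materialise every length-`window` slice of `data` as a real copy."""
    n = data.shape[0]
    starts = np.arange(n - window + 1)[:, None]   # (num_windows, 1)
    offsets = np.arange(window)[None, :]          # (1, window)
    windows = data[starts + offsets]              # (num_windows, window, num_features)
    targets = labels[window - 1:]                 # label of the last record in each window
    return windows, targets


def sample_batch(windows, targets, batch_size, rng):
    """Draw a random batch by sampling along the first dimension."""
    idx = rng.integers(0, windows.shape[0], size=batch_size)
    return windows[idx], targets[idx]


rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 50)).astype(np.float32)    # toy stand-in for the real records
labels = rng.integers(0, 2, size=1000).astype(np.float32)

windows, targets = make_windows(data, labels, window=128)
x, y = sample_batch(windows, targets, batch_size=512, rng=rng)
print(windows.shape, x.shape, y.shape)
```

Because fancy indexing copies the data, memory grows by a factor equal to the window size, which is the redundancy mentioned above.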


The window used for a prediction at time t covers the records t-k, t-k+1, ..., t.






import tensorflow as tf
import numpy as np
import time


time_embed = 128
batch_size = 512
num_features = 50
data_size = 100000

shuffle_buffer_size = batch_size * 100

# Input placeholders (ingest numpy arrays):
input_data = tf.placeholder(tf.float32, shape=[None, num_features])
labels = tf.placeholder(tf.float32, shape=[None,])

input_tensors = {'x': input_data, 'y': labels}

# Dictionary of datasets:
ds_struct = {key: tf.data.Dataset.from_tensor_slices(tz) for key, tz in input_tensors.items()}

# Make time embedding windows:
ds_windowed = {
    key: ds.window(size=time_embed, shift=1, stride=1, drop_remainder=True).flat_map(lambda x: x.batch(time_embed))
    for key, ds in ds_struct.items()
}

# Zip into single dataset:
ds = tf.data.Dataset.zip(ds_windowed)

# Note that shapes of tensors are set to be dynamic except last dimension:
print('structured embedded dataset:\n', ds)

ds = ds.shuffle(shuffle_buffer_size)
ds = ds.batch(batch_size)

# Make an initializable iterator so that different data (train, eval, test) can be fed via the placeholders:
iterator = ds.make_initializable_iterator()

batch = iterator.get_next()
print('batch:\n', batch)

# For many-to-one prediction, discard all but the last embedded target.
# We also need to explicitly reshape x to get a static shape sufficient to pass to some estimator:
truncated_batch = {
    'y': batch['y'][:, -1],
    'x': tf.reshape(batch['x'], [-1, time_embed, batch['x'].shape[-1]])
}

print('final batch:\n', truncated_batch)

