padded_batch with a pre-padding or post-padding option

Time: 2019-10-23 14:44:58

Tags: tensorflow dataset padding

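I have a dataset of variable-length sequences (a TensorFlow TFRecord dataset) that feeds an LSTM network, and I want to compare pre-padding and post-padding within each batch. However, the current padded_batch function only pads at the end of the sequences. I know the API has the tf.keras.preprocessing.sequence.pad_sequences function, but I don't know how to apply it to the dataset batching step. padded_batch in TensorFlow does the padding and the batching together, and it dynamically finds the padding size each batch needs. How can I implement this myself? My code currently looks like this; I read several TFRecord files and interleave them to form a mixed dataset: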

featuresDict = {'data': tf.FixedLenFeature([], dtype=tf.string),
                'rows': tf.FixedLenFeature([], dtype=tf.int64),
                'label': tf.FixedLenFeature([], dtype=tf.int64)
               }

def parse_tfrecord(example):
    features = tf.parse_single_example(example, featuresDict)
    label = tf.one_hot(features['label'],N)
    rows = features['rows']
    data = tf.decode_raw(features['data'], tf.int64)
    data = tf.reshape(data, (rows, num_features))
    return data, label

def read_datasets(pattern, numFiles, numEpochs=None, batchSize=None):
    files = tf.data.Dataset.list_files(pattern)

    def _parse(x):
        x = tf.data.TFRecordDataset(x, compression_type='GZIP')
        return x

    dataset = files.interleave(_parse, cycle_length=numFiles, block_length=1).map(parse_tfrecord)
    padded_shapes = (tf.TensorShape([None, num_features]), tf.TensorShape([N]))
    dataset = dataset.padded_batch(batchSize, padded_shapes)
    dataset = dataset.prefetch(buffer_size=batchSize)
    dataset = dataset.repeat(numEpochs)
    return dataset

1 answer:

Answer 0 (score: 0)

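I had the same problem as you, and I noticed that you also raised an issue on tensorflow. I managed to work around it with pad_sequences, and I think it solves the problem!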

import numpy as np
import tensorflow as tf


# Code snippet from https://www.tensorflow.org/guide/data


# Generator that yields random vectors of varying length (0 to 9 elements)
def gen_series():
    np.random.seed(0)
    while True:
        size = np.random.randint(0, 10)
        yield np.random.normal(size=(size, ))


# Transform the generator to Dataset
ds_series = tf.data.Dataset.from_generator(gen_series,
                                           output_types=(tf.float32),
                                           output_shapes=((None, )))
# output_shapes is (None, ) because the vector is unknown size

# Take first 5 samples

print("Before padding")
for vector in ds_series.take(5):
    print(vector)

print("*" * 10)


# Start to transform


def pad_session(session):
    """
        Pre-pad the sequence with zeros up to maxlen=5.
        If a vector is longer than 5, truncate it from the front
        (keep the last 5 values).
    """
    # np.float was removed from recent NumPy releases; use float32 explicitly.
    return tf.keras.preprocessing.sequence.pad_sequences(
        [session.numpy()],
        maxlen=5,
        truncating="pre",
        padding='pre',
        value=0.0,
        dtype="float32").squeeze()


def pad_map_fn(session):
    return tf.py_function(pad_session, inp=[session], Tout=(tf.float32))


padded_dataset = ds_series.map(pad_map_fn)

print("After padding")
for pre_padded_vector in padded_dataset.take(5):
    print(pre_padded_vector)

This produces the following output:

Before padding
tf.Tensor([ 0.11849646  0.1139678   0.37025538  1.0405308  -1.5169828 ], shape=(5,), dtype=float32)
tf.Tensor(
[-0.8662762  -0.10321885  0.41059852  0.14404356  1.4542735   0.7610377
  0.12167501  0.44386324], shape=(8,), dtype=float32)
tf.Tensor([0.33367434 1.4143772 ], shape=(2,), dtype=float32)
tf.Tensor([-0.12405066  1.1682731   0.94718593], shape=(3,), dtype=float32)
tf.Tensor([-2.5529897], shape=(1,), dtype=float32)
**********
After padding
tf.Tensor([ 0.11849646  0.1139678   0.37025538  1.0405308  -1.5169828 ], shape=(5,), dtype=float32)
tf.Tensor([0.14404356 1.4542735  0.7610377  0.12167501 0.44386324], shape=(5,), dtype=float32)
tf.Tensor([0.         0.         0.         0.33367434 1.4143772 ], shape=(5,), dtype=float32)
tf.Tensor([ 0.          0.         -0.12405066  1.1682731   0.94718593], shape=(5,), dtype=float32)
tf.Tensor([ 0.         0.         0.         0.        -2.5529897], shape=(5,), dtype=float32)
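One thing to keep in mind if the padded vectors are then batched for a model: tf.py_function does not carry static shape information, so the padded elements have an unknown shape as far as the dataset is concerned. The following is only a minimal sketch of one way to restore it with tf.ensure_shape before batching. The names MAXLEN, pad_pre, pad_and_fix_shape and the toy integer sequences are made up for illustration and are not part of the code above.

```python
import tensorflow as tf

MAXLEN = 5  # assumed fixed length, same value as maxlen in pad_session


def pad_pre(seq):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    return tf.keras.preprocessing.sequence.pad_sequences(
        [seq.numpy()], maxlen=MAXLEN, padding="pre", truncating="pre",
        value=0.0, dtype="float32")[0]


def pad_and_fix_shape(seq):
    padded = tf.py_function(pad_pre, inp=[seq], Tout=tf.float32)
    # py_function drops the static shape; declare it explicitly.
    return tf.ensure_shape(padded, [MAXLEN])


# Toy variable-length sequences: [1, 2], [1..7], [1, 2, 3], [1]
lengths = tf.data.Dataset.from_tensor_slices([2, 7, 3, 1])
sequences = lengths.map(lambda n: tf.cast(tf.range(1, n + 1), tf.float32))

batched = sequences.map(pad_and_fix_shape).batch(2)
print(batched.element_spec)  # the batch shape is now (None, 5)

first = next(iter(batched)).numpy()
print(first)
```

Note that this still pads every sequence to the same fixed length, not to the longest sequence of each batch.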

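I want to draw your attention to pad_session: we pass the sequence to pad_sequences as [session.numpy()], because pad_sequences expects a 2-D input (a list of sequences), not a single 1-D vector.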
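To make the 2-D requirement concrete, here is a small standalone sketch (the variable names single, pre and post are only for illustration). Wrapping one vector in a list gives pad_sequences a batch containing a single sequence, and the padding argument switches between pre- and post-padding:

```python
import numpy as np
import tensorflow as tf

pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

single = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# [single] is a list with one sequence, so the result has shape (1, maxlen).
pre = pad_sequences([single], maxlen=5, padding="pre",
                    value=0.0, dtype="float32")
post = pad_sequences([single], maxlen=5, padding="post",
                     value=0.0, dtype="float32")

print(pre.shape)  # (1, 5)
print(pre[0])     # zeros first, then 1, 2, 3
print(post[0])    # 1, 2, 3 first, then zeros
```

This is why pad_session calls .squeeze() at the end: it drops the leading batch dimension of size 1 again.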

There may be a better way to solve this, but this is the answer I came up with.
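One possible alternative, which is not what the code above does, is to stay inside tf.data and keep the per-batch dynamic padding size of padded_batch: reverse each sequence along the time axis, let padded_batch post-pad, then reverse each batch along the time axis again so the padding ends up at the front. This is only a sketch under assumptions: the toy generator gen and the batch size of 2 are made up, and it relies on from_generator accepting output_signature, which needs a reasonably recent TensorFlow 2 release.

```python
import tensorflow as tf


def gen():
    # Toy variable-length sequences, purely for illustration.
    yield [1.0, 2.0, 3.0]
    yield [4.0]
    yield [5.0, 6.0]
    yield [7.0, 8.0, 9.0, 10.0]


ds = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=(None,), dtype=tf.float32))

pre_padded = (
    ds.map(lambda x: tf.reverse(x, axis=[0]))    # flip time axis per sequence
      .padded_batch(2, padded_shapes=[None])     # post-pad to the batch max
      .map(lambda b: tf.reverse(b, axis=[1]))    # flip back: padding is first
)

batches = [b.numpy().tolist() for b in pre_padded]
for b in batches:
    print(b)
# [[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]]
# [[0.0, 0.0, 5.0, 6.0], [7.0, 8.0, 9.0, 10.0]]
```

For data shaped (rows, num_features) as in the question, the same idea would reverse axis 0 before batching and axis 1 after batching, leaving the feature axis untouched.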

Hope it helps!