I have a dataset of variable-length sequences (a TensorFlow TFRecord dataset) to feed into an LSTM network, and I want to compare pre-padding and post-padding within batches, but the current padded_batch function only pads at the end of each sequence. I know the API provides the tf.keras.preprocessing.sequence.pad_sequences function, but I don't know how to apply it inside the dataset batching pipeline. TensorFlow's padded_batch function does both the padding and the batching, and it dynamically finds the padding size needed for each batch. How can I implement that myself? Right now my code looks like this; I read several TFRecord files and interleave them to form a mixed dataset:
featuresDict = {'data': tf.FixedLenFeature([], dtype=tf.string),
                'rows': tf.FixedLenFeature([], dtype=tf.int64),
                'label': tf.FixedLenFeature([], dtype=tf.int64)
               }
def parse_tfrecord(example):
    features = tf.parse_single_example(example, featuresDict)
    label = tf.one_hot(features['label'], N)
    rows = features['rows']
    data = tf.decode_raw(features['data'], tf.int64)
    data = tf.reshape(data, (rows, num_features))
    return data, label
def read_datasets(pattern, numFiles, numEpochs=None, batchSize=None):
    files = tf.data.Dataset.list_files(pattern)

    def _parse(x):
        x = tf.data.TFRecordDataset(x, compression_type='GZIP')
        return x

    dataset = files.interleave(_parse, cycle_length=numFiles, block_length=1).map(parse_tfrecord)
    padded_shapes = (tf.TensorShape([None, num_features]), tf.TensorShape([N,]))
    dataset = dataset.padded_batch(batchSize, padded_shapes)
    dataset = dataset.prefetch(buffer_size=batchSize)
    dataset = dataset.repeat(numEpochs)
    return dataset
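One possible way to keep padded_batch's dynamic per-batch padding while padding at the front (a hedged sketch, not part of the original code; it reuses dataset, batchSize and padded_shapes from read_datasets above) is to reverse each sequence along the time axis before padded_batch and flip the padded batch back afterwards:

# Hedged sketch: pre-padding with dynamic per-batch lengths by reversing each
# sequence before padded_batch and flipping the padded batch back afterwards.
def reverse_time(data, label):
    return tf.reverse(data, axis=[0]), label    # flip the time dimension of one sequence

def unreverse_batch(data, label):
    return tf.reverse(data, axis=[1]), label    # time is axis 1 once the batch dimension exists

pre_padded = (dataset.map(reverse_time)
                     .padded_batch(batchSize, padded_shapes)  # pads zeros after the reversed sequence
                     .map(unreverse_batch))                   # so the zeros end up at the front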
Answer 0 (Score: 0)
I had the same problem as you, and I noticed you also opened an issue on the TensorFlow repository. I managed to solve it with pad_sequences, and I think it works!
import numpy as np
import tensorflow as tf

# Code snippet from https://www.tensorflow.org/guide/data
# Generator that produces the data
def gen_series():
    i = 0
    np.random.seed(0)
    while True:
        size = np.random.randint(0, 10)
        yield np.random.normal(size=(size,))
        i += 1

# Transform the generator into a Dataset
ds_series = tf.data.Dataset.from_generator(gen_series,
                                           output_types=(tf.float32),
                                           output_shapes=((None,)))
# output_shapes is (None,) because each vector has an unknown size

# Take the first 5 samples
print("Before padding")
for vector in ds_series.take(5):
    print(vector)

print("*" * 10)

# Start the transformation
def pad_session(session):
    """
    Pad the sequence at the front (pre-padding) with maxlen=5.
    If any vector is longer than 5, truncate it from the front.
    """
    return tf.keras.preprocessing.sequence.pad_sequences(
        [session.numpy()],
        maxlen=5,
        truncating="pre",
        padding='pre',
        value=0.0,
        dtype=np.float32).squeeze()

def pad_map_fn(session):
    return tf.py_function(pad_session, inp=[session], Tout=(tf.float32))

padded_dataset = ds_series.map(pad_map_fn)

print("After padding")
for pre_padded_vector in padded_dataset.take(5):
    print(pre_padded_vector)
which produces the following output:
Before padding
tf.Tensor([ 0.11849646 0.1139678 0.37025538 1.0405308 -1.5169828 ], shape=(5,), dtype=float32)
tf.Tensor(
[-0.8662762 -0.10321885 0.41059852 0.14404356 1.4542735 0.7610377
0.12167501 0.44386324], shape=(8,), dtype=float32)
tf.Tensor([0.33367434 1.4143772 ], shape=(2,), dtype=float32)
tf.Tensor([-0.12405066 1.1682731 0.94718593], shape=(3,), dtype=float32)
tf.Tensor([-2.5529897], shape=(1,), dtype=float32)
**********
After padding
tf.Tensor([ 0.11849646 0.1139678 0.37025538 1.0405308 -1.5169828 ], shape=(5,), dtype=float32)
tf.Tensor([0.14404356 1.4542735 0.7610377 0.12167501 0.44386324], shape=(5,), dtype=float32)
tf.Tensor([0. 0. 0. 0.33367434 1.4143772 ], shape=(5,), dtype=float32)
tf.Tensor([ 0. 0. -0.12405066 1.1682731 0.94718593], shape=(5,), dtype=float32)
tf.Tensor([ 0. 0. 0. 0. -2.5529897], shape=(5,), dtype=float32)
I want to draw your attention to pad_session: we pass the sequence as [session.numpy()] to pad_sequences, because it expects a 2-dimensional array.
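As a quick standalone illustration of that 2-D requirement (a minimal example, not taken from the original answer):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad_sequences takes a list of sequences, so a single 1-D vector has to be
# wrapped in an outer list first, exactly as [session.numpy()] does above.
padded = pad_sequences([[1.0, 2.0, 3.0]], maxlen=5, padding='pre', dtype='float32')
print(padded)  # [[0. 0. 1. 2. 3.]]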
Maybe there is a better way to solve it, but this is the answer I arrived at.
I hope it helps!
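One small follow-up, assuming the padded dataset is then fed to a model: after the map every sequence has the same length (maxlen=5), so an ordinary batch() is enough and padded_batch is no longer needed. A minimal sketch:

# Hedged sketch: all sequences now have length 5, so plain batching works.
batched = padded_dataset.batch(4)
for batch in batched.take(1):
    print(batch.shape)  # (4, 5)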