Using the window() function in a TensorFlow Dataset to access multiple rows

Date: 2019-07-02 18:04:33

Tags: python tensorflow tensorflow-datasets

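I am having trouble converting a dataset read from .csv files with tf.data.experimental.CsvDataset into a "time series".

What I want is to access several rows of the dataset at once, so that I can append the features of the two previous rows to the current row while keeping the current row's label. I want to do this for every row except the first two. I assumed that applying window() was the right approach, but now I am not sure.

The original dataset, which has roughly 300 columns, is created by reading a set of .csv files like this: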

ds = tf.data.experimental.CsvDataset(
    filenames,
    [tf.float32] * len(columns_indices_to_parse),
    header=True,
    select_cols=columns_indices_to_parse
)

For reproducibility, I simulate it here with a combination of Dataset.from_tensor_slices() and Dataset.zip():

import tensorflow as tf

tf.enable_eager_execution()

with tf.Graph().as_default(), tf.Session() as sess:
    # Simulate what's being returned from CsvDataset():
    feature_1_ds = tf.data.Dataset.from_tensor_slices([1., 3., 5., 7., 9.])
    feature_2_ds = tf.data.Dataset.from_tensor_slices([2., 4., 6., 8., 10.])
    label_1_ds = tf.data.Dataset.from_tensor_slices([1.0, 1.0, 0.0, 1.0, 0.0])

    ds = tf.data.Dataset.zip((feature_1_ds, feature_2_ds, label_1_ds))

    # Do transformations to obtain "timeseries" data.
    def _parse_function_features(*row):
        features = tf.stack(row[:2], axis=-1)
        return features

    def _parse_function_labels(*row):
        labels = tf.stack(row[2:], axis=-1)
        return labels

    def _reshape(x):
        # Flatten rows into one.
        return tf.reshape(x, shape=[-1])

    ds_features = ds.map(_parse_function_features).window(3).flat_map(lambda x: x.batch(3)).map(_reshape)
    ds_labels = ds.map(_parse_function_labels).skip(2)
    ds = tf.data.Dataset.zip((ds_features, ds_labels))

    next_element = ds.make_one_shot_iterator().get_next()
    # Show dataset contents
    print('Result:')
    while True:
        try:
            print(sess.run(next_element))
        except tf.errors.OutOfRangeError:
            break

I am still getting familiar with the window() transformation. I came across this GitHub issue, but it does not solve my problem.

What I currently get is:

(array([1., 2., 3., 4., 5., 6.], dtype=float32), array([0.], dtype=float32))
(array([ 7.,  8.,  9., 10.], dtype=float32), array([1.], dtype=float32))

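The problem is that it behaves like batching: it processes the rows in disjoint groups of three. What I am trying to achieve is the following: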

(array([1., 2., 3., 4., 5., 6.], dtype=float32), array([0.], dtype=float32)) # with label of the third row
(array([3., 4., 5., 6., 7., 8.], dtype=float32), array([1.], dtype=float32)) # with label of the fourth row
(array([5., 6., 7., 8., 9., 10.], dtype=float32), array([0.], dtype=float32)) # with label of the fifth row

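I am somewhat stuck, and I am not sure whether window() is the right way to access multiple rows of a dataset. I asked a very similar question earlier but deleted it because I think I included too much detail; here I have tried to keep it as concise as possible. Any help would be much appreciated, thanks!

1 Answer:

Answer 0: (score: 0)

After attacking the problem from several angles, I finally managed to get the desired result. I have two solutions: one handles the features and labels as separate datasets, the other applies the transformations to the whole dataset in a single pass. Either may be useful, depending on the use case.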
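Both solutions rely on passing shift=1 to window(). As a rough illustration of what that changes, the following pure-Python sketch models how window(size, shift) groups elements. It is not TensorFlow code: the helper name model_window is hypothetical, and it assumes that shift defaults to size and that shorter windows at the end are kept.

```python
def model_window(rows, size, shift=None):
    """Hypothetical pure-Python model of Dataset.window(size, shift).

    Assumptions: shift defaults to size, and shorter windows at the
    end of the input are kept rather than dropped.
    """
    if shift is None:
        shift = size
    return [rows[start:start + size] for start in range(0, len(rows), shift)]


# Feature rows from the example: (feature_1, feature_2) per CSV row.
rows = [(1., 2.), (3., 4.), (5., 6.), (7., 8.), (9., 10.)]

# Default shift: disjoint groups, which is the "batching" seen in the question.
disjoint = model_window(rows, size=3)

# shift=1: a new window starts at every row, so consecutive windows overlap.
overlapping = model_window(rows, size=3, shift=1)

print([len(w) for w in disjoint])     # [3, 2]
print([len(w) for w in overlapping])  # [3, 3, 3, 2, 1]
```

In this model the default call yields two groups (three rows, then the remaining two), which matches the two results shown in the question, while shift=1 yields one window per starting row.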

  1. Handling the features and labels as separate datasets:
import tensorflow as tf

tf.enable_eager_execution()

with tf.Graph().as_default(), tf.Session() as sess:
    # Simulate what's being returned from CsvDataset():
    feature_1_ds = tf.data.Dataset.from_tensor_slices([1., 3., 5., 7., 9.])
    feature_2_ds = tf.data.Dataset.from_tensor_slices([2., 4., 6., 8., 10.])
    label_1_ds = tf.data.Dataset.from_tensor_slices([1.0, 1.0, 0.0, 1.0, 0.0])

    ds = tf.data.Dataset.zip((feature_1_ds, feature_2_ds, label_1_ds))

    # Do transformations to obtain "timeseries" data.
    def _parse_function_features(*row):
        features = tf.stack(row[:2], axis=-1)
        return features

    def _parse_function_labels(*row):
        labels = tf.stack(row[2:], axis=-1)
        return labels

    def _reshape(x):
        # Flatten rows into one.
        return tf.reshape(x, shape=[-1])

    ds_features = ds.map(_parse_function_features).window(3, shift=1).flat_map(lambda x: x.batch(3)).map(_reshape)
    ds_labels = ds.map(_parse_function_labels).window(3, shift=1).flat_map(lambda x: x.skip(2))
    ds = tf.data.Dataset.zip((ds_features, ds_labels))

    next_element = ds.make_one_shot_iterator().get_next()
    # Show dataset contents
    print('Result:')
    while True:
        try:
            print(sess.run(next_element))
        except tf.errors.OutOfRangeError:
            break
  2. Transforming the dataset in a single pass:
import tensorflow as tf

tf.enable_eager_execution()

with tf.Graph().as_default(), tf.Session() as sess:
    # Simulate what's being returned from CsvDataset():
    feature_1_ds = tf.data.Dataset.from_tensor_slices([1., 3., 5., 7., 9.])
    feature_2_ds = tf.data.Dataset.from_tensor_slices([2., 4., 6., 8., 10.])
    label_1_ds = tf.data.Dataset.from_tensor_slices([1.0, 1.0, 0.0, 1.0, 0.0])

    ds = tf.data.Dataset.zip((feature_1_ds, feature_2_ds, label_1_ds))

    # Do transformations to obtain "timeseries" data.
    def _parse_function(*row):
        features = tf.stack(row[:2], axis=-1)
        labels = tf.stack(row[2:], axis=-1)
        return features, labels


    def _reshape(features, labels):
        # Flatten features into one row.
        return tf.reshape(features, shape=[-1]), labels


    ds = ds.map(_parse_function)
    ds = ds.window(3, shift=1)
    ds = ds.flat_map(lambda x, y: tf.data.Dataset.zip((x.batch(3), y.skip(2))))
    ds = ds.map(_reshape)

    next_element = ds.make_one_shot_iterator().get_next()
    # Show dataset contents
    print('Result:')
    while True:
        try:
            print(sess.run(next_element))
        except tf.errors.OutOfRangeError:
            break

For both, the output is:

Result:
(array([1., 2., 3., 4., 5., 6.], dtype=float32), array([0.], dtype=float32))
(array([3., 4., 5., 6., 7., 8.], dtype=float32), array([1.], dtype=float32))
(array([ 5.,  6.,  7.,  8.,  9., 10.], dtype=float32), array([0.], dtype=float32))
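Only three rows appear in the output even though windows starting at the last two rows are shorter than three. A plausible reading of the single-pass solution is that y.skip(2) yields nothing for a window with fewer than three rows, and Dataset.zip stops at the shorter input, so those windows contribute no examples. The sketch below models that logic in plain Python under this assumption; build_examples is a hypothetical helper, not part of TensorFlow.

```python
def build_examples(rows, size=3):
    """Hypothetical model of window(size, shift=1) followed by
    zip((x.batch(size), y.skip(size - 1))) on each window."""
    examples = []
    for start in range(len(rows)):
        window = rows[start:start + size]
        # Model of x.batch(size) plus reshape: all feature values in one flat row.
        features = [value for feats, _ in window for value in feats]
        # Model of y.skip(size - 1): only the last label, and only if the window is full.
        labels = [label for _, label in window][size - 1:]
        # Model of zip: a short window has no label left, so it yields nothing.
        for label in labels:
            examples.append((features, label))
    return examples


# ((feature_1, feature_2), label) per row, as in the example above.
rows = [((1., 2.), 1.), ((3., 4.), 1.), ((5., 6.), 0.),
        ((7., 8.), 1.), ((9., 10.), 0.)]

examples = build_examples(rows)
for features, label in examples:
    print(features, label)
```

To make this explicit instead of relying on zip, window() also accepts drop_remainder=True, which discards the shorter trailing windows up front.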