TensorFlow tf.data.Dataset API: is there a dataset unzip function?

Date: 2018-12-05 22:45:26

Tags: tensorflow tensorflow-datasets

TensorFlow 1.12 has a Dataset.zip function, documented here.

However, I would like to know whether there is a dataset unzip function that returns the original two datasets.

# NOTE: The following examples use `{ ... }` to represent the
# contents of a dataset.
a = { 1, 2, 3 }
b = { 4, 5, 6 }
c = { (7, 8), (9, 10), (11, 12) }
d = { 13, 14 }

# The nested structure of the `datasets` argument determines the
# structure of elements in the resulting dataset.
Dataset.zip((a, b)) == { (1, 4), (2, 5), (3, 6) }
Dataset.zip((b, a)) == { (4, 1), (5, 2), (6, 3) }

# The `datasets` argument may contain an arbitrary number of
# datasets.
Dataset.zip((a, b, c)) == { (1, 4, (7, 8)),
                            (2, 5, (9, 10)),
                            (3, 6, (11, 12)) }

# The number of elements in the resulting dataset is the same as
# the size of the smallest dataset in `datasets`.
Dataset.zip((a, d)) == { (1, 13), (2, 14) }

I would like something like the following:

dataset = Dataset.zip((a, d)) == { (1, 13), (2, 14) }
a, d = dataset.unzip()

4 Answers:

Answer 0 (score: 4)

My workaround was to just use map. I am not sure whether a syntactic-sugar unzip function will be of interest later.

a = dataset.map(lambda a, b: a)
b = dataset.map(lambda a, b: b)
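As a minimal runnable sketch of the workaround above, assuming TensorFlow 2.x with eager execution, the following rebuilds the `a` and `d` datasets from the question, zips them, and projects each component back out with map. The variable names `zipped`, `a_out`, and `d_out` are illustrative only.

```python
import tensorflow as tf

# Two source datasets of different lengths, mirroring `a` and `d` in the question.
a = tf.data.Dataset.from_tensor_slices([1, 2, 3])
d = tf.data.Dataset.from_tensor_slices([13, 14])

# zip() stops at the shorter dataset, so the third element of `a` is dropped.
zipped = tf.data.Dataset.zip((a, d))

# "Unzip" by projecting each tuple component with map().
a_out = zipped.map(lambda x, y: x)
d_out = zipped.map(lambda x, y: y)

a_values = [int(v) for v in a_out.as_numpy_iterator()]
d_values = [int(v) for v in d_out.as_numpy_iterator()]
print(a_values)  # [1, 2]
print(d_values)  # [13, 14]
```

Note that the projected `a_out` holds only the elements that survived the zip, so this kind of unzip cannot recover elements that zip already truncated.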

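Answer 1 (score: 1)

Building on Ouwen Huang's answer, this function seems to work for datasets whose elements are dicts with any number of keys: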

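def split_datasets(dataset):
    # Assumes each element is a dict; returns one dataset per key.
    # `output_shapes` is the TF 1.x property; in TF 2.x use `dataset.element_spec`.
    subsets = {}
    names = list(dataset.output_shapes.keys())
    for name in names:
        # map() traces the lambda right away, so `name` is captured per iteration.
        subsets[name] = dataset.map(lambda x: x[name])

    return subsets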
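For TensorFlow 2.x, where `Dataset.output_shapes` is no longer available, a hedged variant of the same idea can read the keys from `dataset.element_spec` instead. The names `split_dict_dataset` and `select` are hypothetical helpers introduced for this sketch, and it only handles dict-structured elements.

```python
import tensorflow as tf


def split_dict_dataset(dataset):
    """Split a dict-structured dataset into one dataset per key (TF 2.x sketch)."""
    spec = dataset.element_spec
    if not isinstance(spec, dict):
        raise TypeError("expected a dataset whose elements are dicts")

    def select(key):
        # Build a separate projection function for each key.
        return lambda features: features[key]

    return {key: dataset.map(select(key)) for key in spec}


features = tf.data.Dataset.from_tensor_slices({"x": [1, 2, 3], "y": [4, 5, 6]})
parts = split_dict_dataset(features)

x_values = [int(v) for v in parts["x"].as_numpy_iterator()]
y_values = [int(v) for v in parts["y"].as_numpy_iterator()]
print(x_values)  # [1, 2, 3]
print(y_values)  # [4, 5, 6]
```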

Answer 2 (score: 0)

Following up on Ouwen Huang's answer:

For TensorFlow 2, if the map approach raises an error, use the following instead:

a = dataset.interleave(lambda a,b: tf.data.Dataset.from_tensors(a))
b = dataset.interleave(lambda a,b: tf.data.Dataset.from_tensors(b))

This does the same thing as map. The key is to use tf.data.Dataset.from_tensors.
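A small sketch of the interleave variant, assuming TensorFlow 2.x with eager execution. Passing `cycle_length=1` is an addition made here, not part of the answer: it makes interleave read the one-element inner datasets one at a time, so the original order is plainly preserved.

```python
import tensorflow as tf

zipped = tf.data.Dataset.zip((
    tf.data.Dataset.from_tensor_slices([1, 2, 3]),
    tf.data.Dataset.from_tensor_slices([4, 5, 6]),
))

# Each component is wrapped in a one-element dataset by from_tensors(),
# and interleave() flattens those inner datasets back into a single one.
first = zipped.interleave(
    lambda x, y: tf.data.Dataset.from_tensors(x), cycle_length=1)
second = zipped.interleave(
    lambda x, y: tf.data.Dataset.from_tensors(y), cycle_length=1)

first_values = [int(v) for v in first.as_numpy_iterator()]
second_values = [int(v) for v in second.as_numpy_iterator()]
print(first_values)   # [1, 2, 3]
print(second_values)  # [4, 5, 6]
```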

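Answer 3 (score: 0)

I wrote a more general unzip function for tf.data.Dataset pipelines. It also handles the recursive case, where a pipeline has multiple levels of zipping.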

import tensorflow as tf


def tfdata_unzip(
    tfdata: tf.data.Dataset,
    *,
    recursive: bool=False,
    eager_numpy: bool=False,
    num_parallel_calls: int=tf.data.AUTOTUNE,
):
    """
    Unzip a zipped tf.data pipeline.

    Args:
        tfdata: the :py:class:`tf.data.Dataset`
            to unzip.

        recursive: Set to ``True`` to recursively unzip
            multiple layers of zipped pipelines.
            Defaults to ``False``.

        eager_numpy: Set this to ``True`` to return
            Python lists of primitive types or
            :py:class:`numpy.ndarray` objects. Defaults
            to ``False``.

        num_parallel_calls: The level of parallelism to
            use each time we ``map()`` over a
            :py:class:`tf.data.Dataset`.

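    Returns:
        For a dataset of single tensors, the dataset
        itself (or a list of its elements if
        ``eager_numpy`` is ``True``). Otherwise a Python
        list of either :py:class:`tf.data.Dataset` or
        NumPy arrays, one entry per zipped component.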
    """
    if isinstance(tfdata.element_spec, tf.TensorSpec):
        if eager_numpy:
            return list(tfdata.as_numpy_iterator())
        return tfdata

    def tfdata_map(i: int) -> tf.data.Dataset:
        return tfdata.map(
            lambda *cols: cols[i],
            deterministic=True,
            num_parallel_calls=num_parallel_calls,
        )

    if isinstance(tfdata.element_spec, tuple):
        num_columns = len(tfdata.element_spec)
        if recursive:
            return [
                tfdata_unzip(
                    tfdata_map(i),
                    recursive=recursive,
                    eager_numpy=eager_numpy,
                    num_parallel_calls=num_parallel_calls,
                )
                for i in range(num_columns)
            ]
        else:
            return [
                tfdata_map(i)
                for i in range(num_columns)
            ]

    raise ValueError(
        "Unknown tf.data.Dataset element_spec: " +
        str(tfdata.element_spec)
    )

Given these example datasets, tfdata_unzip() works as follows:

>>> import numpy as np

>>> baby = tf.data.Dataset.from_tensor_slices([
    np.array([1,2]),
    np.array([3,4]),
    np.array([5,6]),
])
>>> baby.element_spec
TensorSpec(shape=(2,), dtype=tf.int64, name=None)

>>> parent = tf.data.Dataset.zip((baby, baby))
>>> parent.element_spec
(TensorSpec(shape=(2,), dtype=tf.int64, name=None),
 TensorSpec(shape=(2,), dtype=tf.int64, name=None))

>>> grandparent = tf.data.Dataset.zip((parent, parent))
>>> grandparent.element_spec
((TensorSpec(shape=(2,), dtype=tf.int64, name=None),
  TensorSpec(shape=(2,), dtype=tf.int64, name=None)),
 (TensorSpec(shape=(2,), dtype=tf.int64, name=None),
  TensorSpec(shape=(2,), dtype=tf.int64, name=None)))

Here is what tfdata_unzip() returns for the baby, parent, and grandparent datasets above:

>>> tfdata_unzip(baby)
<TensorSliceDataset shapes: (2,), types: tf.int64>

>>> tfdata_unzip(parent)
[<ParallelMapDataset shapes: (2,), types: tf.int64>,
 <ParallelMapDataset shapes: (2,), types: tf.int64>]

>>> tfdata_unzip(grandparent)
[<ParallelMapDataset shapes: ((2,), (2,)), types: (tf.int64, tf.int64)>,
 <ParallelMapDataset shapes: ((2,), (2,)), types: (tf.int64, tf.int64)>]

>>> tfdata_unzip(grandparent, recursive=True)
[[<ParallelMapDataset shapes: (2,), types: tf.int64>,
  <ParallelMapDataset shapes: (2,), types: tf.int64>],
 [<ParallelMapDataset shapes: (2,), types: tf.int64>,
  <ParallelMapDataset shapes: (2,), types: tf.int64>]]

>>> tfdata_unzip(grandparent, recursive=True, eager_numpy=True)
[[[array([1, 2]), array([3, 4]), array([5, 6])],
  [array([1, 2]), array([3, 4]), array([5, 6])]],
 [[array([1, 2]), array([3, 4]), array([5, 6])],
  [array([1, 2]), array([3, 4]), array([5, 6])]]]