Reading a dataset from files where some files may be missing

Date: 2019-07-17 21:24:46

Tags: python tensorflow tensorflow-datasets

I am trying to load files into a TensorFlow dataset, where some of the files may be missing (in which case I want to replace them with zeros).

The directory structure I am reading from looks like this:

   |-data
   |---sensor_A
   |-----1.dat
   |-----2.dat
   |-----3.dat
   |---sensor_B
   |-----1.dat
   |-----2.dat
   |-----3.dat

The .dat files are .csv files that use a space as the separator. Each file holds a single multi-row observation, where the number of columns is constant (e.g. 4) and the number of rows is unknown (time-series data).
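For illustration, a hypothetical 1.dat with 4 columns and 3 rows (the values are made up) could look like this:

   0.1 0.2 0.3 0.4
   0.5 0.6 0.7 0.8
   0.9 1.0 1.1 1.2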

I have already managed to read each sensor's data into a separate TensorFlow dataset with the following code:

import os
import tensorflow as tf

tf.enable_eager_execution()

data_root_dir = "data"

modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]

for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]

    dataset = tf.data.Dataset.from_tensor_slices((filenames,))


    def _parse_function_internal(filename):
        number_of_columns = 4
        single_observation = tf.read_file(filename)
        # Tokenise every value so we can cast the tokens to floats later.
        # tf.string_split treats every character of sep as a delimiter,
        # so this splits on CR, LF and space.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break

This successfully prints the contents of all three files for each sensor.
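With the hypothetical 1.dat above, each element is a (filename, parsed content) pair, and the printed output looks roughly like this (format approximate):

tf.Tensor(b'data/sensor_A/1.dat', shape=(), dtype=string)
tf.Tensor(
[[0.1 0.2 0.3 0.4]
 [0.5 0.6 0.7 0.8]
 [0.9 1.  1.1 1.2]], shape=(3, 4), dtype=float32)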

My problem is that some timestamps in the dataset may be missing. For example, if the file 1.dat in the sensor_A directory is missing, this error appears:

tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: mock_data\sensor_A\1.dat : The system cannot find the file specified.
; No such file or directory
     [[{{node ReadFile}}]] [Op:IteratorGetNextSync]
It is thrown at this line:

for el in dataset:

What I tried was to surround the call to tf.read_file() with a try block, but that obviously does not work: the error is not thrown when tf.read_file() is called, but when the value is fetched from the dataset. Later I want to pass this dataset to a Keras model, so I cannot simply surround it with a try block. Is there any workaround? Is this even supported?
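For reference, this is a sketch of the failed attempt: the except branch can never run, because map() only traces the function to build a graph, and the NotFoundError is raised later, when the iterator actually pulls the element.

def _parse_function_internal(filename):
    number_of_columns = 4
    try:
        # No error is raised here: map() only traces this function,
        # the file is actually opened when the dataset is iterated.
        single_observation = tf.read_file(filename)
    except tf.errors.NotFoundError:
        # Never reached, so the zero fallback never happens.
        single_observation = ' '.join(['0.0'] * number_of_columns)
    ...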

Thanks!

1 answer:

Answer 0 (score: 0)

I managed to solve the problem and am sharing the solution in case anyone else struggles with this too. I had to use an additional list of booleans that specifies whether each file actually exists, and pass it to the mapper. Then, with the tf.cond() function, we decide whether to read the file or to mock the data with zeros (or any other logic).

import os
import tensorflow as tf

tf.enable_eager_execution()

data_root_dir = "data"

modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]

for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]
    files_exist = [os.path.isfile(filename) for filename in filenames]

    dataset = tf.data.Dataset.from_tensor_slices((filenames, files_exist))


    def _parse_function_internal(filename, file_exist):
        number_of_columns = 4
        # If the file exists, read it; otherwise mock a single all-zero row
        # as a string, so the parsing below works the same on both branches.
        single_observation = tf.cond(
            file_exist,
            lambda: tf.read_file(filename),
            lambda: ' '.join(['0.0'] * number_of_columns))
        # Tokenise every value so we can cast these to floats later.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break
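As a design note, the string mock keeps both tf.cond() branches the same dtype (string), so the parsing code stays shared. An alternative, untested sketch (with a hypothetical function name) builds the zero row as a tensor directly and skips the string round-trip for missing files; tf.cond() merges the branch shapes (-1, 4) and (1, 4) into a compatible dynamic shape:

def _parse_function_tensor(filename, file_exist):
    number_of_columns = 4

    def _read_and_parse():
        raw = tf.read_file(filename)
        tokens = tf.string_split([raw], sep='\r\n ').values
        return tf.strings.to_number(
            tf.reshape(tokens, (-1, number_of_columns)), tf.float32)

    # Both branches return float32 tensors; the row count stays dynamic.
    single_observation = tf.cond(
        file_exist,
        _read_and_parse,
        lambda: tf.zeros((1, number_of_columns), tf.float32))
    return filename, single_observation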