I am trying to load files into a TensorFlow dataset, where some of the files may be missing (in which case I want to replace them with zeros).
The directory structure I am reading from looks like this:
|-data
|---sensor_A
|-----1.dat
|-----2.dat
|-----3.dat
|---sensor_B
|-----1.dat
|-----2.dat
|-----3.dat
The .dat files are CSV-like files with a space as the delimiter. Each file contains a single multi-row observation, where the number of columns is constant (e.g. 4) and the number of rows is unknown (time-series data).
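For illustration, the contents of a single .dat file with three rows could look like this (hypothetical values, four space-separated columns per row):

0.1 0.2 0.3 0.4
1.1 1.2 1.3 1.4
2.1 2.2 2.3 2.4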
I have successfully read each sensor's data into a separate TensorFlow dataset using the following code:
import os
import tensorflow as tf

tf.enable_eager_execution()

data_root_dir = "data"
modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]

for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]

    dataset = tf.data.Dataset.from_tensor_slices((filenames,))

    def _parse_function_internal(filename):
        number_of_columns = 4
        single_observation = tf.read_file(filename)
        # Tokenise every value so we can cast these to floats later.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break
This successfully prints the contents of all three files for each sensor.
My problem is that some of the timestamps in the dataset may be missing. For example, if the file 1.dat in the sensor_A directory is missing, I get this error:
tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: mock_data\sensor_A\1.dat : The system cannot find the file specified.
; No such file or directory
[[{{node ReadFile}}]] [Op:IteratorGetNextSync]
which is thrown in this line:
for el in dataset:
What I tried to do was to wrap the call to tf.read_file() in a try block, but obviously that doesn't work, because the error is not thrown when tf.read_file() is called, but only later, when the value is fetched from the dataset. Later on I want to pass this dataset to a Keras model, so I can't simply wrap that in a try block either. Is there any workaround? Is this even supported?
Thanks!
Answer 0 (score: 0)
I managed to solve the problem and am sharing the solution in case someone else struggles with it too. I had to use an additional list of booleans specifying whether each file actually exists and pass it to the mapper. Then, using the tf.cond() function, we decide whether to read the file or mock the data with zeros (or any other logic).
import os
import tensorflow as tf

tf.enable_eager_execution()

data_root_dir = "data"
modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]

for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]
    # Check on the Python side which files actually exist.
    files_exist = [os.path.isfile(filename) for filename in filenames]

    dataset = tf.data.Dataset.from_tensor_slices((filenames, files_exist))

    def _parse_function_internal(filename, file_exist):
        number_of_columns = 4
        # Read the file if it exists, otherwise mock a single row of zeros.
        single_observation = tf.cond(file_exist, lambda: tf.read_file(filename), lambda: ' '.join(['0.0'] * number_of_columns))
        # Tokenise every value so we can cast these to floats later.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break
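Since the goal mentioned in the question is to feed this dataset into a Keras model, here is a minimal sketch (not part of the original solution) of one way to batch the variable-length observations with padded_batch; the batch size, dropping the filename, and the padded shape (None, 4) are assumptions for illustration only:

# Hypothetical follow-up: keep only the parsed observation and pad the
# variable-length time series so the dataset can be batched.
batch_size = 2
model_dataset = dataset.map(lambda filename, observation: observation)
model_dataset = model_dataset.padded_batch(batch_size, padded_shapes=tf.TensorShape([None, 4]))

for batch in model_dataset:
    # Each batch has shape (batch_size, longest_sequence_in_batch, 4).
    print(batch.shape)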