Question

Running the sample.sh script in Google Cloud Shell, following the steps of the flowers example, invokes the image-set preprocessing below.

https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/trainer/preprocess.py

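Preprocessing completes successfully on both the eval set and the train set. However, the generated .tfrecord.gz files do not seem to match the number of images in eval_set.csv / train_set.csv.

For example, eval-00000-of-00157.tfrecord.gz suggests there are 158 tfrecord files, while eval_set.csv has 53227 rows. Every row contains a valid image_url (all images are uploaded to Storage) and every row has a valid label.

Is there a way to monitor and control the number of images in each tfrecord file through the preprocess.py configuration?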

Thanks.

Update: this is what made the count work correctly:

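import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

# The files are gzip-compressed, so the iterator needs matching options.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)

# 'url/path' is a placeholder for the directory holding the output files.
pattern = os.path.join('url/path', '*.tfrecord.gz')

# Count every record across all matching files.
print(sum(1 for f in file_io.get_matching_files(pattern)
          for example in tf.python_io.tf_record_iterator(f, options=options)))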

Answer 1

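The file name eval-00000-of-00157.tfrecord.gz means that this is the first of 157 files. Shards are numbered from 00000, so there should be 156 more similarly named files, the last being eval-00156-of-00157.tfrecord.gz. Each file can contain an arbitrary number of records.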
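
To make the naming convention concrete, here is a minimal sketch that splits such a file name into its shard index and shard count. It assumes the usual zero-based -SSSSS-of-NNNNN shard template; parse_shard_name is a hypothetical helper, not part of the flowers sample.

```python
import re

# Assumed template: <prefix>-SSSSS-of-NNNNN<suffix>, with SSSSS zero-based.
SHARD_PATTERN = re.compile(r'-(\d{5})-of-(\d{5})')


def parse_shard_name(filename):
    """Return (shard_index, shard_count) for a sharded file name."""
    match = SHARD_PATTERN.search(filename)
    if match is None:
        raise ValueError('not a sharded file name: %r' % filename)
    return int(match.group(1)), int(match.group(2))


index, total = parse_shard_name('eval-00000-of-00157.tfrecord.gz')
print(index, total)  # 0 157
```

The second number is the total number of files, so the highest index in this example is 156.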

If you want to count the records manually, try the following:

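import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

# tf.python_io is the TensorFlow 1.x API. The files are gzip-compressed,
# so the iterator must be told to decompress them.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)

files = os.path.join('gs://my_bucket/my_dir', 'eval-*.tfrecord.gz')
print(sum(1 for f in file_io.get_matching_files(files)
          for example in tf.python_io.tf_record_iterator(f, options=options)))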
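
If a per-file breakdown is more useful than one grand total, the following is a rough sketch that counts records in each file using only the standard library. It assumes the files have been copied to a local directory (it does not read gs:// paths) and relies on the TFRecord framing of an 8-byte little-endian length, a 4-byte length CRC, the payload, and a 4-byte payload CRC. The CRCs are skipped, not verified, and count_tfrecords and count_per_file are hypothetical helper names.

```python
import glob
import gzip
import struct


def count_tfrecords(path):
    """Count the records in one local gzip-compressed TFRecord file."""
    count = 0
    with gzip.open(path, 'rb') as f:
        while True:
            header = f.read(12)  # 8-byte length + 4-byte CRC of the length
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError('truncated record header in %s' % path)
            (length,) = struct.unpack('<Q', header[:8])
            body = f.read(length + 4)  # payload + 4-byte CRC of the payload
            if len(body) < length + 4:
                raise ValueError('truncated record body in %s' % path)
            count += 1
    return count


def count_per_file(pattern):
    """Map each local file matching the glob pattern to its record count."""
    return {path: count_tfrecords(path) for path in sorted(glob.glob(pattern))}
```

For example, count_per_file('/tmp/flowers/eval-*.tfrecord.gz') (a made-up local path) would return one count per shard, and summing the values gives the overall total.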

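Note that Dataflow makes no guarantees about how input files map to output files, either in the number of files or in the ordering of records (across files and within a file). The total count, however, should be the same.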
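
As a quick sanity check on that last point, the sketch below compares the number of rows in the input CSV with the total number of records. It assumes one example per non-empty CSV row; per_file_counts is a hypothetical mapping from output file name to record count, however those counts were obtained.

```python
import csv


def count_csv_rows(path):
    """Count the non-empty rows of a CSV file."""
    with open(path, newline='') as f:
        return sum(1 for row in csv.reader(f) if row)


def counts_match(csv_rows, per_file_counts):
    """Return (matches, total_records) for the given per-file counts."""
    total = sum(per_file_counts.values())
    return total == csv_rows, total
```

The individual per-file counts are not expected to be equal to each other; only the total is meaningful.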

Number of examples in each tfrecord
